Metadata-Version: 2.4
Name: blackmagic-retrieval
Version: 0.1.0
Summary: Cross-lingual sparse retrieval + reasoning + GA imagination over typed concept anchors (EN/JA/KO).
Author-email: Eden Duthie <edenduthie@agenticx.org>
License-Expression: Apache-2.0
Project-URL: Homepage, https://huggingface.co/cp500/opensearch-neural-sparse-en-jp-ko
Project-URL: Model, https://huggingface.co/cp500/opensearch-neural-sparse-en-jp-ko
Project-URL: Dataset, https://huggingface.co/datasets/cp500/multilingual-concept-training-kit
Keywords: sparse-retrieval,splade,cross-lingual,multilingual,concept-bottleneck,dempster-shafer,mcts,genetic-algorithm,rag,retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.40
Requires-Dist: numpy>=1.24
Requires-Dist: huggingface_hub>=0.20
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov>=4; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# BlackMagic

Cross-lingual sparse retrieval, reasoning, and genetic-algorithm (GA)
imagination over typed concept anchors. The default encoder is a multilingual
SPLADE fine-tune
([`cp500/opensearch-neural-sparse-en-jp-ko`](https://huggingface.co/cp500/opensearch-neural-sparse-en-jp-ko)),
so a schema authored in English retrieves across EN / JA / KO.

## Install

```bash
pip install blackmagic-retrieval
```

The multilingual SPLADE model (~670 MB) is downloaded from Hugging Face on
first use and cached by the `transformers` library. For development:

```bash
git clone <repo>
pip install -e '.[dev]'
PYTHONPATH=src pytest tests/
```

## Quickstart

```python
from blackmagic import BlackMagic, BlackMagicConfig

bm = BlackMagic(BlackMagicConfig(
    schema_path="examples/automotive_schema.json",
    db_path=":memory:",
))

bm.ingest([
    {"text": "Toyota announced a $13.6B investment in battery production.",
     "id": "d1", "timestamp": "2026-03-01"},
    {"text": "Honda launches new EV in partnership with CATL.",
     "id": "d2", "timestamp": "2026-03-15"},
])

# Sparse retrieval with persona valence
result = bm.search("automakers investing in batteries", persona="investor")
for inf in result.infons[:5]:
    print(inf.subject, inf.predicate, inf.object, inf.confidence)

# Dempster-Shafer claim verification
v = bm.verify_claim("Toyota is aggressively investing in batteries.")
print(v.label, v.belief_supports, v.belief_refutes)

# MCTS multi-hop reasoning
m = bm.reason("Does the industry face supply risks?")
print(m.verdict, m.chains_discovered)

# GA imagination — MCTS-shaped output with dual verdicts
im = bm.imagine("What OEM–supplier partnerships might emerge?")
print(im.verdict, im.mcts_verdict)
for inf in im.imagined_infons[:5]:
    print(inf.subject, inf.predicate, inf.object,
          "fitness=", inf.fitness,
          "parents=", inf.parent_infon_ids)
```

## Features

- **Sparse retrieval** via splade-tiny → typed anchor projection
- **Persona valence** — investor / engineer / executive / regulator / analyst
- **Contrary views** — invert the evidential lens at query time
- **Temporal graph** — NEXT edges link facts across time per shared anchor
- **Constraint aggregation** — cross-document infon fusion
- **Dempster-Shafer** claim verification
- **Graph MCTS** for multi-hop reasoning
- **GA imagination** (new) — query-scoped genetic algorithm that proposes
  plausible counterfactual infons scored by grammar × logic × health,
  with output isomorphic to `MCTSResult`
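
To make the Dempster-Shafer bullet concrete, here is a minimal sketch of Dempster's rule of combination over a two-hypothesis frame (supports / refutes). It is not BlackMagic's implementation: the names `combine` and `belief` and the mass values are hypothetical, and how the library derives `belief_supports` and `belief_refutes` from infons is not shown in this README.

```python
from itertools import product

# Focal elements are frozensets over the frame {"supports", "refutes"}.
S = frozenset({"supports"})
R = frozenset({"refutes"})
THETA = S | R  # the whole frame, i.e. "don't know"


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass that lands on the empty set measures disagreement.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    # Renormalise so the remaining masses sum to 1.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


def belief(m, hypothesis):
    """Belief = total mass of focal elements contained in the hypothesis."""
    return sum(w for focal, w in m.items() if focal <= hypothesis)


# Hypothetical evidence from two documents about the same claim.
doc1 = {S: 0.6, THETA: 0.4}
doc2 = {S: 0.5, R: 0.2, THETA: 0.3}
fused = combine(doc1, doc2)
print(round(belief(fused, S), 3), round(belief(fused, R), 3))
```

Two sources that both lean towards "supports" reinforce each other, while the mass they place on contradictory hypotheses is discarded and the rest renormalised.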

## When to use cognition vs BlackMagic

| Aspect | cognition | BlackMagic |
|---|---|---|
| Languages | EN + JA/KO/ZH/...  | EN default, EN/JA/KO via multilingual flag |
| Encoder | splade-tiny or multilingual XLM-R | splade-tiny bundled; any HF SPLADE via config |
| Structural analysis (Kano, Kan, etc.) | Yes | No |
| Category theory extensions | Yes | No |
| Cloud backend (DynamoDB, Lambda) | Yes | No |
| MCP / agent tooling | Yes | No |
| GA imagination | No | Yes |
| Line count | ~5,700 | ~3,900 |

## Multilingual (EN / JA / KO)

Set `multilingual=True` to use the fine-tuned
[`cp500/opensearch-neural-sparse-en-jp-ko`](https://huggingface.co/cp500/opensearch-neural-sparse-en-jp-ko)
model. A Japanese or Korean sentence activates the same English anchor
positions as its English parallel — you write your schema once, in English,
and ingestion / search / verify / imagine all work across all three languages.

```python
cfg = BlackMagicConfig(
    schema_path="examples/automotive_schema.json",
    model_name="cp500/opensearch-neural-sparse-en-jp-ko",
    multilingual=True,         # self-encode anchors + exclusivity filter
    activation_threshold=0.25, # looser than splade-tiny's 0.3
    min_confidence=0.15,
)
bm = BlackMagic(cfg)

bm.ingest([
    {"id": "en1", "text": "Chevron announced a $15B investment in the Permian Basin.",
     "timestamp": "2026-04-01"},
    {"id": "ja1", "text": "シェブロンはパーミアン盆地で150億ドルの投資を発表した。",
     "timestamp": "2026-04-01"},
    {"id": "ko1", "text": "셰브런은 퍼미안 분지에 150억 달러 규모의 투자를 발표했다.",
     "timestamp": "2026-04-01"},
])
# All three docs produce infons; a JA query can retrieve KO evidence and vice versa.
```

**How it works.** The multilingual model doesn't fire on literal English
token IDs; it expands every anchor string (including JA/KO parallels) into
the same multilingual subword soup (Latin + CJK + Cyrillic + Arabic). On init,
BlackMagic self-encodes each anchor's surface forms through SPLADE, keeps
its top-K expansion positions, and subtracts positions that also activate for
another same-type anchor (crosstalk filter).
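
The top-K selection and crosstalk filter described above can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `top_k_positions` and `build_anchor_positions` are hypothetical names, the activation vectors are toy values standing in for real SPLADE output, and "also activates" is taken to mean "appears in the other anchor's top-K".

```python
def top_k_positions(activations, k):
    """Return the indices of the k largest activation values."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    return set(ranked[:k])


def build_anchor_positions(anchor_vectors, anchor_types, k):
    """Keep each anchor's top-k positions, minus those shared with
    another anchor of the same type (the crosstalk filter)."""
    top = {name: top_k_positions(vec, k)
           for name, vec in anchor_vectors.items()}
    filtered = {}
    for name, positions in top.items():
        rivals = set()
        for other, other_positions in top.items():
            if other != name and anchor_types[other] == anchor_types[name]:
                rivals |= other_positions
        filtered[name] = positions - rivals
    return filtered


# Toy activations over an 8-position vocabulary (hypothetical values).
vectors = {
    "toyota": [0.9, 0.8, 0.7, 0.0, 0.0, 0.1, 0.0, 0.0],
    "honda":  [0.0, 0.0, 0.6, 0.9, 0.8, 0.0, 0.1, 0.0],
    "invest": [0.0, 0.0, 0.9, 0.0, 0.0, 0.8, 0.7, 0.0],
}
types = {"toyota": "actor", "honda": "actor", "invest": "relation"}

positions = build_anchor_positions(vectors, types, k=3)
print(positions)
```

Position 2 is dropped from both `toyota` and `honda` because the two same-type anchors share it, but `invest` keeps it since it has no rival of its own type.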

**Benchmark.** On 200 held-out concepts × 9 language pairs (1,800 query×passage
pairs) from `cp500/multilingual-concept-training-kit`. Five of the nine pairs
are listed, followed by the overall score:

| Language pair | MRR@10 | Recall@10 |
|---|---:|---:|
| en→en | 1.000 | 1.000 |
| en→ja | 0.995 | 1.000 |
| ja→en | 0.998 | 1.000 |
| ko→en | 0.995 | 1.000 |
| ko→ko | 0.998 | 1.000 |
| **OVERALL** | **0.996** | **1.000** |

EN-vocab ratio on the top-50 activated dimensions: en 0.57, ja 0.55, ko 0.55.
JA/KO queries therefore project into English-like vocab positions at ~97% of
the English rate (0.55 / 0.57 ≈ 0.965).
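
For readers unfamiliar with the two metrics in the table, the sketch below uses their standard definitions. It is not taken from `examples/benchmark_multilingual.py`, whose exact computation is not shown here. The function names and the toy rankings are hypothetical.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(rankings)


def recall_at_k(rankings, relevant, k=10):
    """Mean fraction of relevant passages that appear within the top k."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        hits = len(set(ranked_ids[:k]) & relevant[query_id])
        total += hits / len(relevant[query_id])
    return total / len(rankings)


# Toy example: q1 ranks its passage first, q2 ranks it second.
rankings = {"q1": ["p1", "p7", "p3"], "q2": ["p9", "p2", "p4"]}
relevant = {"q1": {"p1"}, "q2": {"p2"}}

mrr = mrr_at_k(rankings, relevant)
recall = recall_at_k(rankings, relevant)
print(mrr, recall)
```

This also shows how a row can report Recall@10 of 1.000 with MRR@10 slightly below 1: the correct passage is always inside the top 10 but is occasionally not ranked first.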

## Testing

```bash
# splade-tiny only (fast, CI-friendly)
PYTHONPATH=src pytest tests/

# including multilingual integration (downloads the ~670MB model; see Install)
PYTHONPATH=src RUN_ML_TESTS=1 pytest tests/

# kit-scale retrieval benchmark
PYTHONPATH=src python examples/benchmark_multilingual.py
```

## License

Apache 2.0.
