Metadata-Version: 2.4
Name: synapse-core
Version: 0.5.4
Summary: Local RAG library — ingest files and SQLite, query semantically, pipe results into any AI agent
Project-URL: Homepage, https://github.com/adm-crow/synapse
Project-URL: Repository, https://github.com/adm-crow/synapse
Project-URL: Issues, https://github.com/adm-crow/synapse/issues
Author: adm-crow
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ai,chromadb,embeddings,llm,local-ai,nlp,rag,vector
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: chromadb
Requires-Dist: click
Requires-Dist: pypdf
Requires-Dist: python-docx
Requires-Dist: sentence-transformers
Provides-Extra: formats
Requires-Dist: beautifulsoup4; extra == 'formats'
Requires-Dist: ebooklib; extra == 'formats'
Requires-Dist: odfpy; extra == 'formats'
Requires-Dist: openpyxl; extra == 'formats'
Requires-Dist: python-pptx; extra == 'formats'
Provides-Extra: sentence
Requires-Dist: nltk; extra == 'sentence'
Description-Content-Type: text/markdown

<div align="center">
  <img src="logo.svg" alt="Synapse" width="130" /><br/><br/>

  <h1>⚡ synapse-core</h1>
  <p><strong>Local-first RAG for Python — ingest files, query semantically, feed any AI agent.</strong></p>

  [![CI](https://github.com/adm-crow/synapse/actions/workflows/ci.yml/badge.svg)](https://github.com/adm-crow/synapse/actions/workflows/ci.yml)
  [![tests](https://img.shields.io/badge/tests-107%20passing-brightgreen?style=flat-square)](tests/)
  [![PyPI](https://img.shields.io/pypi/v/synapse-core?style=flat-square&color=blue)](https://pypi.org/project/synapse-core/)
  [![Python](https://img.shields.io/badge/python-3.11%2B-blue?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
  [![License](https://img.shields.io/badge/license-Apache%202.0-green?style=flat-square)](LICENSE)
  [![Downloads](https://img.shields.io/pypi/dm/synapse-core?style=flat-square&color=orange)](https://pypi.org/project/synapse-core/)

</div>

---

**synapse** turns your local files and SQLite databases into a searchable vector store. No cloud, no API key, no infrastructure required.

```
Files / SQLite  ──►  Extract  ──►  Chunk  ──►  Embed  ──►  ChromaDB  ──►  Your AI Agent
```

| | Feature | |
|:---:|:---|:---|
| 📄 | **12 file formats** | `txt` `md` `csv` `pdf` `docx` `json` `jsonl` `html` `pptx` `xlsx` `epub` `odt` |
| 🗄️ | **SQLite ingestion** | Embed table records alongside files in the same collection |
| ✂️ | **Smart chunking** | Word-boundary and sentence-aware, configurable size & overlap |
| 🧠 | **Local embeddings** | `sentence-transformers` — no API key, fully offline |
| 🔁 | **Incremental ingestion** | SHA-256 hash — skip unchanged files on re-runs |
| 🔍 | **Semantic search** | Ranked results with scores, source path, and document metadata |
| 🖥️ | **CLI** | `synapse ingest`, `ingest-sqlite`, `query --ai`, `sources`, `purge`, `reset` |
| 🤖 | **Agent-agnostic** | Works with Anthropic, OpenAI, Ollama — anything |

---

## Install

```bash
pip install synapse-core
# or
uv add synapse-core
```

Install the optional extras to add more file formats (`.html` `.pptx` `.xlsx` `.epub` `.odt`) and sentence-aware chunking:

```bash
pip install "synapse-core[formats,sentence]"   # quotes keep zsh from globbing the brackets
```

---

## Quick start

```python
from synapse_core import ingest, query

# Ingest a folder — persists to ./synapse_db by default
ingest("./docs")

# Query semantically
results = query("what is the refund policy?", n_results=4)
for r in results:
    print(f"[{r['score']:.2f}] {r['source']}")
    print(r['text'])
```

Each result is a plain dict:

```python
{
    "text":        "chunk content...",
    "source":      "/abs/path/to/file.txt",
    "source_type": "file",               # "file" or "sqlite"
    "score":       0.91,                 # relevance 0–1, higher is better
    "distance":    0.09,                 # raw ChromaDB L2 distance
    "chunk":       2,                    # index within the source document
    "doc_title":   "Company Policy",     # from PDF/DOCX/HTML/PPTX metadata
    "doc_author":  "Jane Doe",
    "doc_created": "2024-01-15T...",
}
```
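Because results are plain dicts, post-processing needs nothing beyond the standard library. The sketch below drops low-scoring chunks and groups the rest by source. `filter_results` and the `0.5` threshold are hypothetical illustrations, not part of synapse-core, and the sample data is hand-written in the shape shown above.

```python
from collections import defaultdict


def filter_results(results, min_score=0.5):
    """Keep results at or above min_score, grouped by source path.

    Hypothetical helper, not part of synapse-core. The 0.5 default is an
    arbitrary illustration; tune it against your own data.
    """
    grouped = defaultdict(list)
    for r in results:
        if r["score"] >= min_score:
            grouped[r["source"]].append(r)
    # Within each source, order chunks by their position in the document.
    for chunks in grouped.values():
        chunks.sort(key=lambda r: r["chunk"])
    return dict(grouped)


# Hand-written sample data using a subset of the keys shown above
sample = [
    {"text": "b", "source": "/docs/policy.txt", "score": 0.91, "chunk": 2},
    {"text": "a", "source": "/docs/policy.txt", "score": 0.74, "chunk": 0},
    {"text": "c", "source": "/docs/faq.md", "score": 0.31, "chunk": 5},
]
grouped = filter_results(sample)
```

Sorting by `chunk` puts neighbouring passages from the same document back in reading order before they are handed to an LLM.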

> [!TIP]
> Run `ingest()` once — the collection persists on disk. Subsequent calls are idempotent (upsert, never duplicates). Use `incremental=True` to skip unchanged files.

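The idea behind hash-based incremental ingestion can be sketched in a few lines: hash each file, compare against the hash recorded last time, and reprocess only on a mismatch. This is a minimal illustration, not synapse-core's implementation. `file_sha256`, `changed_files`, and the `known_hashes` mapping are hypothetical names, and where the library actually stores its hashes is not covered here.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 64 KiB blocks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths, known_hashes):
    """Return the paths whose current hash differs from the stored one.

    known_hashes is a hypothetical {path_string: hex_digest} mapping that is
    updated in place, so a second call with unchanged files returns [].
    """
    changed = []
    for path in paths:
        current = file_sha256(path)
        if known_hashes.get(str(path)) != current:
            changed.append(path)
            known_hashes[str(path)] = current
    return changed
```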
---

## CLI

```bash
# Ingest
synapse ingest ./docs
synapse ingest ./docs --incremental          # skip unchanged files
synapse ingest ./docs --chunking sentence    # sentence-aware splitting
synapse ingest-sqlite ./data.db --table articles

# Query
synapse query "what is the refund policy?"   # raw chunks

# AI-powered answer — set your key first:
# macOS/Linux:        export ANTHROPIC_API_KEY="sk-ant-..."
# Windows PowerShell: $env:ANTHROPIC_API_KEY = "sk-ant-..."

synapse query "what is the refund policy?" --ai
synapse query "..." --ai --provider anthropic --model claude-sonnet-4-5
synapse query "..." --ai --provider openai   --model gpt-4o
synapse query "..." --ai --provider ollama   --model mistral

# Manage
synapse sources          # list all indexed sources
synapse purge            # remove chunks from deleted files
synapse reset --yes      # wipe the entire collection
```

Every command accepts `--db PATH` and `--collection NAME` to target a specific store. Run `synapse <command> --help` for all options.

---

## Connecting an AI agent

synapse handles retrieval — you wire it to any LLM. Full example with the **Anthropic SDK**:

```bash
pip install synapse-core anthropic
# macOS/Linux:        export ANTHROPIC_API_KEY="sk-ant-..."
# Windows PowerShell: $env:ANTHROPIC_API_KEY = "sk-ant-..."
```

```python
import anthropic
from anthropic.types import TextBlock
from synapse_core import ingest, query

ingest("./docs")  # run once

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def ask(question: str) -> str:
    chunks = query(question, n_results=4)
    context = "\n\n".join(r["text"] for r in chunks)

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=(
            "You are a helpful assistant. "
            "Answer using ONLY the context below. "
            "If the answer is not in the context, say so.\n\n"
            f"CONTEXT:\n{context}"
        ),
        messages=[{"role": "user", "content": question}],
    )
    text_block = next((b for b in response.content if isinstance(b, TextBlock)), None)
    if text_block is None:
        raise RuntimeError("Anthropic response contained no text block.")
    return text_block.text

print(ask("What is the refund policy?"))
```
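The `ask()` function above joins chunk texts directly. One possible variation is to label each chunk with its source so the model can say where an answer came from, and to cap the total context length. This is a sketch under assumptions: `build_context` and its `max_chars` budget are hypothetical and not part of synapse-core, and the sample chunks are hand-written.

```python
def build_context(chunks, max_chars=6000):
    """Format retrieved chunks as numbered, source-labelled blocks.

    Hypothetical helper, not part of synapse-core. Stops adding chunks once
    max_chars would be exceeded; the first chunk is always included.
    """
    parts = []
    used = 0
    for i, r in enumerate(chunks, start=1):
        block = f"[{i}] {r['source']}\n{r['text']}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Hand-written sample chunks with the two keys this helper reads
chunks = [
    {"text": "Refunds are issued within 30 days.", "source": "/docs/policy.txt"},
    {"text": "Contact support to start a refund.", "source": "/docs/faq.md"},
]
context = build_context(chunks)
```

To try it, replace the `"\n\n".join(...)` line in `ask()` with `context = build_context(chunks)`; the rest of the function stays the same.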

> [!NOTE]
> Swap `anthropic` for `openai`, `ollama`, or any other SDK — the `query()` call stays the same.

---

## API reference

<details>
<summary><strong>ingest()</strong></summary>

```python
ingest(
    source_dir      = "./docs",             # folder to scan (recursive)
    db_path         = "./synapse_db",       # ChromaDB persistence path
    collection_name = "synapse",
    chunk_size      = 1000,                 # target characters per chunk
    overlap         = 200,                  # overlap between consecutive chunks
    min_chunk_size  = 50,                   # discard chunks shorter than this
    embedding_model = "all-MiniLM-L6-v2",
    incremental     = False,                # skip unchanged files (SHA-256)
    chunking        = "word",               # "word" or "sentence" (requires [sentence])
    verbose         = True,
)
```
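To make `chunk_size`, `overlap`, and `min_chunk_size` concrete, here is a minimal sketch of word-boundary chunking with overlap. It is an illustration only: `chunk_words` is a hypothetical function, and synapse-core's own chunker may differ in detail.

```python
def chunk_words(text, chunk_size=1000, overlap=200, min_chunk_size=50):
    """Split text into chunks of at most chunk_size characters on word boundaries.

    Illustrative sketch, not synapse-core's chunker. A single word longer
    than chunk_size is kept whole rather than cut.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        # Grow the chunk word by word until the character budget is reached.
        end = start
        length = 0
        while end < len(words):
            added = len(words[end]) + (1 if end > start else 0)
            if end > start and length + added > chunk_size:
                break
            length += added
            end += 1
        chunk = " ".join(words[start:end])
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
        if end >= len(words):
            break
        # Step back to repeat up to `overlap` characters of whole words,
        # while always advancing by at least one word.
        back = end
        carried = 0
        while back > start + 1 and carried + len(words[back - 1]) + 1 <= overlap:
            carried += len(words[back - 1]) + 1
            back -= 1
        start = back
    return chunks
```

With the defaults, each chunk begins with up to 200 characters from the end of the previous one, so text near a boundary appears in both neighbouring chunks.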

</details>

<details>
<summary><strong>ingest_sqlite()</strong></summary>

```python
ingest_sqlite(
    db_path         = "./data.db",
    table           = "articles",
    columns         = None,                 # columns to embed (None = all)
    id_column       = "id",                 # primary key for stable chunk IDs
    row_template    = None,                 # optional "{title}: {body}" format string
    chroma_path     = "./synapse_db",
    collection_name = "synapse",
    chunk_size      = 1000,
    overlap         = 200,
    min_chunk_size  = 50,
    embedding_model = "all-MiniLM-L6-v2",
    chunking        = "word",
    verbose         = True,
)
```
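The `row_template` parameter is described as an optional `"{title}: {body}"` format string. The sketch below shows how such a template could turn rows into text using only the standard library. It is an assumption about the general approach, not synapse-core's code: `render_rows` is a hypothetical helper, and the fallback `column: value` layout is invented for illustration.

```python
import sqlite3


def render_rows(conn, table, row_template=None, columns=None):
    """Yield one text string per row, ready for chunking and embedding.

    Hypothetical sketch, not part of synapse-core. With a template, each row
    is rendered via str.format; otherwise as "column: value" lines.
    """
    conn.row_factory = sqlite3.Row
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    for row in conn.execute(f"SELECT * FROM {quoted}"):
        record = dict(row)
        if row_template is not None:
            yield row_template.format(**record)
        else:
            keys = columns or list(record)
            yield "\n".join(f"{k}: {record[k]}" for k in keys)


# In-memory database with one hand-written sample row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("INSERT INTO articles (title, body) VALUES ('Refunds', 'Within 30 days.')")
texts = list(render_rows(conn, "articles", row_template="{title}: {body}"))
```

A template keeps the embedded text close to natural language, which tends to suit sentence-embedding models better than raw column dumps.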

</details>

<details>
<summary><strong>query()</strong></summary>

```python
query(
    text            = "what is the refund policy?",
    db_path         = "./synapse_db",
    collection_name = "synapse",
    n_results       = 5,
    embedding_model = "all-MiniLM-L6-v2",  # must match the model used at ingest
)
```

Returns a list of dicts, ordered by relevance, each with the keys `text`, `source`, `source_type`, `score`, `distance`, `chunk`, `doc_title`, `doc_author`, and `doc_created`.

</details>

<details>
<summary><strong>purge() · reset() · sources()</strong></summary>

```python
from synapse_core import purge, reset, sources

sources()   # list all ingested source paths
purge()     # remove chunks whose source file no longer exists on disk
reset()     # delete the entire collection (irreversible)
```
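Before purging, it can be useful to see which indexed sources no longer exist on disk. A minimal sketch, assuming you already have a list of source paths as strings (the exact return type of `sources()` is not specified here, and `missing_sources` is a hypothetical helper):

```python
from pathlib import Path


def missing_sources(source_paths):
    """Return the paths from source_paths that no longer exist on disk.

    Hypothetical helper, not part of synapse-core; it only inspects the
    filesystem and never touches the collection.
    """
    return [p for p in source_paths if not Path(p).exists()]
```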

All three accept `db_path` and `collection_name`.

```python
# Logging — colored INFO by default
import logging, synapse_core
synapse_core.setup_logging(log_file="ingest.log")       # persist to file
synapse_core.setup_logging(level=logging.DEBUG)          # more verbose
synapse_core.setup_logging(level=logging.CRITICAL)       # silence
```

</details>

---

## Architecture

```
synapse/
├── synapse_db/              ← ChromaDB writes here (auto-created)
└── synapse_core/
    ├── __init__.py          ← public API
    ├── cli.py               ← synapse ingest · ingest-sqlite · query · sources · purge · reset
    ├── pipeline.py          ← ingest() · query() · purge() · reset() · sources()
    ├── sqlite_ingester.py   ← ingest_sqlite()
    ├── extractors.py        ← 12 formats + document metadata extraction
    ├── chunker.py           ← word-boundary & sentence-aware chunking
    └── logger.py            ← colored logger · setup_logging()
```

---

## Roadmap

- [x] 12 file formats — `txt`, `md`, `pdf`, `docx`, `csv`, `json`, `jsonl`, `html`, `pptx`, `xlsx`, `epub`, `odt`
- [x] Word-boundary & sentence-aware chunking
- [x] Local embeddings — `sentence-transformers`, fully offline
- [x] ChromaDB — persistent vector store, zero config
- [x] Idempotent ingestion — upsert, never duplicates
- [x] Incremental ingestion — SHA-256 hash check
- [x] Document metadata — title, author, creation date
- [x] Collection management — `purge()`, `reset()`, `sources()`
- [x] SQLite ingestion — `ingest_sqlite()`
- [x] CI/CD — GitHub Actions, Python 3.11–3.13
- [x] PyPI release — `pip install synapse-core`
- [x] CLI — `synapse ingest`, `query --ai`, `purge`, `reset`, `sources`
- [ ] File watcher — `watch()` auto-ingest on change
- [ ] Pluggable embedders — OpenAI, Cohere, HuggingFace Inference API
- [ ] Pluggable vector stores — Qdrant, FAISS, Weaviate
- [ ] Re-ranking — cross-encoder re-ranking of retrieved chunks

---

<div align="center">
  <sub><a href="https://pypi.org/project/synapse-core/">PyPI</a> · <a href="LICENSE">Apache 2.0</a></sub>
</div>
