Metadata-Version: 2.4
Name: tokenshrink
Version: 0.1.0
Summary: Cut your AI costs 50-80%. FAISS retrieval + LLMLingua compression.
Project-URL: Homepage, https://tokenshrink.dev
Project-URL: Repository, https://github.com/MusashiMiyamoto1-cloud/tokenshrink
Project-URL: Documentation, https://tokenshrink.dev/docs
Author-email: Musashi <musashimiyamoto1@icloud.com>
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,compression,context,cost-reduction,faiss,llm,llmlingua,rag,tokens
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: numpy>=1.24.0
Requires-Dist: sentence-transformers>=2.2.0
Provides-Extra: all
Requires-Dist: llmlingua>=0.2.0; extra == 'all'
Requires-Dist: pytest>=7.0.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Provides-Extra: compression
Requires-Dist: llmlingua>=0.2.0; extra == 'compression'
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# TokenShrink

**Cut your AI costs 50-80%.** FAISS semantic retrieval + LLMLingua compression.

Stop loading entire files into your prompts. Load only what's relevant, compressed.

## Quick Start

```bash
pip install tokenshrink

# Index your docs
tokenshrink index ./docs

# Get compressed context
tokenshrink query "What are the API limits?" --compress
```

## Why TokenShrink?

| Without | With TokenShrink |
|---------|------------------|
| Load entire file (5000 tokens) | Load relevant chunks (200 tokens) |
| $0.15 per query | $0.03 per query |
| Slow responses | Fast responses |
| Hit context limits | Stay under limits |

**Real numbers:** 50-80% token reduction on typical RAG workloads.
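
The cost row in the table is plain per-token arithmetic. A minimal sketch, assuming a flat input price of $0.03 per 1,000 tokens (a hypothetical rate, chosen only because it reproduces the table's $0.15 for a 5000-token prompt; real pricing varies by model and provider). The function names are illustrative and not part of TokenShrink:

```python
# Hypothetical flat rate; chosen to match the table's $0.15 for 5000 tokens.
PRICE_PER_1K_TOKENS = 0.03


def query_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Input-token cost of a single query, in dollars."""
    return tokens / 1000 * price_per_1k


def tokens_after_reduction(tokens: int, reduction: float) -> int:
    """Tokens remaining after removing a fraction (0.0-1.0) of them."""
    return round(tokens * (1 - reduction))


baseline = 5000
for reduction in (0.5, 0.8):
    remaining = tokens_after_reduction(baseline, reduction)
    print(f"{reduction:.0%} reduction: {remaining} tokens, ${query_cost(remaining):.3f} per query")
```

Under these assumptions, an 80% reduction takes a 5000-token query from $0.15 to $0.03, and a 50% reduction takes it to $0.075.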

## Installation

```bash
# Basic (retrieval only)
pip install tokenshrink

# With compression (recommended)
pip install "tokenshrink[compression]"  # quotes keep shells like zsh from expanding the brackets
```

## Usage

### CLI

```bash
# Index files
tokenshrink index ./docs
tokenshrink index ./src --extensions .py,.md

# Query (retrieval only)
tokenshrink query "How do I authenticate?"

# Query with compression
tokenshrink query "How do I authenticate?" --compress

# View stats
tokenshrink stats

# JSON output (for scripts)
tokenshrink query "question" --json
```
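
To call the CLI from a script, something like the following may work. It is a sketch, not a documented interface: it assumes `--json` and `--compress` can be combined, and it makes no assumption about the shape of the JSON that comes back. `build_query_command` and `run_query` are hypothetical helper names.

```python
import json
import subprocess


def build_query_command(question: str, compress: bool = False) -> list[str]:
    """Build the argv for `tokenshrink query <question> --json`."""
    cmd = ["tokenshrink", "query", question, "--json"]
    if compress:
        cmd.append("--compress")  # assumption: the two flags can be combined
    return cmd


def run_query(question: str, compress: bool = False):
    """Run the CLI and parse its stdout as JSON (output schema not assumed)."""
    completed = subprocess.run(
        build_query_command(question, compress),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


# Show the command that would be executed, without running it.
print(build_query_command("How do I authenticate?", compress=True))
```

Passing the command as a list avoids shell quoting problems when the question contains quotes or spaces.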

### Python API

```python
from tokenshrink import TokenShrink

# Initialize
ts = TokenShrink()

# Index your files
ts.index("./docs")

# Get compressed context
result = ts.query("What are the rate limits?")

print(result.context)      # Ready for your LLM
print(result.savings)      # "Saved 65% (1200 → 420 tokens)"
print(result.sources)      # ["api.md", "limits.md"]
```
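
The savings string is a percentage of the original token count. The sketch below reproduces the format shown in the comment above; how the library computes it internally is an assumption, and `format_savings` is a hypothetical helper.

```python
def format_savings(original_tokens: int, compressed_tokens: int) -> str:
    """Describe the token reduction as a percentage of the original count."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    saved = 1 - compressed_tokens / original_tokens
    return f"Saved {saved:.0%} ({original_tokens} → {compressed_tokens} tokens)"


print(format_savings(1200, 420))  # Saved 65% (1200 → 420 tokens)
```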

### Integration Examples

**With OpenAI:**

```python
from tokenshrink import TokenShrink
from openai import OpenAI

ts = TokenShrink()
ts.index("./knowledge")

client = OpenAI()

def ask(question: str) -> str:
    # Get relevant, compressed context
    ctx = ts.query(question)
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Context:\n{ctx.context}"},
            {"role": "user", "content": question}
        ]
    )
    
    print(f"Token savings: {ctx.savings}")
    return response.choices[0].message.content

answer = ask("What's the refund policy?")
```

**With LangChain:**

```python
from tokenshrink import TokenShrink
from langchain_core.prompts import PromptTemplate

ts = TokenShrink()
ts.index("./docs")

def get_context(query: str) -> str:
    result = ts.query(query)
    return result.context

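# Use in your chain
template = PromptTemplate(
    input_variables=["context", "question"],
    template="Context:\n{context}\n\nQuestion: {question}"
)

# Fill the template with retrieved context before sending it to your LLM
question = "What are the rate limits?"
prompt = template.format(context=get_context(question), question=question)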
```

## How It Works

```
┌──────────┐     ┌───────────┐     ┌────────────┐
│  Files   │ ──► │  Indexer  │ ──► │ FAISS Index│
└──────────┘     │ (MiniLM)  │     └────────────┘
                 └───────────┘            │
                                          ▼
┌──────────┐     ┌───────────┐     ┌────────────┐
│ Question │ ──► │  Search   │ ──► │  Relevant  │
└──────────┘     │           │     │  Chunks    │
                 └───────────┘     └────────────┘
                                          │
                                          ▼
                               ┌────────────────┐
                               │  Compressor    │
                               │ (LLMLingua-2)  │
                               └────────────────┘
                                          │
                                          ▼
                               ┌────────────────┐
                               │ Optimized      │
                               │ Context        │
                               └────────────────┘
```

1. **Index**: Chunks your files, creates embeddings with MiniLM
2. **Search**: Finds relevant chunks via semantic similarity
3. **Compress**: Removes redundancy while preserving meaning

## Configuration

```python
ts = TokenShrink(
    index_dir=".tokenshrink",    # Where to store the index
    model="all-MiniLM-L6-v2",    # Embedding model
    chunk_size=512,              # Words per chunk
    chunk_overlap=50,            # Overlap between chunks
    device="auto",               # auto, mps, cuda, cpu
    compression=True,            # Enable LLMLingua
)
```

## Supported File Types

Default: `.md`, `.txt`, `.py`, `.json`, `.yaml`, `.yml`

Custom:
```bash
tokenshrink index ./src --extensions .py,.ts,.js,.md
```
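
A sketch of how an `--extensions` value could be parsed and applied when walking a directory. The default set mirrors the list above; the parsing rules (lowercasing, adding a missing dot) and the helper names are assumptions, not documented behavior.

```python
from collections.abc import Set
from pathlib import Path

DEFAULT_EXTENSIONS = frozenset({".md", ".txt", ".py", ".json", ".yaml", ".yml"})


def parse_extensions(value: str) -> set[str]:
    """Turn '.py,.ts,md' into {'.py', '.ts', '.md'} (lowercased, dot-prefixed)."""
    exts = set()
    for item in value.split(","):
        item = item.strip().lower()
        if item:
            exts.add(item if item.startswith(".") else f".{item}")
    return exts


def find_files(root: str, extensions: Set[str] = DEFAULT_EXTENSIONS) -> list[Path]:
    """Recursively list files under root whose suffix is in extensions."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


print(sorted(parse_extensions(".py,.ts,.js,.md")))
```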

## Performance

| Metric | Value |
|--------|-------|
| Index 1000 files | ~30 seconds |
| Search latency | <50ms |
| Compression | ~200ms |
| Token reduction | 50-80% |

## Requirements

- Python 3.10+
- 4GB RAM (8GB for compression)
- Apple Silicon: MPS acceleration (optional)
- NVIDIA: CUDA acceleration (optional)
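
The `device="auto"` setting presumably picks the fastest backend available. One way such a choice could be made with PyTorch is sketched below; the priority order and the `pick_device` helper are assumptions, not TokenShrink's actual logic.

```python
def pick_device(preferred: str = "auto") -> str:
    """Resolve 'auto' to 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    if preferred != "auto":
        return preferred
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, so nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```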

## FAQ

**Q: Do I need LLMLingua?**  
A: No. Retrieval works without it and still saves 60-70% by loading only the relevant chunks. Adding compression gives an extra 20-30% savings.

**Q: Does it work with non-English?**  
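A: Retrieval works with multilingual content, but quality depends on the embedding model: the default `all-MiniLM-L6-v2` is trained mainly on English text, so a multilingual model such as `paraphrase-multilingual-MiniLM-L12-v2` may retrieve better. Compression is English-optimized.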

**Q: How do I update the index?**  
A: Just run `tokenshrink index` again. It detects changed files automatically.
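
How changes are detected is not described here. One common approach is to keep a manifest of content hashes and re-index only files whose hash differs. The sketch below shows that idea; the manifest file and function names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths: list[Path], manifest_path: Path) -> list[Path]:
    """Return files that are new or modified since the last call, then update the manifest."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(p): file_digest(p) for p in paths}
    manifest_path.write_text(json.dumps(new, indent=2))
    return [p for p in paths if old.get(str(p)) != new[str(p)]]


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "api.md"
    manifest = Path(tmp) / "manifest.json"
    doc.write_text("v1")
    first = changed_files([doc], manifest)   # new file
    second = changed_files([doc], manifest)  # unchanged
    doc.write_text("v2")
    third = changed_files([doc], manifest)   # modified
    print(len(first), len(second), len(third))
```

Hashing contents rather than comparing modification times avoids re-indexing files that were touched but not edited.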

## Uninstall

```bash
pip uninstall tokenshrink
rm -rf .tokenshrink  # Remove local index
```

---

Built by [Musashi](https://github.com/MusashiMiyamoto1-cloud) · Part of [Agent Guard](https://agentguard.co)
