Metadata-Version: 2.4
Name: infershrink
Version: 0.2.4
Summary: Cut LLM costs 80%+. Smart retrieval, prompt compression, model routing — one package.
Project-URL: Homepage, https://musashimiyamoto1-cloud.github.io/infershrink-site/
Project-URL: Documentation, https://musashimiyamoto1-cloud.github.io/infershrink-site/docs/
Project-URL: Pricing, https://musashimiyamoto1-cloud.github.io/infershrink-site/docs/pricing.html
Project-URL: Contact, https://musashimiyamoto1-cloud.github.io/infershrink-site/#contact
Author-email: Musashi Miyamoto <MusashiMiyamoto1@icloud.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ai,anthropic,cost-optimization,faiss,llm,model-routing,openai,prompt-compression,retrieval
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Provides-Extra: all
Requires-Dist: anthropic>=0.18; extra == 'all'
Requires-Dist: faiss-cpu>=1.7.4; extra == 'all'
Requires-Dist: llmlingua>=0.2.0; extra == 'all'
Requires-Dist: mcp>=1.8.0; extra == 'all'
Requires-Dist: numpy>=1.24.0; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.18; extra == 'anthropic'
Provides-Extra: compression
Requires-Dist: llmlingua>=0.2.0; extra == 'compression'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: mcp
Requires-Dist: mcp>=1.8.0; extra == 'mcp'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: retrieval
Requires-Dist: faiss-cpu>=1.7.4; extra == 'retrieval'
Requires-Dist: numpy>=1.24.0; extra == 'retrieval'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'retrieval'
Description-Content-Type: text/markdown

# InferShrink

**One package to cut your LLM costs 80%+.**

Five capabilities, one install: **Retrieve → Compress → Classify → Route → Track.**

InferShrink wraps your OpenAI or Anthropic client to automatically compress prompts and route requests to the cheapest model that can handle the task. Add FAISS-powered semantic retrieval and LLMLingua compression for the full stack.

[![PyPI](https://img.shields.io/pypi/v/infershrink)](https://pypi.org/project/infershrink/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/infershrink)](https://pypi.org/project/infershrink/)

## Installation

```bash
# Core (zero dependencies — routing + classification + tracking)
pip install infershrink

# With semantic retrieval (FAISS + sentence-transformers)
pip install infershrink[retrieval]

# With prompt compression (LLMLingua)
pip install infershrink[compression]

# Everything
pip install infershrink[all]
```

> **Note:** Retrieval features require Python ≥ 3.10. Core features work on 3.9+.

## Feature Matrix

| Feature | Install | Dependencies | Size |
|---------|---------|--------------|------|
| **Classify** | `pip install infershrink` | None | ~50KB |
| **Route** | `pip install infershrink` | None | ~50KB |
| **Track** | `pip install infershrink` | None | ~50KB |
| **Wrapper** | `pip install infershrink` | openai/anthropic (your existing client) | ~50KB |
| **Compress** | `pip install infershrink[compression]` | llmlingua | ~2GB model download |
| **Retrieve** | `pip install infershrink[retrieval]` | faiss-cpu, sentence-transformers | ~400MB model download |
| **Deduplicate** | `pip install infershrink[retrieval]` | (included with retrieval) | — |

**Core features** (classify, route, track, wrapper) add **zero dependencies** of their own and work immediately; the wrapper only uses the OpenAI or Anthropic client you already have installed. Optional features require additional installs and download ML models on first use.
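
If you want to know which optional features are usable before calling them, a small standard-library check is enough. The sketch below is not part of InferShrink: `OPTIONAL_FEATURES` and `available_features` are hypothetical names, and the module names are simply the import names of the packages listed in the extras.

```python
from importlib.util import find_spec

# Import names of the packages behind each optional extra (illustrative table).
OPTIONAL_FEATURES = {
    "retrieval": ["faiss", "sentence_transformers", "numpy"],
    "compression": ["llmlingua"],
}


def available_features(find=find_spec):
    """Return {feature: bool}: True when every required module can be located."""
    return {
        feature: all(find(module) is not None for module in modules)
        for feature, modules in OPTIONAL_FEATURES.items()
    }


for feature, ok in available_features().items():
    print(f"{feature}: {'installed' if ok else 'missing'}")
```

The `find` parameter exists only so the check can be exercised without installing anything.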

## Quick Start — Simple (Model Routing)

Zero dependencies. One line of code.

```python
import openai
from infershrink import optimize

client = optimize(openai.Client())

# Use exactly as before — InferShrink handles the rest
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# ↑ This simple question was automatically routed to gpt-4o-mini
#   saving you ~95% on this request.
#   Complex tasks are kept on gpt-4o automatically.
```

**Streaming works transparently:**

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
)
for chunk in stream:
    # delta.content is None on chunks that carry no text (e.g. the final chunk)
    print(chunk.choices[0].delta.content or "", end="")
# ↑ Still routed to gpt-4o-mini. Classification happens before the stream starts.
```

### Same-Provider Routing

InferShrink **never routes across providers**. Your OpenAI client stays on OpenAI. Your Anthropic client stays on Anthropic. No surprise 404s.

```
gpt-4o         → gpt-4o-mini       ✅ (same provider)
claude-opus    → claude-sonnet      ✅ (same provider)
gpt-4o         → claude-sonnet      ❌ (never happens)
```
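
The same-provider rule can be pictured as a guard around the downgrade lookup. The sketch below is an illustration, not InferShrink's implementation: the prefix table, the `CHEAPER` mapping, and the function names are all assumptions, and the model names are the ones used in the diagram above.

```python
from typing import Optional

# Hypothetical prefix table; real provider detection may work differently.
PROVIDER_PREFIXES = {"gpt-": "openai", "claude-": "anthropic"}

# Hypothetical downgrade candidates, taken from the diagram above.
CHEAPER = {"gpt-4o": "gpt-4o-mini", "claude-opus": "claude-sonnet"}


def provider_of(model: str) -> Optional[str]:
    """Guess the provider from the model-name prefix, or None if unknown."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    return None


def downgrade(model: str) -> str:
    """Return a cheaper model only when it belongs to the same provider."""
    candidate = CHEAPER.get(model, model)
    if provider_of(candidate) != provider_of(model):
        return model  # refuse any cross-provider switch
    return candidate


print(downgrade("gpt-4o"))       # gpt-4o-mini
print(downgrade("claude-opus"))  # claude-sonnet
```

Even if the lookup table were misconfigured to point at another provider's model, the guard returns the original model unchanged.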

## Full Stack — Retrieval + Compression + Routing

For maximum savings, combine semantic retrieval with model routing:

```python
from infershrink import TokenShrink, optimize
import openai

# 1. Build a retrieval index over your docs
ts = TokenShrink()
ts.index("./docs")

# 2. Retrieve compressed context
result = ts.query("What are the API rate limits?")
print(result.savings)  # "Saved 72% (1200 → 336 tokens)"

# 3. Route to cheapest capable model
client = optimize(openai.Client())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using this context:\n" + result.context},
        {"role": "user", "content": "What are the API rate limits?"},
    ],
)
# Context was retrieved, compressed, and routed to the cheapest model.
```

## How It Works

```
                        ┌─────────────────────────────┐
    Your Code           │         InferShrink          │         LLM API
                        │                              │
                        │  1. Retrieve relevant docs   │
  ┌──────────┐   ──►   │     (FAISS + embeddings)     │
  │ question  │         │                              │
  │ docs/     │         │  2. Compress context         │
  └──────────┘         │     (LLMLingua / adaptive)   │   ──►  ┌──────────┐
                        │                              │        │ Cheapest │
                        │  3. Classify complexity      │        │ capable  │
                        │     (rule-based)             │   ◄──  │ model    │
                        │                              │        └──────────┘
                        │  4. Route to cheap model     │
                        │                              │
                        │  5. Track savings            │
                        └─────────────────────────────┘
```

### The Five Capabilities

1. **Retrieve** — FAISS semantic search finds the most relevant chunks from your docs
2. **Compress** — REFRAG-inspired adaptive compression: important chunks keep more detail, filler gets compressed harder. Cross-passage deduplication removes redundancy.
3. **Classify** — Rule-based classifier analyzes messages for complexity signals (code blocks, tool calls, length, sensitive keywords)
4. **Route** — Simple tasks go to cheap models (gpt-4o-mini); complex tasks stay on powerful ones (gpt-4o, claude-opus-4-6)
5. **Track** — Every request's cost savings are tracked so you can see your ROI
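
To make the adaptive-compression idea in step 2 concrete, here is one way a token budget could be split so that higher-scoring chunks keep more of their text. This is a sketch under assumptions, not the library's algorithm: `allocate_budgets` is a hypothetical function, and the proportional rule is just one simple choice.

```python
def allocate_budgets(chunk_tokens, scores, total_budget):
    """Split total_budget across chunks in proportion to their importance score.

    Each chunk's share is capped at its own length, since a chunk cannot
    keep more tokens than it has.
    """
    total_score = sum(scores)
    if total_score <= 0:
        raise ValueError("scores must sum to a positive value")
    return [
        min(tokens, int(total_budget * score / total_score))
        for tokens, score in zip(chunk_tokens, scores)
    ]


# Three chunks; the first is judged twice as important as the others.
budgets = allocate_budgets([400, 400, 100], [0.5, 0.25, 0.25], total_budget=500)
print(budgets)  # [250, 125, 100]
```

The third chunk hits its cap of 100 tokens, so it is kept whole while the first two are compressed to 250 and 125 tokens.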

### Complexity Levels

| Level | Signals | Default Model |
|-------|---------|---------------|
| **SIMPLE** | Short messages, no code, basic questions (<500 tokens) | gpt-4o-mini |
| **MODERATE** | Some code, medium length, summarization | gpt-4o |
| **COMPLEX** | Heavy code, multi-step reasoning, long prompts | gpt-5.2 / claude-opus-4-6 |
| **SECURITY_CRITICAL** | Passwords, API keys, financial data | *(never downgraded, never compressed)* |
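
A rule-based classifier along these lines can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the keyword list, the thresholds (other than the 500-token boundary from the table), the 4-characters-per-token estimate, and the names `Level` and `classify_text`. The real classifier may weigh its signals differently.

```python
import re
from enum import Enum


class Level(Enum):
    SIMPLE = 1
    MODERATE = 2
    COMPLEX = 3
    SECURITY_CRITICAL = 4


# Hypothetical keyword list, chosen only for illustration.
SENSITIVE = re.compile(r"\b(password|api[ _-]?key|secret|credit card)\b", re.IGNORECASE)
FENCE = "`" * 3  # Markdown code-fence marker


def estimate_tokens(text):
    """Rough rule of thumb: about 4 characters per token."""
    return len(text) // 4


def classify_text(text):
    """Map a prompt to a complexity level using simple surface signals."""
    if SENSITIVE.search(text):
        return Level.SECURITY_CRITICAL  # checked first so it is never downgraded
    code_blocks = text.count(FENCE) // 2
    tokens = estimate_tokens(text)
    if code_blocks >= 2 or tokens > 2000:
        return Level.COMPLEX
    if code_blocks == 1 or tokens >= 500:
        return Level.MODERATE
    return Level.SIMPLE


print(classify_text("What is 2+2?"))             # Level.SIMPLE
print(classify_text("My password is hunter2"))   # Level.SECURITY_CRITICAL
```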

## CLI

```bash
# Routing commands (zero deps)
infershrink classify "What is 2+2?"
# → Complexity: SIMPLE

infershrink classify "Implement a red-black tree with delete rebalancing"
# → Complexity: COMPLEX

infershrink route "Hello world" --model gpt-4o
# → Original: gpt-4o → Routed: gpt-4o-mini ↓ Downgraded

infershrink route "Design a distributed consensus algorithm" --model gpt-4o
# → Original: gpt-4o → Routed: gpt-4o = No change

infershrink route "Hi" --model claude-opus-4-6
# → Original: claude-opus-4-6 → Routed: claude-sonnet ↓ Same-provider downgrade

# Retrieval commands (require `pip install infershrink[retrieval]`)
infershrink index ./docs              # Index files
infershrink query "your question"     # Retrieve + compress
infershrink search "your question"    # Search only (no compression)

# Management
infershrink stats                     # Lifetime cost tracking stats
infershrink clear                     # Clear retrieval index
```

## Configuration

Override any defaults:

```python
import openai
from infershrink import optimize

client = optimize(openai.Client(), config={
    "tiers": {
        "tier1": {"models": ["gpt-4o-mini"], "max_complexity": "SIMPLE"},
        "tier2": {"models": ["gpt-4o"], "max_complexity": "MODERATE"},
        "tier3": {"models": ["claude-opus-4-6"], "max_complexity": "COMPLEX"},
    },
    "compression": {
        "enabled": True,
        "min_tokens": 500,
        "skip_for": ["SECURITY_CRITICAL"],
    },
    "quality_floor": 0.95,
    "cost_tracking": True,
})
```
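
One plausible reading of the `tiers` setting is "use the cheapest tier whose `max_complexity` covers the classified level". The sketch below implements that reading; `pick_model` and `ORDER` are hypothetical names, and the actual resolution logic inside InferShrink may differ.

```python
ORDER = ["SIMPLE", "MODERATE", "COMPLEX"]  # cheapest tier first


def pick_model(tiers, complexity, requested_model):
    """Return the first model of the lowest tier that covers `complexity`."""
    if complexity not in ORDER:  # e.g. SECURITY_CRITICAL: keep the requested model
        return requested_model
    needed = ORDER.index(complexity)
    candidates = sorted(tiers.values(), key=lambda t: ORDER.index(t["max_complexity"]))
    for tier in candidates:
        if ORDER.index(tier["max_complexity"]) >= needed and tier["models"]:
            return tier["models"][0]
    return requested_model  # no tier covers this level


tiers = {
    "tier1": {"models": ["gpt-4o-mini"], "max_complexity": "SIMPLE"},
    "tier2": {"models": ["gpt-4o"], "max_complexity": "MODERATE"},
}

print(pick_model(tiers, "SIMPLE", "gpt-4o"))    # gpt-4o-mini
print(pick_model(tiers, "MODERATE", "gpt-4o"))  # gpt-4o
```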

## Cost Tracking

```python
import openai
from infershrink import optimize

client = optimize(openai.Client())

# ... make some API calls ...

# View savings
print(client.infershrink_tracker.summary())
# InferShrink Session Stats
# ========================================
# Total requests:       42
# Requests downgraded:  38
# Requests compressed:  25
# Original tokens:      156,000
# Compressed tokens:    98,000
# Tokens saved:         58,000 (37.2%)
# Estimated savings:    $2.3400
# ========================================

# Programmatic access
stats = client.infershrink_tracker.stats()
print(f"Saved ${stats.total_estimated_savings_usd:.2f}")
```
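
The token figures in the summary follow from simple arithmetic, reproduced below. The helper names are hypothetical, and the per-million-token prices in the second call are placeholders rather than real provider pricing.

```python
def token_savings(original_tokens, compressed_tokens):
    """Return (tokens saved, percent saved) for a compression step."""
    saved = original_tokens - compressed_tokens
    percent = 100.0 * saved / original_tokens if original_tokens else 0.0
    return saved, percent


def routing_savings_usd(input_tokens, original_price_per_mtok, routed_price_per_mtok):
    """Dollar difference from sending input_tokens to a cheaper model."""
    return input_tokens * (original_price_per_mtok - routed_price_per_mtok) / 1_000_000


saved, percent = token_savings(156_000, 98_000)
print(f"Tokens saved: {saved:,} ({percent:.1f}%)")  # Tokens saved: 58,000 (37.2%)

# Placeholder prices: $2.50 vs $0.15 per million input tokens.
print(f"${routing_savings_usd(1_000_000, 2.50, 0.15):.2f}")  # $2.35
```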

## Anthropic Support

Works the same way:

```python
import anthropic
from infershrink import optimize

client = optimize(anthropic.Anthropic())

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this article..."}],
)
```

## Retrieval Features

The integrated retrieval engine (formerly TokenShrink) provides:

- **FAISS semantic search** — Find relevant chunks from your indexed docs
- **Adaptive compression** — REFRAG-inspired per-chunk compression ratios based on importance
- **Cross-passage deduplication** — Remove near-duplicate chunks via embedding similarity
- **Importance scoring** — Combined similarity + information density scoring

```python
from infershrink import TokenShrink

ts = TokenShrink(
    index_dir=".my-index",
    chunk_size=512,
    adaptive=True,      # REFRAG-inspired adaptive compression
    dedup=True,          # Cross-passage deduplication
)

# Index your docs
ts.index("./docs")

# Query with full scoring
result = ts.query("What are the constraints?", k=5)
print(result.context)          # Retrieved + compressed context
print(result.savings)          # "Saved 65% (2400 → 840 tokens, 1 redundant chunks removed)"
print(result.chunk_scores)     # Per-chunk importance scores
```
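
Cross-passage deduplication by embedding similarity can be illustrated without any ML dependencies. The sketch below is a simplified stand-in, not the library's code: the two-dimensional vectors are made up, the 0.95 threshold is an assumption, and `deduplicate` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(chunks, embeddings, threshold=0.95):
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]


chunks = [
    "Rate limit is 100 req/min.",
    "The rate limit is 100 req/min.",
    "Auth uses API keys.",
]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # toy vectors
print(deduplicate(chunks, embeddings))
```

The second chunk is dropped because its vector points in almost the same direction as the first; the third is kept because it is orthogonal to it.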

## Comparison

| Feature | InferShrink | RouteLLM | Burnwise |
|---------|-------------|----------|----------|
| One-line integration | ✅ | ❌ | ❌ |
| Semantic retrieval (FAISS) | ✅ | ❌ | ❌ |
| Prompt compression (LLMLingua) | ✅ | ❌ | ❌ |
| Adaptive compression | ✅ | ❌ | ❌ |
| Model routing | ✅ | ✅ | ✅ |
| Cost tracking | ✅ | ❌ | ✅ |
| Security-aware | ✅ | ❌ | ❌ |
| Zero dependencies (core) | ✅ | ❌ | ❌ |
| OpenAI + Anthropic | ✅ | ✅ | ❌ |
| CLI | ✅ | ❌ | ❌ |

## API Reference

### `optimize(client, config=None)`

Wrap an OpenAI or Anthropic client with InferShrink optimizations.

- **client** — An `openai.Client()` or `anthropic.Anthropic()` instance
- **config** — Optional dict to override default configuration
- **Returns** — A wrapped client with the same interface

### `classify(messages)`

Classify message complexity without wrapping a client.

```python
from infershrink import classify

result = classify([{"role": "user", "content": "Hello!"}])
print(result.complexity)  # Complexity.SIMPLE
```

### `TokenShrink(index_dir=None, model="all-MiniLM-L6-v2", ...)`

Semantic retrieval + compression engine. Requires `pip install infershrink[retrieval]`.

- `.index(path)` → Index files for retrieval
- `.query(question, k=5)` → Retrieve + compress relevant context → `ShrinkResult`
- `.search(question, k=5)` → Search without compression
- `.stats()` → Index statistics
- `.clear()` → Clear the index

### `Tracker`

Access via `client.infershrink_tracker`:

- `.stats()` → `SessionStats` dataclass
- `.summary()` → Human-readable string
- `.reset()` → Clear all tracked data

## Pricing

InferShrink works **without a license key** — you get all features in dev mode.

| | Dev (no key) | Free | Pro — $19/mo | Team — $49/mo |
|---|---|---|---|---|
| Requests/mo | Unlimited | 1,000 | 50,000 | 500,000 |
| Model routing | ✅ | ✅ | ✅ | ✅ |
| Prompt compression | ✅ | — | ✅ | ✅ |
| FAISS retrieval | ✅ | — | ✅ | ✅ |
| Cost tracking | ✅ | Basic | Full | Full |

```python
# Set your license key via environment variable
# export INFERSHRINK_LICENSE_KEY=ls_live_xxxxx

# Or pass it directly
import openai
from infershrink import optimize

client = optimize(openai.Client(), config={"license_key": "ls_live_xxxxx"})
```
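
If you prefer to build the config explicitly instead of hard-coding the key, the value can be read from the environment yourself. This is a small sketch: `license_config` is a hypothetical helper, and it assumes that a config without `license_key` falls back to the keyless dev mode described above.

```python
import os


def license_config(env=os.environ):
    """Return a config dict that contains license_key only when the variable is set."""
    key = env.get("INFERSHRINK_LICENSE_KEY", "").strip()
    return {"license_key": key} if key else {}


print(license_config({"INFERSHRINK_LICENSE_KEY": "ls_live_xxxxx"}))
print(license_config({}))
```

The result could then be passed as `optimize(openai.Client(), config=license_config())`.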

## License

Apache 2.0 — see [LICENSE](LICENSE).

## Links

- **Source**: [github.com/MusashiMiyamoto1-cloud/infershrink](https://github.com/MusashiMiyamoto1-cloud/infershrink)
- **PyPI**: [pypi.org/project/infershrink](https://pypi.org/project/infershrink/)
