Cut Your AI Costs 50-80%

Stop loading entire files into your prompts. TokenShrink gives your AI only what it needs, compressed.

pip install tokenshrink[compression]
50-80%
Token Reduction
<50ms
Search Latency
~200ms
Compression Time

โŒ Without TokenShrink

  • Load entire file (5000 tokens)
  • $0.15 per query
  • Slow responses
  • Hit context limits
  • Irrelevant info confuses the model

โœ“ With TokenShrink

  • Load relevant chunks (200 tokens)
  • $0.03 per query
  • Fast responses
  • Stay under limits
  • Focused, accurate answers

Simple. Fast. Effective.

๐Ÿ”

Semantic Search

FAISS + MiniLM embeddings find exactly what's relevant to your query.

๐Ÿ—œ๏ธ

Smart Compression

LLMLingua-2 removes redundancy while preserving meaning.

โšก

Drop-in Ready

Works with OpenAI, Anthropic, LangChain, AutoGen โ€” any LLM stack.

๐ŸŽ

Apple Silicon

Native MPS acceleration on Mac. CUDA on NVIDIA.

๐Ÿ“

Auto-Indexing

Detects changed files automatically. No manual maintenance.

๐Ÿ”“

Open Source

MIT licensed. Self-host anywhere. No vendor lock-in.

from tokenshrink import TokenShrink

# Initialize and index your docs
ts = TokenShrink()
ts.index("./docs")

# Get compressed context for your LLM
result = ts.query("What are the API rate limits?")

print(result.context)   # Ready for your prompt
print(result.savings)   # "Saved 65% (1200 โ†’ 420 tokens)"

Works With REFRAG

Meta's REFRAG achieves 30x decode-time speedup by exploiting attention sparsity in RAG contexts. TokenShrink is the upstream complement โ€” we compress what enters the context window before decoding starts.

๐Ÿ“„ Paper ๐Ÿ’ป GitHub

Files โ†’ TokenShrink (50-80% fewer tokens) โ†’ LLM โ†’ REFRAG (30x faster decode)

Stack both for end-to-end savings across retrieval and inference.

Roadmap: REFRAG-Inspired

๐ŸŽฏ

Adaptive Compression

Vary ratio per chunk by information density. Low-value chunks get compressed harder.

๐Ÿ”ฒ

Cross-Passage Dedup

Exploit block-diagonal attention sparsity to remove redundancy across retrieved passages.

โš–๏ธ

Importance Scoring

Score chunks by estimated attention weight. Critical chunks stay full, the rest compress.

Feedback Welcome โ€” Humans & Agents

Using TokenShrink? We want to hear what works, what doesn't, and what you'd build next. Whether you're a developer or an AI agent, your input shapes the roadmap.

๐Ÿ’ฌ Discussions ๐Ÿ“ Give Feedback ๐Ÿ’ก Request Feature

Also from Musashi Labs

๐Ÿ›ก๏ธ Agent Guard

Security scanner for AI agent configurations. 20 rules, A-F scoring, CI/CD ready. Find exposed secrets, injection risks, and misconfigs before they ship.

npx @musashimiyamoto/agent-guard scan .
View Agent Guard โ†’