Metadata-Version: 2.4
Name: torbuquant
Version: 0.1.0
Summary: TurboQuant implementation
Project-URL: Homepage, https://github.com/turboquant/torbuquant
Project-URL: Documentation, https://github.com/turboquant/torbuquant#readme
Project-URL: Repository, https://github.com/turboquant/torbuquant
Project-URL: Issues, https://github.com/turboquant/torbuquant/issues
Keywords: llm,quantization,kv-cache,transformers,pytorch,attention,inference,memory-optimization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: torch>=2.0.0
Provides-Extra: triton
Requires-Dist: triton>=2.0.0; extra == "triton"
Provides-Extra: llm
Requires-Dist: transformers>=4.40.0; extra == "llm"
Requires-Dist: accelerate>=0.20.0; extra == "llm"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: triton>=2.0.0; extra == "all"
Requires-Dist: transformers>=4.40.0; extra == "all"
Requires-Dist: accelerate>=0.20.0; extra == "all"
Requires-Dist: datasets>=2.0.0; extra == "all"
Dynamic: license-file

<p align="center">
  <img src="docs/assets/banner.png" alt="TurboQuant — online vector quantization for KV cache and retrieval" width="720"/>
</p>

<h1 align="center">TurboQuant</h1>

<p align="center">
  <strong>Online vector quantization with near-optimal distortion rate</strong><br/>
</p>

<p align="center">
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10%2B-blue?logo=python&logoColor=white" alt="Python 3.10+"/></a>
  <a href="https://pytorch.org/"><img src="https://img.shields.io/badge/PyTorch-2.0%2B-ee4c2c?logo=pytorch&logoColor=white" alt="PyTorch 2.0+"/></a>
  <img src="https://img.shields.io/badge/license-MIT-green" alt="License MIT"/>
  <img src="https://img.shields.io/badge/tests-passing-brightgreen" alt="Tests passing"/>
  <img src="https://img.shields.io/badge/docs-mermaid%20%7C%20markdown-informational" alt="Docs"/>
</p>
<p align="center">
  <img src="https://badge.fury.io/py/torbuquant.svg" alt="PyPI version"/>
</p>

TorbuQuant implements the algorithms from the [TurboQuant](https://arxiv.org/abs/2504.19874) paper to compress key-value (KV) caches in transformer-based language models. It reduces KV-cache memory by **2.5-4.7x**, and the K4V8 and K4V4 configurations preserve model quality (see Key Results).

## Key Results

| Configuration | Memory Compression | Quality (PPL)             | Throughput Gain                  |
| ------------- | ------------------ | ------------------------- | -------------------------------- |
| K4V8          | 2.56x              | Preserved (8.787 → 8.792) | 1.37-1.68x under memory pressure |
| K4V4          | 3.76x              | Preserved                 | Memory-focused                   |
| K4V2          | 4.74x              | Experimental              | Maximum compression              |

### K4V8, K4V4, K4V2 Naming

The names describe bit-widths for Keys and Values:

```
K4V8 = 4-bit Keys + 8-bit Values
K4V4 = 4-bit Keys + 4-bit Values
K4V2 = 4-bit Keys + 2-bit Values
```
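
As a rough sanity check on the compression figures in the table above, the ideal ratio can be computed from the bit-widths alone. The sketch below assumes a 16-bit baseline for both keys and values and ignores any per-vector metadata such as norms or scales. `ideal_kv_compression` is a hypothetical helper for illustration, not part of the package API.

```python
def ideal_kv_compression(k_bits: int, v_bits: int, baseline_bits: int = 16) -> float:
    """Upper bound on KV compression: baseline bits over quantized bits per K/V element pair."""
    return (2 * baseline_bits) / (k_bits + v_bits)


# Bit-widths taken from the naming scheme above.
for name, (k_bits, v_bits) in {"K4V8": (4, 8), "K4V4": (4, 4), "K4V2": (4, 2)}.items():
    print(f"{name}: {ideal_kv_compression(k_bits, v_bits):.2f}x")
```

The reported 2.56x, 3.76x, and 4.74x are somewhat lower than these ideal values, presumably because the stored cache also carries metadata that this estimate leaves out.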

## Installation

```bash
pip install torbuquant
```

With optional dependencies:

```bash
# For Triton GPU kernels (recommended)
pip install torbuquant[triton]

# For HuggingFace Transformers integration
pip install torbuquant[llm]

# Everything
pip install torbuquant[all]
```

## Quick Start

### Basic Quantization

```python
import torch
from torbuquant import TurboQuantMSE

# Create quantizer
quantizer = TurboQuantMSE(dim=128, bits=4, device="cuda")

# Quantize keys
keys = torch.randn(1, 32, 1024, 128, device="cuda")  # (B, H, N, D)
compressed = quantizer.quantize(keys)

# Reconstruct
reconstructed = quantizer.dequantize(compressed)

print(f"Compression: {keys.numel() * 4 / compressed.indices.numel():.2f}x")
```

### Fused Attention (Recommended)

```python
import torch
from torbuquant import FusedTurboQuantAttention

# Create fused attention module
fused_attn = FusedTurboQuantAttention(
    dim=128,
    k_bits=4,
    v_bits=8,  # K4V8 is the speed-positive mode
    device="cuda",
)

# Quantize KV cache
keys = torch.randn(1, 32, 1024, 128, device="cuda")
values = torch.randn(1, 32, 1024, 128, device="cuda")
compressed_kv = fused_attn.quantize_kv(keys, values)

# Decode attention (single query)
query = torch.randn(1, 32, 1, 128, device="cuda")
output = fused_attn(query, compressed_kv)

# Check memory
mem = compressed_kv.memory_bytes()
print(f"Compressed KV: {mem['total'] / 1024**2:.2f} MiB")
```

### Auto Policy Selection

```python
from torbuquant import AutoPolicyInput, choose_auto_kv_policy

policy = choose_auto_kv_policy(AutoPolicyInput(
    batch_size=8,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    context_tokens=4096,
    available_kv_memory_bytes=512 * 1024**2,  # 512 MiB budget
))

print(policy)
# TurboQuantKVPolicy(
#     backend='turboquant',
#     k_bits=4,
#     v_bits=8,
#     reason='K4V8 fits the KV budget and is the measured throughput-positive mode.'
# )
```

### HuggingFace Transformers Integration

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torbuquant.llm import TurboQuantDynamicCache

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Create TurboQuant cache
cache = TurboQuantDynamicCache(
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.hidden_size // model.config.num_attention_heads,
    key_bits=4,
    value_bits=8,
    key_method="mse",  # Use MSE for quality preservation
    recent_tokens=128,  # Keep recent tokens uncompressed
    device="cuda",
)

# Generate with compressed KV cache
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    past_key_values=cache,
    max_new_tokens=100,
    use_cache=True,
)
print(tokenizer.decode(outputs[0]))
```

## Algorithm Overview

TurboQuant uses a two-stage process:

1. **Random Rotation**: Multiplies each vector by a random orthogonal matrix, which spreads its energy evenly across dimensions. After rotation, each coordinate of a unit-norm vector follows a known Beta distribution, regardless of the input.

2. **Lloyd-Max Quantization**: Applies optimal scalar quantization to each rotated coordinate using pre-computed codebooks.

When unbiased inner-product estimates are needed (for example, in vector search), an optional QJL (Quantized Johnson-Lindenstrauss) correction is applied to the quantization residual.
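
The rotation and scalar quantization stages can be illustrated with a small pure-Python sketch. This is not the package's implementation: it assumes a Gaussian approximation to the coordinate distribution, fits the codebook empirically from samples instead of loading a pre-computed one, and omits the QJL correction. All function names here are hypothetical.

```python
import bisect
import math
import random


def random_orthogonal(dim, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian row vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def lloyd_max(samples, bits, iters=30):
    """Fit a 1-D Lloyd-Max codebook: alternate interval assignment and centroid update."""
    samples = sorted(samples)
    levels = 2 ** bits
    step = len(samples) // levels
    centroids = [samples[i * step + step // 2] for i in range(levels)]  # quantile init
    for _ in range(iters):
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums, counts = [0.0] * levels, [0] * levels
        for s in samples:
            j = bisect.bisect_left(bounds, s)
            sums[j] += s
            counts[j] += 1
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j] for j in range(levels)]
    return centroids


def quantize(x, rotation, centroids):
    """Rotate a unit vector, rescale to roughly unit variance, keep nearest-centroid indices."""
    scale = math.sqrt(len(x))
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return [bisect.bisect_left(bounds, y * scale) for y in matvec(rotation, x)]


def dequantize(indices, rotation, centroids):
    """Look up centroids, undo the rescale, and apply the inverse (transposed) rotation."""
    scale = math.sqrt(len(indices))
    y_hat = [centroids[i] / scale for i in indices]
    inverse = [list(col) for col in zip(*rotation)]
    return matvec(inverse, y_hat)


rng = random.Random(0)
dim, bits = 32, 4
rotation = random_orthogonal(dim, rng)
codebook = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(4000)], bits)

# All energy in one coordinate: a hard case for per-coordinate quantization without rotation.
x = [1.0] + [0.0] * (dim - 1)
indices = quantize(x, rotation, codebook)
x_hat = dequantize(indices, rotation, codebook)
mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
print(f"squared error at {bits} bits: {mse:.4f}")
```

The one-hot input shows why the rotation matters: without it, a single coordinate would carry the whole norm and a shared scalar codebook would fit poorly.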

### MSE Distortion Bounds

| Bits | MSE Constant | Lower Bound | Compression (vs. 16-bit) |
| ---- | ------------ | ----------- | ------------------------ |
| 1    | 0.360        | 0.250       | 16x                      |
| 2    | 0.117        | 0.0625      | 8x                       |
| 3    | 0.030        | 0.0156      | 5.3x                     |
| 4    | 0.009        | 0.0039      | 4x                       |
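
The Lower Bound and Compression columns follow simple closed forms, which the short sketch below reproduces. It assumes the lower bound is `4 ** -bits` (this matches every row of the table) and that compression is measured against a 16-bit baseline. `bound_row` is a hypothetical helper, and the MSE constants are copied from the table, not derived.

```python
# MSE constants as listed in the table above.
MSE_CONSTANT = {1: 0.360, 2: 0.117, 3: 0.030, 4: 0.009}


def bound_row(bits: int, mse: float, baseline_bits: int = 16):
    """Return (lower bound, compression ratio, gap between listed MSE and the bound)."""
    lower_bound = 4.0 ** -bits           # assumed closed form for the Lower Bound column
    compression = baseline_bits / bits   # assumes a 16-bit baseline
    return lower_bound, compression, mse / lower_bound


for bits, mse in MSE_CONSTANT.items():
    lower_bound, compression, gap = bound_row(bits, mse)
    print(f"b={bits}: lower bound={lower_bound:.4f}, "
          f"compression={compression:.1f}x, gap={gap:.2f}x")
```

Across these four bit-widths the listed constants sit between roughly 1.4x and 2.3x above the lower bound.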

## When to Use Each Mode

| Situation                           | Recommended Mode    | Why                           |
| ----------------------------------- | ------------------- | ----------------------------- |
| Single request, enough VRAM         | SDPA/FlashAttention | Fastest per-token             |
| Batch serving under memory pressure | TurboQuant K4V8     | 1.37-1.68x throughput gain    |
| Long context, memory-critical       | TurboQuant K4V4     | Best compression with quality |
| Experimental/research               | TurboQuant K4V2     | Maximum compression           |

## API Reference

### Core Classes

- `TurboQuantMSE`: MSE-optimal quantizer (Algorithm 1)
- `TurboQuantProd`: Unbiased inner-product quantizer (Algorithm 2)
- `FusedTurboQuantAttention`: Fused decode attention with quantized KV
- `TurboQuantDynamicCache`: HuggingFace cache integration

### Policy Functions

- `choose_auto_kv_policy()`: Auto-select SDPA vs TurboQuant
- `choose_kv_policy()`: Manual policy selection by objective
- `estimate_dense_kv_bytes()`: Estimate uncompressed KV size
- `estimate_turboquant_kv_bytes()`: Estimate compressed KV size

## Benchmarks

Measured on an RTX 4090 Laptop GPU with H=32 attention heads and head dimension D=128:

### Single-Request Latency

| Seq Len | SDPA FP16 | K4V8     | K4V4     |
| ------- | --------- | -------- | -------- |
| 2048    | 0.047 ms  | 0.160 ms | 0.263 ms |
| 4096    | 0.084 ms  | 0.308 ms | 0.453 ms |
| 8192    | 0.253 ms  | 0.533 ms | 0.895 ms |

### Fixed-Memory Throughput (512 MiB KV budget)

| Context | SDPA Batch | K4V8 Batch | K4V8 Throughput Gain |
| ------- | ---------- | ---------- | -------------------- |
| 2048    | 16         | 40         | 1.68x                |
| 4096    | 8          | 20         | 1.58x                |
| 8192    | 4          | 10         | 1.37x                |
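
The batch columns above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a single layer, 16-bit dense keys and values, H=32, D=128, and the 2.56x K4V8 ratio from the Key Results table. The helper names are hypothetical and are not the package's `estimate_dense_kv_bytes()` or `estimate_turboquant_kv_bytes()`.

```python
MIB = 1024 ** 2


def dense_kv_bytes_per_sequence(context_tokens, num_heads=32, head_dim=128,
                                bytes_per_element=2, num_layers=1):
    """Bytes for keys plus values of one sequence (the factor 2 covers K and V)."""
    return 2 * num_layers * num_heads * head_dim * context_tokens * bytes_per_element


def max_batch(budget_bytes, context_tokens, compression=1.0):
    """Largest batch whose (optionally compressed) KV cache fits in the budget."""
    per_sequence = dense_kv_bytes_per_sequence(context_tokens) / compression
    return int(budget_bytes // per_sequence)


for context in (2048, 4096, 8192):
    sdpa = max_batch(512 * MIB, context)
    k4v8 = max_batch(512 * MIB, context, compression=2.56)
    print(f"context={context}: SDPA batch={sdpa}, K4V8 batch={k4v8}")
```

Under these assumptions the computed batch sizes match the table. The throughput gain is smaller than the 2.5x batch increase, which is consistent with the higher per-step latency of K4V8 in the single-request table.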

## Citation

```bibtex
@inproceedings{turboquant2026,
    title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
    author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
    booktitle={ICLR},
    year={2026},
    url={https://arxiv.org/abs/2504.19874}
}
```

## License

MIT License. See [LICENSE](LICENSE) for details.

## Contributing

Contributions welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Acknowledgments

This implementation is based on the TurboQuant paper by Zandieh et al. (ICLR 2026).
