Metadata-Version: 2.4
Name: sigularty
Version: 0.1.1
Summary: Production-grade PyTorch model compression: pruning, LRF, clustering, quantization, GPTQ.
Author: sigularty
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Cython
Classifier: License :: Other/Proprietary License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: torchvision>=0.15
Requires-Dist: torchmetrics>=0.11
Requires-Dist: tqdm
Requires-Dist: numpy
Requires-Dist: matplotlib
Requires-Dist: Cython>=3.0
Provides-Extra: pruning
Requires-Dist: torch-pruning>=1.3; extra == "pruning"
Provides-Extra: full
Requires-Dist: torch-pruning>=1.3; extra == "full"
Requires-Dist: onnx; extra == "full"
Requires-Dist: onnxruntime; extra == "full"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# compressiontoolkit

Compress any PyTorch model — smaller, faster, with minimal accuracy loss.

Runs a configurable pipeline of compression techniques on any architecture
(CNNs, Transformers, NLP models, custom models) and produces a compressed
model with a full visual evaluation report.

---

## Install

```bash
pip install compressiontoolkit
```

With structured pruning support:
```bash
pip install "compressiontoolkit[pruning]"
```

Full install (pruning + ONNX export):
```bash
pip install "compressiontoolkit[full]"
```

---

## Quickstart

```python
import torch
from torch.utils.data import DataLoader
from compressiontoolkit import compress, analyze

# Any PyTorch model + any DataLoader
result = compress(model, dataloader)

# Use the compressed model as a drop-in replacement
compressed_model = result.model

print(result)
# CompressionResult(ratio=3.8x, accuracy_retention=96.4%, size=90.6→23.8 MB, speedup=2.1x, cqi=1.94)
```

---

## API Reference

### `compress(model, dataloader, **kwargs) → CompressionResult`

Compresses a model through a configurable pipeline. The original model is
never modified — a deep copy is made internally.

#### Required arguments

| Argument | Type | Description |
|---|---|---|
| `model` | `nn.Module` | Any PyTorch model |
| `dataloader` | `DataLoader` | Yields `(inputs, labels)` batches. Used for calibration, fine-tuning, and evaluation. |

#### Technique flags

Most techniques are enabled by default; `use_pruning` and `use_gptq` are opt-in. Set a flag to `False` to skip that technique.

| Argument | Default | Description |
|---|---|---|
| `use_pruning` | `False` | Structured filter pruning via DepGraph |
| `use_lrf` | `True` | Low-rank factorization (SVD) |
| `use_clustering` | `True` | Weight clustering (k-means) |
| `use_kd_finetune` | `True` | Knowledge distillation fine-tuning |
| `use_quantization` | `True` | Precision reduction (fp16 / INT8) |
| `use_gptq` | `False` | GPTQ INT4 Hessian-corrected quantization |

#### Key hyperparameters

| Argument | Default | Description |
|---|---|---|
| `device` | auto | `'cuda'` or `'cpu'`. Auto-detected. |
| `num_classes` | `10` | Output class count of your model |
| `lrf_epsilon` | `0.5` | Rank ratio for SVD. Lower = more compression. |
| `lrf_adaptive` | `False` | Derive optimal epsilon per-layer from SVD energy |
| `num_clusters` | `16` | k-means cluster count. Try 8, 16, 32. |
| `quant_mode` | `'fp16'` | `'fp16'` (GPU), `'dynamic'` (CPU INT8), `'static'` |
| `pruning_ratio` | `0.3` | Fraction of filters to remove globally |
| `kd_epochs` | `3` | Fine-tuning epochs after compression |
| `kd_temperature` | `4.0` | Distillation temperature |
| `save_report` | `True` | Save a PNG visualization report |
| `report_path` | `'compression_report.png'` | Output path for the report |
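
To get a feel for how `lrf_epsilon` and `num_clusters` trade size for fidelity, the following back-of-the-envelope sketch estimates storage for a single `Linear` weight matrix. It assumes, purely for illustration, that the kept rank is `floor(lrf_epsilon * min(out, in))` and that clustered weights are stored as per-weight indices into a shared codebook. Neither detail is confirmed by this README, and both helper names are hypothetical.

```python
import math

def low_rank_params(out_features: int, in_features: int, epsilon: float) -> tuple[int, int]:
    """Return (dense_params, factorized_params) for a rank-truncated weight matrix.

    Assumption (not confirmed by the toolkit docs): the kept rank is
    epsilon * min(out_features, in_features), rounded down, and at least 1.
    """
    rank = max(1, math.floor(epsilon * min(out_features, in_features)))
    dense = out_features * in_features
    # W (out x in) is replaced by U (out x rank) @ V (rank x in).
    factorized = rank * (out_features + in_features)
    return dense, factorized

def bits_per_clustered_weight(num_clusters: int) -> int:
    """Bits needed to index one of `num_clusters` shared centroid values."""
    return max(1, math.ceil(math.log2(num_clusters)))

dense, factorized = low_rank_params(1024, 1024, epsilon=0.25)
print(dense, factorized)               # 1048576 524288
print(bits_per_clustered_weight(16))   # 4
```

Under this rank assumption, a square matrix only shrinks when the rank ratio is below 0.5, because the two factors together hold `rank * (out + in)` values.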

#### Hyperparameter search

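Set the `find_optimal_*` flags to `True` to run an automatic search before the pipeline runs. The remaining arguments control the search budget and the accuracy constraint.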

| Argument | Default | Description |
|---|---|---|
| `find_optimal_epsilon` | `False` | Search for the best LRF epsilon |
| `find_optimal_pruning` | `False` | Search for the best pruning ratio |
| `epsilon_search_trials` | `15` | Evaluations for epsilon search |
| `pruning_search_trials` | `16` | Evaluations for pruning search |
| `accuracy_drop_threshold` | `5.0` | Max allowed accuracy drop (%) during search |
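
One way to picture such a search: try a fixed number of candidate values and keep the most aggressive one whose accuracy drop stays within `accuracy_drop_threshold`. The sketch below is a hypothetical grid search, not the toolkit's actual strategy, which this README does not describe. `evaluate_accuracy` stands in for a real compress-and-evaluate step, and the drop is treated as percentage points, which is also an assumption.

```python
from typing import Callable, Optional

def search_epsilon(
    evaluate_accuracy: Callable[[float], float],
    baseline_accuracy: float,
    trials: int = 15,
    accuracy_drop_threshold: float = 5.0,
) -> Optional[float]:
    """Return the smallest epsilon on an evenly spaced grid whose accuracy
    drop stays within the threshold, or None if no candidate qualifies."""
    # Candidates from 1/trials up to 1.0; lower epsilon means more compression.
    candidates = [(i + 1) / trials for i in range(trials)]
    for epsilon in candidates:  # ascending, so the first hit is the most aggressive
        drop = baseline_accuracy - evaluate_accuracy(epsilon)
        if drop <= accuracy_drop_threshold:
            return epsilon
    return None

# Toy stand-in (hypothetical): accuracy falls off linearly as epsilon shrinks.
def toy_accuracy(epsilon: float) -> float:
    return 72.0 - 20.0 * (1.0 - epsilon)

best = search_epsilon(toy_accuracy, baseline_accuracy=72.0,
                      trials=10, accuracy_drop_threshold=5.0)
print(best)  # 0.8
```

A real search would replace `toy_accuracy` with a function that compresses a copy of the model at the given setting and measures accuracy on the dataloader, so each trial costs one full evaluation.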

#### Returns: `CompressionResult`

| Field | Type | Description |
|---|---|---|
| `model` | `nn.Module` | Compressed model. Drop-in replacement for the original. |
| `compression_ratio` | `float` | `orig_size / comp_size`. e.g. `3.8` = 3.8× smaller. |
| `accuracy_retention` | `float` | `(compressed_acc / original_acc) × 100`. e.g. `96.4` = retained 96.4% of original accuracy. |
| `size_original_mb` | `float` | Original model size in MB |
| `size_compressed_mb` | `float` | Compressed model size in MB |
| `original_accuracy` | `float` | Original top-1 accuracy % |
| `compressed_accuracy` | `float` | Compressed top-1 accuracy % |
| `original_latency_ms` | `float` | Original mean inference latency (ms) |
| `compressed_latency_ms` | `float` | Compressed mean inference latency (ms) |
| `latency_speedup` | `float` | `original_lat / compressed_lat` |
| `cqi` | `float` | Compression Quality Index. >1.0 = better tradeoff than original. |
| `techniques_applied` | `list[str]` | Techniques used in pipeline order |
| `report_path` | `str` or `None` | Absolute path to saved PNG report |
| `pruning_report` | `dict` or `None` | Per-layer pruning details |
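
The ratio fields follow directly from the formulas in the table. As a minimal worked example, the helpers below recompute them from raw measurements. The sizes come from the Quickstart output; the accuracy and latency numbers are made up for illustration. CQI is left out because this README does not give its formula.

```python
def compression_ratio(size_original_mb: float, size_compressed_mb: float) -> float:
    """orig_size / comp_size, e.g. 3.8 means 3.8x smaller."""
    return size_original_mb / size_compressed_mb

def accuracy_retention(original_accuracy: float, compressed_accuracy: float) -> float:
    """(compressed_acc / original_acc) x 100, as a percentage."""
    return compressed_accuracy / original_accuracy * 100.0

def latency_speedup(original_latency_ms: float, compressed_latency_ms: float) -> float:
    """original_lat / compressed_lat; values above 1.0 mean faster inference."""
    return original_latency_ms / compressed_latency_ms

print(f"{compression_ratio(90.6, 23.8):.1f}x")     # 3.8x
print(f"{accuracy_retention(80.0, 77.12):.1f}%")   # 96.4%
print(f"{latency_speedup(10.5, 5.0):.1f}x")        # 2.1x
```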

---

### `analyze(model, dataloader=None, **kwargs) → AnalysisResult`

Analyses a model and recommends which compression techniques to use.
Does not modify the model.

```python
from compressiontoolkit import analyze

info = analyze(model, dataloader)
print(info)
# AnalysisResult(size=90.6 MB, params=25,557,032, arch=cnn,
#                accuracy=72.14%, recommended=[use_lrf, use_pruning, use_clustering, ...])

# Per-layer signals with analytically derived optimal LRF epsilon
for layer in info.per_layer_signals[:5]:
    print(f"{layer.name:40s}  ε={layer.lrf_epsilon:.3f}  {layer.size_kb:.0f} KB")
```

#### Returns: `AnalysisResult`

| Field | Type | Description |
|---|---|---|
| `size_mb` | `float` | Model size in MB |
| `num_parameters` | `int` | Total trainable parameter count |
| `architecture_type` | `str` | `'cnn'`, `'transformer'`, `'hybrid'`, or `'unknown'` |
| `recommended_techniques` | `list[str]` | Technique flags recommended for this architecture |
| `per_layer_signals` | `list[LayerSignal]` | Per-layer analysis, sorted by size |
| `accuracy` | `float` or `None` | Top-1 accuracy % if dataloader provided |

#### `LayerSignal` fields

| Field | Description |
|---|---|
| `name` | Layer name in the module tree |
| `layer_type` | `'Conv2d'` or `'Linear'` |
| `lrf_epsilon` | Analytically optimal LRF epsilon for this layer (SVD energy threshold) |
| `prunable` | `True` for non-grouped Conv2d layers |
| `size_kb` | Weight tensor size in KB |

---

## Examples

### Compress with default settings

```python
from compressiontoolkit import compress

result = compress(model, dataloader, num_classes=102)
print(f"Compressed {result.size_original_mb:.1f} MB → {result.size_compressed_mb:.1f} MB")
print(f"Accuracy retention: {result.accuracy_retention:.1f}%")
```

### Aggressive compression (all techniques)

```python
result = compress(
    model, dataloader,
    num_classes=102,
    use_pruning=True,
    use_lrf=True,
    use_clustering=True,
    use_kd_finetune=True,
    use_quantization=True,
    use_gptq=True,
    pruning_ratio=0.3,
    lrf_epsilon=0.5,
    num_clusters=16,
    quant_mode='fp16',
    gptq_bits=4,
)
```

### Let the toolkit find optimal hyperparameters automatically

```python
result = compress(
    model, dataloader,
    num_classes=102,
    find_optimal_epsilon=True,
    find_optimal_pruning=True,
    accuracy_drop_threshold=3.0,   # allow max 3% accuracy drop during search
)
```

### Size-only compression (no GPU needed)

```python
result = compress(
    model, dataloader,
    num_classes=102,
    device='cpu',
    use_pruning=False,
    use_lrf=True,
    use_clustering=True,
    use_kd_finetune=False,
    use_quantization=True,
    quant_mode='dynamic',          # INT8, CPU-compatible
)
```

### Analyse before compressing

```python
from compressiontoolkit import analyze, compress

info = analyze(model, dataloader)
print(f"Architecture: {info.architecture_type}")
print(f"Recommended: {info.recommended_techniques}")

# Enable every recommended technique; flags not listed keep their defaults
flags = {t: True for t in info.recommended_techniques}
result = compress(model, dataloader, num_classes=102, **flags)
```

### Save and load the compressed model

```python
import torch

# Save
torch.save(result.model.state_dict(), 'compressed.pth')

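# Load: `model` must already have the compressed architecture. Pruning and
# low-rank factorization change layer shapes, so the original definition
# will generally not accept these weights.
model.load_state_dict(torch.load('compressed.pth', map_location='cpu'))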
```

---

## Compression pipeline

Techniques run in this fixed order when enabled:

```
BatchNorm Fusion → Structured Pruning → Low-Rank Factorization
→ Weight Clustering → KD Fine-tuning → Quantization → GPTQ
```
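
The fixed order combined with the technique flags can be sketched as a small planning helper. This is an illustration only: `planned_stages` is a hypothetical name, the defaults are copied from the technique flags table, and BatchNorm fusion is assumed to always run because no flag for it is documented.

```python
# Stage names in pipeline order, paired with the flag that enables each one.
PIPELINE_ORDER = [
    ("BatchNorm Fusion", None),  # no flag documented; assumed always on
    ("Structured Pruning", "use_pruning"),
    ("Low-Rank Factorization", "use_lrf"),
    ("Weight Clustering", "use_clustering"),
    ("KD Fine-tuning", "use_kd_finetune"),
    ("Quantization", "use_quantization"),
    ("GPTQ", "use_gptq"),
]

# Defaults as listed in the technique flags table.
DEFAULT_FLAGS = {
    "use_pruning": False, "use_lrf": True, "use_clustering": True,
    "use_kd_finetune": True, "use_quantization": True, "use_gptq": False,
}

def planned_stages(**overrides: bool) -> list[str]:
    """Return the stages that would run, in pipeline order, for the given flags."""
    flags = {**DEFAULT_FLAGS, **overrides}
    return [name for name, flag in PIPELINE_ORDER if flag is None or flags[flag]]

print(planned_stages())
print(planned_stages(use_pruning=True, use_kd_finetune=False))
```

The order of keyword arguments never matters here; only the position of each stage in `PIPELINE_ORDER` decides when it runs.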

The visualization report (`compression_report.png`) shows a panel for each
technique that was actually used, plus overall accuracy / size / latency /
parameter count comparisons.

---

## Requirements

- Python 3.10, 3.11, or 3.12
- PyTorch ≥ 2.0
- CUDA GPU recommended for fp16 quantization and faster compression
- Google Colab (free T4) works for all features
