Metadata-Version: 2.4
Name: memscale
Version: 0.3.0
Summary: Drop-in memory optimizer for PyTorch training. Reduce VRAM 40-70% with 2 lines of code.
Author-email: Fauzan Akmal Mukhlas <fauzanakmal2310@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/memscale/memscale
Project-URL: Documentation, https://memscale.dev/docs
Project-URL: Repository, https://github.com/memscale/memscale
Project-URL: Issues, https://github.com/memscale/memscale/issues
Keywords: pytorch,deep-learning,memory-optimization,vram,gpu,llm-training
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Provides-Extra: hf
Requires-Dist: transformers>=4.30.0; extra == "hf"
Provides-Extra: lightning
Requires-Dist: lightning>=2.0.0; extra == "lightning"
Provides-Extra: observability
Requires-Dist: prometheus-client>=0.17.0; extra == "observability"
Requires-Dist: tqdm>=4.65.0; extra == "observability"
Provides-Extra: all
Requires-Dist: memscale[hf,lightning,observability]; extra == "all"
Dynamic: license-file

# MemScale

**Drop-in memory optimizer for PyTorch training. Reduce VRAM 40–70% with 2 lines of code.**

[![PyPI version](https://img.shields.io/pypi/v/memscale.svg)](https://pypi.org/project/memscale/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

---

## The problem

Training large models on GPUs hits a wall: **VRAM**.

- Llama-3 8B with batch size 32 → out of memory on A100 40GB.
- Reduce batch size to 8 → training takes 3× longer.
- Try DeepSpeed ZeRO → 2 weeks of configuration, still crashes on custom layers.

**MemScale solves this.** Wrap your trainer in 2 lines, get 40–70% VRAM reduction, and leave the rest of your code unchanged.

## Quick start

```bash
pip install memscale
```

```python
import memscale
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(per_device_train_batch_size=32),
    train_dataset=dataset,
)

# Add this one line:
trainer = memscale.wrap(trainer)

trainer.train()  # 50% less VRAM, same speed
```

That's it. MemScale automatically:
1. **Profiles** your model's memory usage per layer
2. **Decides** which optimization technique fits each layer best
3. **Applies** activation checkpointing, CPU offloading, or tiling — whichever is optimal
4. **Reports** memory savings and throughput in real time
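
You can sanity-check the reported savings with plain PyTorch's memory counters; a minimal before/after measurement looks like this (`train_one_epoch()` is a placeholder for your own training loop):

```python
import torch

torch.cuda.reset_peak_memory_stats()  # start the peak-VRAM counter from zero

train_one_epoch()  # placeholder for your own training step or loop

# Peak VRAM allocated to tensors during the run, in GiB.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")
```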

## What's new in v0.2

### 🚀 Multi-GPU (DDP) support
MemScale now works seamlessly with DistributedDataParallel:

```python
import memscale
from memscale import wrap_ddp_model

# Wrap with DDP first (if not already)
model = wrap_ddp_model(model)

# Then wrap with MemScale
trainer = memscale.wrap(trainer)
```

MemScale auto-detects distributed training and profiles the underlying model correctly.
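
Under DDP the user's model lives behind the wrapper's `.module` attribute; a minimal sketch of the kind of unwrapping this implies (not MemScale's actual code) is:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def unwrap(model: torch.nn.Module) -> torch.nn.Module:
    # DDP (and DataParallel) wrap the real model, so per-layer profiling
    # has to look at the inner `.module` rather than the wrapper itself.
    if isinstance(model, (DDP, torch.nn.DataParallel)):
        return model.module
    return model
```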

### ⚡ PyTorch Lightning integration
Use MemScale as a Lightning callback:

```python
from lightning import Trainer
from memscale.integrations.lightning import MemScaleLightningCallback

trainer = Trainer(
    callbacks=[MemScaleLightningCallback()],
    max_epochs=10,
)
trainer.fit(model, dataloader)
```

### 🧠 Auto-tuning mode
MemScale learns from your past runs to suggest better configurations:

```python
from memscale import AutoTuner

tuner = AutoTuner()
recommendation = tuner.recommend_config("Llama-3-8B", model_params=8e9)

print(recommendation['recommended_mode'])  # "aggressive"
print(recommendation['predicted_vram_reduction_pct'])  # 62.5%
```

After each training run, record the results and MemScale improves over time.
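
A recording step might look like the sketch below; `record_run` and its argument names are hypothetical placeholders, so check the AutoTuner API for the actual method:

```python
from memscale import AutoTuner

tuner = AutoTuner()

# Hypothetical recording call: the method name and fields are illustrative only.
tuner.record_run(
    model_name="Llama-3-8B",
    model_params=8e9,
    mode="aggressive",
    measured_vram_gb=24.0,
)
```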

### 📊 Better observability
- Progress bars during profiling (showing which layer is being analyzed)
- Utility functions: `format_bytes()`, `get_model_size()`, `get_gpu_memory_info()` (see the sketch after this list)
- Structured logging with rank-aware output (no duplicate logs in multi-GPU)
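
The utilities above are a quick way to inspect a model before training; the import path, signatures, and return formats below are assumptions based on the names:

```python
import torch
from memscale import format_bytes, get_model_size, get_gpu_memory_info

model = torch.nn.Linear(4096, 4096)

# Assumed usage; return formats may differ from what the comments show.
print(format_bytes(3 * 1024**3))   # human-readable size, e.g. "3.0 GB"
print(get_model_size(model))       # parameter/buffer footprint of the model
print(get_gpu_memory_info())       # free/used VRAM on the visible devices
```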

## Benchmarks

Reproducible results on a single A100 40GB:

| Model | Batch Size | Baseline VRAM | MemScale VRAM | Reduction | Throughput |
|-------|-----------|---------------|---------------|-----------|------------|
| Llama-3 8B | 32 | OOM | 24 GB | — | 1.0× |
| Llama-3 8B | 8 (baseline) | 38 GB | — | — | 0.33× (slower) |
| Mistral 7B | 16 | 35 GB | 19 GB | **46%** | 0.97× |
| GPT-2 XL | 8 | 28 GB | 12 GB | **57%** | 0.98× |
| BERT-Large | 64 | 22 GB | 11 GB | **50%** | 1.00× |

Run benchmarks yourself: `python tests/benchmarks/run_benchmark.py`

## How it works

MemScale combines three techniques, choosing the best one per layer:

### 1. Activation Checkpointing
Don't store activations during the forward pass; recompute them during the backward pass.
Best for layers whose activations are large but cheap to recompute (small overhead, large memory savings).
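
This is the same trade that `torch.utils.checkpoint` exposes directly; a standalone illustration, independent of MemScale:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed during
# backward, trading a little extra compute for a lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```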

### 2. CPU Offloading
Move parameters to CPU RAM when not in active use, prefetch back via PCIe.
Uses dedicated CUDA streams to overlap transfer with compute.
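
A simplified illustration of the prefetch pattern (not MemScale's internals): keep the parameter in pinned CPU memory, copy it back on a side stream, and synchronize before use. Requires a CUDA device.

```python
import torch

copy_stream = torch.cuda.Stream()

# Parameter parked in pinned CPU RAM so the host-to-device copy can be async.
cpu_weight = torch.randn(4096, 4096, pin_memory=True)
gpu_weight = torch.empty_like(cpu_weight, device="cuda")

with torch.cuda.stream(copy_stream):
    # Prefetch on a side stream, overlapping the PCIe transfer with whatever
    # the default stream is computing right now.
    gpu_weight.copy_(cpu_weight, non_blocking=True)

# Make the default stream wait for the copy to land before using the weight.
torch.cuda.current_stream().wait_stream(copy_stream)
out = torch.randn(32, 4096, device="cuda") @ gpu_weight
```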

### 3. Activation Tiling
Split large activation tensors into chunks, process sequentially.
Best for attention layers and wide feedforward networks.
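
A standalone sketch of the idea (MemScale picks the tile count automatically):

```python
import torch

proj = torch.nn.Linear(8192, 32768)   # wide feed-forward layer
x = torch.randn(64, 8192)             # large batch of activations

# Process the batch in 4 tiles so only one tile's worth of the wide
# intermediate activation is alive at any moment.
out = torch.cat([proj(tile) for tile in x.chunk(4, dim=0)], dim=0)
```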

The **Decision Engine** picks the best technique for each layer based on:
- Memory footprint (params + activations)
- Compute cost (FLOPs)
- Available CPU RAM
- PCIe bandwidth

You don't configure this. MemScale figures it out.
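
To make the trade-off concrete, here is a toy heuristic in the same spirit; it is not MemScale's actual policy, only an illustration of the quantities being compared:

```python
def choose_technique(act_bytes, param_bytes, flops,
                     free_cpu_bytes, pcie_bytes_per_s, gpu_flops_per_s):
    # Rough cost of each option, in seconds.
    recompute_s = flops / gpu_flops_per_s            # redo the layer's forward pass
    offload_s = 2 * param_bytes / pcie_bytes_per_s   # round-trip the params over PCIe

    if param_bytes < free_cpu_bytes and offload_s < recompute_s:
        return "cpu_offload"      # params fit in RAM and the PCIe trip is cheaper
    if act_bytes > param_bytes:
        return "checkpointing"    # activations dominate, so recompute instead of store
    return "tiling"               # otherwise chunk the oversized tensors
```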

## Usage modes

### HuggingFace Trainer

```python
import memscale
trainer = memscale.wrap(your_hf_trainer)
trainer.train()
```

### PyTorch Lightning

```python
from lightning import Trainer
from memscale.integrations.lightning import MemScaleLightningCallback

trainer = Trainer(
    callbacks=[MemScaleLightningCallback()],
    max_epochs=10,
)
trainer.fit(model, dataloader)
```

### Custom training loop

```python
import memscale

with memscale.optimize(model, optimizer) as ms:
    for batch in dataloader:
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

### Configuration

Most users don't need this. Defaults work for 90% of cases.

```python
from memscale import wrap, Config, OptimizationMode

config = Config(
    mode=OptimizationMode.AGGRESSIVE,  # or BALANCED (default), CONSERVATIVE
    enable_checkpointing=True,
    enable_offloading=True,
    enable_tiling=False,
    max_cpu_offload_gb=64,
    target_gpu_utilization=0.85,
)

trainer = wrap(trainer, config=config)
```

## Compatibility

| Component | Min Version | Tested |
|-----------|------------|--------|
| Python | 3.9 | 3.9, 3.10, 3.11, 3.12 |
| PyTorch | 2.1 | 2.1, 2.2, 2.3, 2.4 |
| CUDA | 11.8 | 11.8, 12.1, 12.4 |
| GPU | Compute capability 7.0+ | V100, A100, H100, RTX 3090/4090 |
| OS | Linux | Ubuntu 20.04, 22.04 |

AMD GPU support (ROCm) is coming in v0.3.

## FAQ

**Q: Does MemScale change my training results?**
No. All techniques are mathematically lossless: results match the baseline within floating-point arithmetic tolerance (differences below 1e-6).

**Q: How does this compare to DeepSpeed?**
DeepSpeed is powerful but requires extensive configuration and has a steep learning curve. MemScale is plug-and-play. For most use cases, MemScale is enough. For large-scale distributed training (1000+ GPUs), use DeepSpeed.

**Q: Will this slow down my training?**
Typical overhead: 0–3% on throughput. Often net faster because larger batch sizes become possible (better GPU utilization).

**Q: What if my model has dynamic control flow?**
MemScale detects dynamic control flow and falls back to empirical profiling, so it works on any model that runs in PyTorch.
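
For intuition on what empirical profiling means here, the sketch below measures per-module activation sizes with forward hooks by actually running the model; it is a general PyTorch pattern, not MemScale's internals:

```python
import torch

def profile_activation_sizes(model: torch.nn.Module, batch: torch.Tensor) -> dict:
    """Measure each module's output size by running one real forward pass."""
    sizes = {}

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                sizes[name] = output.element_size() * output.nelement()
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    model(batch)   # dynamic control flow is fine: we record whatever actually ran
    for handle in handles:
        handle.remove()
    return sizes
```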

**Q: Can I use this with FSDP / DeepSpeed?**
DDP is fully supported as of v0.2. FSDP compatibility is coming in v0.3. DeepSpeed integration is TBD based on user demand.

## Roadmap

- **v0.1** (released Apr 2026): Activation checkpointing, CPU offloading, HuggingFace Trainer
- **v0.2** (released Apr 2026): ✅ PyTorch Lightning, ✅ multi-GPU (DDP), ✅ auto-tuning, ✅ progress bars
- **v0.3** (Q3 2026): FSDP integration, AMD GPU (ROCm), web dashboard, JAX (experimental)
- **v1.0** (Q1 2027): Learned decision policy (RL-trained on customer data), zero-config mode

## Contributing

We love contributions. To get started:
1. Read [CONTRIBUTING.md](CONTRIBUTING.md)
2. Check [open issues](https://github.com/memscale/memscale/issues)
3. Join our [Discord](https://discord.gg/memscale)

## License

Apache 2.0 — see [LICENSE](LICENSE).

## Citation

If you use MemScale in your research, please cite:

```bibtex
@software{memscale2026,
  title={MemScale: Drop-in Memory Optimization for PyTorch Training},
  author={MemScale Team},
  year={2026},
  url={https://github.com/memscale/memscale}
}
```

---

**Built by ML practitioners for ML practitioners.** Questions? Reach us at team@memscale.dev or [Discord](https://discord.gg/memscale).
