Metadata-Version: 2.4
Name: ramjetio
Version: 0.6.6
Summary: Distributed cache system for PyTorch training
Author-email: RAMJET <support@ramjet.io>
License-Expression: LicenseRef-PolyForm-Noncommercial-1.0.0
Project-URL: Homepage, https://ramjet.io
Keywords: distributed,cache,pytorch,deep-learning,machine-learning,training
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: numpy>=1.19.0
Requires-Dist: grpcio>=1.50.0
Requires-Dist: grpcio-tools>=1.50.0
Requires-Dist: protobuf>=4.0.0
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: msgpack>=1.0.0
Requires-Dist: pyyaml>=5.4.0
Requires-Dist: requests>=2.25.0
Requires-Dist: mmh3>=3.0.0
Requires-Dist: diskcache>=5.4.0
Requires-Dist: psutil>=5.8.0
Requires-Dist: boto3>=1.26.0
Requires-Dist: pynvml>=11.0.0
Provides-Extra: gpu
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "test"
Dynamic: license-file

# RAMJET — Distributed Data Cache for PyTorch Training

[![Python](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-1.9+-red.svg)](https://pytorch.org/)
[![PyPI](https://img.shields.io/pypi/v/ramjetio.svg)](https://pypi.org/project/ramjetio/)
[![License](https://img.shields.io/badge/License-PolyForm%20NC-green.svg)](LICENSE)

**RAMJET** accelerates PyTorch distributed training by caching preprocessed data across your cluster. Works with any DDP setup — `torchrun`, DeepSpeed, Accelerate, or custom launchers.

## Why RAMJET?

| Problem | Solution |
|---------|----------|
| Slow data preprocessing | Cache preprocessed samples across nodes |
| Network bottleneck from shared storage | Local SSD cache on each node |
| Repeated data loading across epochs | First epoch populates the cache; later epochs read straight from it |
| No visibility into training | Real-time metrics dashboard |

## Quick Start

### 1. Install

```bash
pip install ramjetio
```
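
The package metadata also declares optional `gpu` and `test` extras; `test` pulls in the pytest toolchain (pytest, pytest-cov, pytest-asyncio):

```bash
pip install "ramjetio[test]"   # test dependencies
pip install "ramjetio[gpu]"    # gpu extra, as declared in the metadata
```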

### 2. Add to Your Training Script

```python
import ramjetio
from torch.utils.data import DataLoader

ramjetio.init()

dataset = ramjetio.CachedDataset(your_dataset)
loader = DataLoader(dataset, batch_size=32)

for batch in loader:
    train_step(batch)
```
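
`CachedDataset` wraps your existing dataset; the win comes when `__getitem__` is expensive (decoding, augmentation, tokenization). A purely illustrative stand-in for that kind of dataset (not part of ramjetio):

```python
import time
from torch.utils.data import Dataset

class SlowDataset(Dataset):
    """Illustrative map-style dataset with costly per-sample work."""
    def __len__(self):
        return 1_000

    def __getitem__(self, idx):
        time.sleep(0.01)  # stand-in for decode + augmentation work
        return idx
```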

### 3. Run

Get your API key from [app.ramjet.io](https://app.ramjet.io) (create a cluster → copy key).

```bash
export RAMJET_API_KEY="your_api_key_here"
python train.py
```

Multi-GPU: `torchrun --nproc_per_node=N train.py`

That's it! Your nodes will appear in the dashboard within seconds.

## How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                     Your Training Cluster                    │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   Node 0    │    │   Node 1    │    │   Node 2    │     │
│  │  ┌───────┐  │    │  ┌───────┐  │    │  ┌───────┐  │     │
│  │  │ Train │  │    │  │ Train │  │    │  │ Train │  │     │
│  │  └───┬───┘  │    │  └───┬───┘  │    │  └───┬───┘  │     │
│  │      │      │    │      │      │    │      │      │     │
│  │  ┌───▼───┐  │    │  ┌───▼───┐  │    │  ┌───▼───┐  │     │
│  │  │RAMJET │◄─┼────┼──┤RAMJET │◄─┼────┼──┤RAMJET │  │     │
│  │  │ Cache │──┼────┼──► Cache │──┼────┼──► Cache │  │     │
│  │  └───────┘  │    │  └───────┘  │    │  └───────┘  │     │
│  │   500GB SSD │    │   500GB SSD │    │   500GB SSD │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│                            │                                 │
│                            ▼                                 │
│                   ┌─────────────────┐                       │
│                   │ RAMJET Dashboard│                       │
│                   │   (Metrics UI)  │                       │
│                   └─────────────────┘                       │
└─────────────────────────────────────────────────────────────┘
```

## Features

- 🚀 **Zero-config caching** — `ramjetio.init()` handles everything
- 📊 **Real-time dashboard** — monitor cache hits, throughput, GPU utilization
- 🔄 **Consistent hashing** — data distributed evenly across nodes
- 💾 **Disk-backed cache** — survives restarts, uses NVMe SSDs efficiently (see the sketch after this list)
- 🔌 **Works with any setup** — torchrun, DeepSpeed, Accelerate, custom launchers
- ☁️ **S3/MinIO integration** — configure data source in dashboard, not in code
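
RAMJET depends on the `diskcache` library; as a rough sketch of what "disk-backed, survives restarts" means, consider the snippet below (the path and size limit are illustrative, not RAMJET internals):

```python
import diskcache

# A size-bounded, persistent key-value store on local SSD. Entries
# outlive the process, so a restarted job reuses what earlier runs cached.
cache = diskcache.Cache("/tmp/ramjet_cache_demo", size_limit=10 * 2**30)  # 10 GB
cache.set("sample:42", b"preprocessed bytes")
assert cache.get("sample:42") == b"preprocessed bytes"
cache.close()
```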

## Integration Examples

See [docs/INTEGRATION.md](docs/INTEGRATION.md) for detailed examples with:
- PyTorch DDP with `torchrun`
- DeepSpeed
- HuggingFace Accelerate (minimal sketch below)
- Custom training loops
- Multi-node clusters
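
As a taste of the pattern, a minimal HuggingFace Accelerate integration might look like this (the `ramjetio` calls are the same ones shown in Quick Start; `your_dataset` and `train_step` are placeholders):

```python
import ramjetio
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
ramjetio.init()  # picks up LOCAL_RANK/RANK/WORLD_SIZE set by `accelerate launch`

dataset = ramjetio.CachedDataset(your_dataset)
loader = accelerator.prepare(DataLoader(dataset, batch_size=32))

for batch in loader:
    train_step(batch)
```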

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `RAMJET_API_KEY` | Your API key (required) | — |
| `RAMJET_CACHE_PATH` | Local cache directory | `/tmp/ramjet_cache` |
| `RAMJET_CACHE_SIZE` | Max cache size | `100GB` |
| `RAMJET_PORT` | Cache server port | `9000` |
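
For example, to put the cache on a dedicated NVMe mount with a larger budget (values are illustrative):

```bash
export RAMJET_API_KEY="your_api_key_here"
export RAMJET_CACHE_PATH="/mnt/nvme/ramjet_cache"
export RAMJET_CACHE_SIZE="500GB"
export RAMJET_PORT=9000
torchrun --nproc_per_node=4 train.py
```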

### Dashboard Settings

Configure in the web dashboard (no code changes needed):
- **Data Source**: S3/MinIO endpoint, bucket, credentials
- **Cache Settings**: TTL, replication factor, eviction policy

## Distributed Training (DDP)

RAMJET automatically detects `torchrun` and DDP environments:

### Single Machine, Multiple GPUs (torchrun)

```bash
# 4 GPUs on one machine
torchrun --nproc_per_node=4 train.py
```

```python
import ramjetio
import torch.distributed as dist

# Only LOCAL_RANK=0 starts the cache server; the other ranks wait for it and share it
ramjetio.init()

# All ranks use the same cache
dataset = ramjetio.CachedDataset(your_dataset)
```
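
The per-node, rank-0 behavior boils down to a pattern like this (a sketch of the idea, not ramjetio's actual internals; `start_cache_server` and `connect_to_local_server` are hypothetical placeholders):

```python
import os
import torch.distributed as dist

def init_sketch(start_cache_server, connect_to_local_server):
    # One cache server per node: only the first local process starts it.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        start_cache_server()
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # other ranks wait until the server is up
    return connect_to_local_server()
```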

### Multi-Node Training

RAMJET auto-detects your cluster manager — no manual configuration needed:

| Environment | How to launch | RAMJET detects it? |
|-------------|--------------|--------------------|
| **SLURM** | `srun python train.py` | ✅ Automatic |
| **Kubernetes** (PyTorchJob) | Managed by operator | ✅ Automatic |
| **DeepSpeed** | `deepspeed --hostfile hosts train.py` | ✅ Automatic |
| **Accelerate** | `accelerate launch train.py` | ✅ Automatic |
| **torchrun** | `torchrun --nproc_per_node=N train.py` | ✅ Automatic |
| **SageMaker** | Configured in SageMaker console | ✅ Automatic |

Each node runs one cache server (on `LOCAL_RANK=0`), and all nodes share data via consistent hashing.
RAMJET reads `LOCAL_RANK`, `RANK`, `WORLD_SIZE` from environment — every major launcher sets these automatically.
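
A sketch of the consistent-hashing idea using `mmh3` (already a dependency of this package); the ring construction is illustrative, not RAMJET's exact scheme:

```python
import bisect
import mmh3

class HashRing:
    """Maps sample keys to cache nodes. Adding or removing a node only
    remaps about 1/N of the keys, which is the point of consistent hashing."""

    def __init__(self, nodes, vnodes=64):
        # Several virtual points per node on the ring give an even spread.
        self.ring = sorted(
            (mmh3.hash(f"{node}#{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    def owner(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.hashes, mmh3.hash(key)) % len(self.ring)
        return self.ring[i][1]

# Every rank builds the same ring, so all ranks agree on which node
# caches a given sample and fetch it from that node's cache server.
ring = HashRing(["node-0", "node-1", "node-2"])
print(ring.owner("sample:12345"))
```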

## CLI Tools

```bash
# Start cache server manually (usually not needed — ramjetio.init() does this)
ramjetio-server --port 9000 --capacity 100GB

# Check cache status
ramjetio-client stats

# Clear cache
ramjetio-client clear
```

## Requirements

- Python 3.8+
- PyTorch 1.9+
- Linux (recommended for production)
- SSD storage for cache (recommended)

## Documentation

- [Integration Guide](docs/INTEGRATION.md) — detailed examples for all frameworks
- [API Reference](docs/API.md) — full API documentation
- [Troubleshooting](docs/TROUBLESHOOTING.md) — common issues and solutions

## License

PolyForm Noncommercial License 1.0.0 — free for personal and non-commercial use.
For commercial licensing, contact licensing@ramjet.io. See [LICENSE](LICENSE) for details.

## Support

- 📧 Email: support@ramjet.io
- 💬 Discord: [discord.gg/ramjet](https://discord.gg/ramjet)
- 📖 Docs: [docs.ramjet.io](https://docs.ramjet.io)
