Metadata-Version: 2.4
Name: ramjetio
Version: 0.5.0
Summary: Distributed cache system for PyTorch training
Author-email: RAMJET <support@ramjet.io>
License-Expression: LicenseRef-PolyForm-Noncommercial-1.0.0
Project-URL: Homepage, https://ramjet.io
Keywords: distributed,cache,pytorch,deep-learning,machine-learning,training
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: numpy>=1.19.0
Requires-Dist: grpcio>=1.50.0
Requires-Dist: grpcio-tools>=1.50.0
Requires-Dist: protobuf>=4.0.0
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: msgpack>=1.0.0
Requires-Dist: pyyaml>=5.4.0
Requires-Dist: requests>=2.25.0
Requires-Dist: mmh3>=3.0.0
Requires-Dist: diskcache>=5.4.0
Requires-Dist: psutil>=5.8.0
Requires-Dist: boto3>=1.26.0
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "test"
Dynamic: license-file

# RAMJET — Distributed Data Cache for PyTorch Training

[![Python](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-1.9+-red.svg)](https://pytorch.org/)
[![PyPI](https://img.shields.io/pypi/v/ramjetio.svg)](https://pypi.org/project/ramjetio/)
[![License](https://img.shields.io/badge/License-PolyForm%20NC-green.svg)](LICENSE)

**RAMJET** accelerates PyTorch distributed training by caching preprocessed data across your cluster. Works with any DDP setup — `torchrun`, DeepSpeed, Accelerate, or custom launchers.

## Why RAMJET?

| Problem | Solution |
|---------|----------|
| Slow data preprocessing | Cache preprocessed samples across nodes |
| Network bottleneck from shared storage | Local SSD cache on each node |
| Repeated data loading across epochs | First epoch populates the cache; later epochs read from it |
| No visibility into training | Real-time metrics dashboard |
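
The first-epoch caching idea in the table can be illustrated with a minimal, framework-free sketch. This is an illustration only — `FirstEpochCache` is a hypothetical class, not part of the ramjetio API — showing how expensive preprocessing can run once per sample while later epochs hit the cache:

```python
class FirstEpochCache:
    """Hypothetical sketch: memoize expensive per-sample preprocessing so the
    first epoch pays the cost and later epochs read from the cache."""

    def __init__(self, load_fn, n):
        self.load_fn = load_fn  # expensive preprocessing, called at most once per index
        self.n = n
        self._cache = {}

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if idx not in self._cache:  # cache miss: preprocess and store
            self._cache[idx] = self.load_fn(idx)
        return self._cache[idx]
```

RAMJET additionally spills the cache to local SSD and shares it across nodes, but the access pattern — pay once, then read — is the same.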

## Quick Start

### 1. Install

```bash
pip install ramjetio
```

### 2. Get API Key

1. Go to [app.ramjet.io](https://app.ramjet.io)
2. Create a cluster
3. Copy your API key

### 3. Add to Your Training Script

```python
import ramjetio
from torch.utils.data import DataLoader

# Initialize RAMJET (connects to dashboard, starts local cache server)
ramjetio.init()

# Wrap your dataset
dataset = ramjetio.CachedDataset(your_dataset)

# Use with DataLoader as usual
loader = DataLoader(dataset, batch_size=32)

for batch in loader:
    train_step(batch)
```

### 4. Run Training

```bash
export RAMJET_API_KEY="your_api_key_here"

# Works with any launcher
torchrun --nproc_per_node=2 train.py
```

That's it! Your nodes will appear in the dashboard within seconds.

## How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                    Your Training Cluster                    │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Node 0    │    │   Node 1    │    │   Node 2    │      │
│  │  ┌───────┐  │    │  ┌───────┐  │    │  ┌───────┐  │      │
│  │  │ Train │  │    │  │ Train │  │    │  │ Train │  │      │
│  │  └───┬───┘  │    │  └───┬───┘  │    │  └───┬───┘  │      │
│  │      │      │    │      │      │    │      │      │      │
│  │  ┌───▼───┐  │    │  ┌───▼───┐  │    │  ┌───▼───┐  │      │
│  │  │RAMJET │◄─┼────┼─►│RAMJET │◄─┼────┼─►│RAMJET │  │      │
│  │  │ Cache │  │    │  │ Cache │  │    │  │ Cache │  │      │
│  │  └───────┘  │    │  └───────┘  │    │  └───────┘  │      │
│  │  500GB SSD  │    │  500GB SSD  │    │  500GB SSD  │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                            │                                │
│                            ▼                                │
│                   ┌──────────────────┐                      │
│                   │ RAMJET Dashboard │                      │
│                   │   (Metrics UI)   │                      │
│                   └──────────────────┘                      │
└─────────────────────────────────────────────────────────────┘
```

## Features

- 🚀 **Zero-config caching** — `ramjetio.init()` handles everything
- 📊 **Real-time dashboard** — monitor cache hits, throughput, GPU utilization
- 🔄 **Consistent hashing** — data distributed evenly across nodes
- 💾 **Disk-backed cache** — survives restarts, uses NVMe SSDs efficiently
- 🔌 **Works with any setup** — torchrun, DeepSpeed, Accelerate, custom launchers
- ☁️ **S3/MinIO integration** — configure data source in dashboard, not in code
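
The consistent-hashing feature can be sketched as follows. This is an illustration only, not ramjetio's implementation: the package depends on `mmh3`, but the sketch uses stdlib `hashlib` to stay self-contained, and the `HashRing` class is hypothetical. Each node gets several virtual points on a ring, and a key maps to the first node clockwise from its hash, so adding or removing a node only remaps the keys that node owned:

```python
import bisect
import hashlib


def _ring_hash(key: str) -> int:
    # Stable hash for ring placement (md5 here; a real system might use mmh3).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node contributes `vnodes` points, smoothing out the distribution.
        self._points = sorted(
            (_ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect(self._keys, _ring_hash(key)) % len(self._points)
        return self._points[idx][1]
```

The useful property: removing a node from the ring only moves the keys that node owned; every other key keeps its owner, so most of the cluster's cache stays warm.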

## Integration Examples

See [docs/INTEGRATION.md](docs/INTEGRATION.md) for detailed examples with:
- PyTorch DDP with `torchrun`
- DeepSpeed
- HuggingFace Accelerate
- Custom training loops
- Multi-node clusters

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `RAMJET_API_KEY` | Your API key (required) | — |
| `RAMJET_CACHE_PATH` | Local cache directory | `/tmp/ramjet_cache` |
| `RAMJET_CACHE_SIZE` | Max cache size | `100GB` |
| `RAMJET_PORT` | Cache server port | `9000` |
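
A sketch of how a client might consume this table — hypothetical helper functions, not the ramjetio internals; the real parser's accepted size formats may differ — reading each variable with its documented default and converting the `100GB`-style size string to bytes:

```python
import os
import re

_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text: str) -> int:
    """Parse sizes like '100GB' into bytes. Bare integers are bytes."""
    m = re.fullmatch(r"\s*(\d+)\s*(KB|MB|GB|TB)?\s*", text, re.IGNORECASE)
    if not m:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = int(m.group(1)), (m.group(2) or "").upper()
    return value * _UNITS.get(unit, 1)


def load_config(env=os.environ):
    """Read the documented variables, failing fast on the required API key."""
    if "RAMJET_API_KEY" not in env:
        raise RuntimeError("RAMJET_API_KEY is required")
    return {
        "api_key": env["RAMJET_API_KEY"],
        "cache_path": env.get("RAMJET_CACHE_PATH", "/tmp/ramjet_cache"),
        "cache_size": parse_size(env.get("RAMJET_CACHE_SIZE", "100GB")),
        "port": int(env.get("RAMJET_PORT", "9000")),
    }
```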

### Dashboard Settings

Configure in the web dashboard (no code changes needed):
- **Data Source**: S3/MinIO endpoint, bucket, credentials
- **Cache Settings**: TTL, replication factor, eviction policy

## Distributed Training (DDP)

RAMJET automatically detects `torchrun` and DDP environments:

### Single Machine, Multiple GPUs (torchrun)

```bash
# 4 GPUs on one machine
torchrun --nproc_per_node=4 train.py
```

```python
import ramjetio

# Only LOCAL_RANK=0 starts cache server - others wait and share it
ramjetio.init()

# All ranks use the same cache
dataset = ramjetio.CachedDataset(your_dataset)
```

### Multi-Node Training

```bash
# Node 0 (master)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=4 \
         --master_addr=node0 --master_port=29500 train.py

# Node 1
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=4 \
         --master_addr=node0 --master_port=29500 train.py
```

Each node runs one cache server (started by the `LOCAL_RANK=0` process), and all nodes share data via consistent hashing.

### Separate Processes per Rank

```bash
# If not using torchrun, set env vars manually:
export RANK=0 WORLD_SIZE=4 LOCAL_RANK=0
python train.py

# On another terminal/machine:
export RANK=1 WORLD_SIZE=4 LOCAL_RANK=0  
python train.py
```

RAMJET reads `LOCAL_RANK`, `RANK`, `WORLD_SIZE` from environment to coordinate.
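
This coordination can be sketched with a small helper (hypothetical function, not the ramjetio API). `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`; the defaults below cover a plain single-process `python train.py` run, and the `LOCAL_RANK=0` process is elected to own the node's cache server:

```python
import os


def detect_ranks(env=os.environ):
    """Read torchrun-style rank variables, defaulting to a single-process run."""
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        # One cache server per machine: the LOCAL_RANK=0 process owns it;
        # other local ranks connect to it instead of starting their own.
        "is_cache_owner": local_rank == 0,
    }
```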

## CLI Tools

```bash
# Start cache server manually (usually not needed — ramjetio.init() does this)
ramjetio-server --port 9000 --capacity 100GB

# Check cache status
ramjetio-client stats

# Clear cache
ramjetio-client clear
```

## Requirements

- Python 3.8+
- PyTorch 1.9+
- Linux (recommended for production)
- SSD storage for cache (recommended)

## Documentation

- [Integration Guide](docs/INTEGRATION.md) — detailed examples for all frameworks
- [API Reference](docs/API.md) — full API documentation
- [Troubleshooting](docs/TROUBLESHOOTING.md) — common issues and solutions

## License

PolyForm Noncommercial License 1.0.0 — free for personal and non-commercial use.
For commercial licensing, contact licensing@ramjet.dev. See [LICENSE](LICENSE) for details.

## Support

- 📧 Email: support@ramjet.io
- 💬 Discord: [discord.gg/ramjet](https://discord.gg/ramjet)
- 📖 Docs: [docs.ramjet.io](https://docs.ramjet.io)
