Metadata-Version: 2.4
Name: dynabatch
Version: 0.1.6
Summary: PyTorch DataLoader with dynamic batch sizing guided by a pre-trained GPU memory classifier.
Author: Bendang
License-Expression: MIT
Project-URL: Repository, https://github.com/bendangnuksung/dynabatch
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: scikit-learn
Requires-Dist: numpy
Requires-Dist: pandas
Dynamic: license-file

# dynabatch

A drop-in replacement for PyTorch's `DataLoader` that reduces padding waste: a pre-trained classifier sizes each batch dynamically so that GPU memory usage stays within a safe margin.

## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

```python
from dynabatch import build_dynamic_batch_dataloader

dataloader = build_dynamic_batch_dataloader(
    texts=your_texts,
    tokenizer=your_tokenizer,
    batch_size=64,               # minimum batch size (used for the first, hardest batch)
    max_input_token_length=512,  # hard truncation limit per sequence
)

for batch in dataloader:
    outputs = model.generate(**batch)
```

## How It Works

### The Problem

A standard `DataLoader` with a fixed batch size pads each batch to its longest sequence (or to a fixed maximum length). If your longest text is 512 tokens but most are 20-50 tokens, any batch that contains a long text spends most of its GPU compute on padding. Fixed-size batches also risk OOM on long-sequence batches and underutilise the GPU on short-sequence batches.
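To make the padding cost concrete, here is a toy calculation. It is a sketch only: `padding_fraction` and the sample lengths are hypothetical and not part of dynabatch. It compares fixed-size batches built in arrival order with the same batches built after sorting by length, which is the approach dynabatch takes.

```python
def padding_fraction(lengths, batch_size):
    """Fraction of token slots that are padding when every batch is
    padded to its own longest sequence."""
    real_tokens = sum(lengths)
    padded_slots = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_slots += max(batch) * len(batch)
    return 1 - real_tokens / padded_slots


# Hypothetical token lengths: two long texts scattered among short ones.
lengths = [512, 30, 25, 40, 500, 35, 45, 50]

# Arrival order: each long text drags a short one up to ~500 tokens.
unsorted_waste = padding_fraction(lengths, batch_size=2)

# Longest first: the long texts share a batch, short texts pad very little.
sorted_waste = padding_fraction(sorted(lengths, reverse=True), batch_size=2)

print(f"arrival order: {unsorted_waste:.1%} padding")
print(f"sorted:        {sorted_waste:.1%} padding")
```

On this toy input, sorting cuts the padded share from roughly 44% to roughly 2%. Real savings depend on the length distribution of your data.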

### The Solution

dynabatch sorts sequences by length (longest first) and uses a **pre-trained classifier** to decide how many sequences to pack into each batch, so that predicted GPU memory stays below ~90% of the first batch's peak allocation.

#### Step-by-step

1. **Tokenize and sort**: All input texts are tokenized and sorted by token length in descending order.

2. **First batch as baseline**: The first batch uses exactly `batch_size` items. These are the longest sequences, so it is the hardest batch per item. If your GPU handles this batch, its peak memory becomes the baseline against which every subsequent batch is sized.

3. **Classifier-guided scaling**: For each subsequent batch, the system generates candidate batch sizes (from `1x` to `6x` the base `batch_size`) and asks the classifier: *"what is the probability that this batch configuration will cause a GPU memory spike?"* It picks the largest candidate whose spike probability stays below `threshold` (default 2.5%).
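Step 3 can be sketched as follows. This is a minimal illustration, not the library's code. It assumes the candidates are evenly spaced multipliers of `batch_size`, and that the smallest candidate is used when none passes the threshold. The helper names and the toy probability function are hypothetical; in dynabatch the probability comes from the pre-trained classifier.

```python
def candidate_batch_sizes(batch_size, start=1.0, end=6.0, steps=50):
    """Distinct integer batch sizes from start*batch_size to end*batch_size.
    Even spacing of the multipliers is an assumption of this sketch."""
    if steps < 2:
        return [max(1, round(batch_size * start))]
    step = (end - start) / (steps - 1)
    sizes = {max(1, round(batch_size * (start + i * step))) for i in range(steps)}
    return sorted(sizes)


def pick_batch_size(candidates, spike_probability, threshold=0.025):
    """Largest candidate whose spike probability is below the threshold.
    Falling back to the smallest candidate is an assumed policy."""
    safe = [size for size in candidates if spike_probability(size) < threshold]
    return max(safe) if safe else min(candidates)


def toy_spike_probability(size):
    """Hypothetical stand-in for the classifier: risk grows with batch size."""
    return min(1.0, size / 10_000)


candidates = candidate_batch_sizes(batch_size=64)
chosen = pick_batch_size(candidates, toy_spike_probability)
print(f"{len(candidates)} candidates between {candidates[0]} and {candidates[-1]}")
print(f"chosen batch size: {chosen}")
```

Picking the largest passing candidate is what makes later batches grow as sequences get shorter, since shorter sequences lower the predicted spike probability for a given size.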

### The Classifier

The classifier is a `HistGradientBoostingClassifier` (scikit-learn) trained on real GPU memory profiles. It was trained as follows:

#### 1. Data Collection (`train_classifier/generate_training_data.py`)

Training data is generated by running actual inference across multiple models (NLLB-600M, ALMA-7B, MarianNMT, etc.) with different configurations:

- Multiple `batch_size` values: 1, 4, 8, 64, 128, 256, 512
- Multiple `max_input_length` values: 128, 256, 512
- Text data augmented to cover a wide distribution of sequence lengths (1-500 words)

For each run, sequences are sorted by length and processed with progressively increasing batch sizes. At every step the script records:
- **GPU peak memory usage** (via `torch.cuda.max_memory_allocated`)
- Batch size, token counts, padding counts, and the top token length in the batch

Data collection is repeated across different models so that the classifier does not depend on any single model.

#### 2. Feature Engineering (`train_classifier/train_classifier.py`)

Each data point captures the relationship between the **first batch** (baseline) and the **current batch**:

| Feature | Description |
|---|---|
| `max_input_length` | Hard truncation limit |
| `token_size_x` / `token_size_y` | Longest token length in the first / current batch |
| `token_size_diff` | `token_size_y / token_size_x` |
| `batch_size_x` / `batch_size_y` | First / current batch size |
| `batch_size_diff` | `batch_size_y / batch_size_x` |
| `total_tokens_x` / `total_tokens_y` | Total tokens in first / current batch |
| `total_token_size_diff` | `total_tokens_y / total_tokens_x` |
| `paddings_x` / `paddings_y` | Total padding tokens in first / current batch |
| `total_paddings_diff` | `paddings_y / paddings_x` |

The target label is binary: **1** if `gpu_memory / first_batch_peak_gpu > 0.90` (spike), **0** otherwise.
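The feature table and label rule can be sketched as below. This is an illustration under stated assumptions, not the contents of `train_classifier/train_classifier.py`: the function names, the dict layout of the batch statistics, and the zero-baseline convention in `safe_ratio` are all hypothetical.

```python
def safe_ratio(current, first):
    """current / first; returns 0.0 for a zero baseline (assumed convention)."""
    return current / first if first else 0.0


def build_features(first, current, max_input_length):
    """Build one feature row from first-batch (x) and current-batch (y) stats.
    Each stats dict holds token_size, batch_size, total_tokens, paddings."""
    return {
        "max_input_length": max_input_length,
        "token_size_x": first["token_size"],
        "token_size_y": current["token_size"],
        "token_size_diff": safe_ratio(current["token_size"], first["token_size"]),
        "batch_size_x": first["batch_size"],
        "batch_size_y": current["batch_size"],
        "batch_size_diff": safe_ratio(current["batch_size"], first["batch_size"]),
        "total_tokens_x": first["total_tokens"],
        "total_tokens_y": current["total_tokens"],
        "total_token_size_diff": safe_ratio(current["total_tokens"], first["total_tokens"]),
        "paddings_x": first["paddings"],
        "paddings_y": current["paddings"],
        "total_paddings_diff": safe_ratio(current["paddings"], first["paddings"]),
    }


def spike_label(gpu_memory, first_batch_peak_gpu, cutoff=0.90):
    """1 if the batch used more than `cutoff` of the first batch's peak, else 0."""
    return int(gpu_memory / first_batch_peak_gpu > cutoff)


# Hypothetical batch statistics for illustration only.
first = {"token_size": 512, "batch_size": 64, "total_tokens": 30000, "paddings": 2768}
current = {"token_size": 128, "batch_size": 192, "total_tokens": 22000, "paddings": 2576}

row = build_features(first, current, max_input_length=512)
label = spike_label(gpu_memory=9.5, first_batch_peak_gpu=10.0)
print(row["token_size_diff"], row["batch_size_diff"], label)
```

Note that the `*_diff` features are ratios, not differences, which matches the formulas in the table.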

#### 3. Why It Generalises

Because the key features are **ratios relative to the first batch** rather than only absolute values, the classifier learns patterns that transfer across models and hardware. It doesn't need to know your specific GPU's VRAM or your model's parameter count. It only needs to know how the current batch compares to the first batch, which serves as an empirical calibration point.

### Runtime Behaviour

The `threshold` parameter controls conservatism:
- **Lower threshold** (e.g. 0.01): very conservative, less risk of OOM, smaller batches and lower throughput
- **Higher threshold** (e.g. 0.10): more aggressive packing, higher throughput, slightly more OOM risk

What the job feels like at runtime:
- **Early batches**: slow, few items, long sequences
- **Later batches**: progressively faster, more items, shorter sequences
- The job naturally accelerates as it runs

## API

### `build_dynamic_batch_dataloader`

```python
build_dynamic_batch_dataloader(
    texts: list[str],
    tokenizer: PreTrainedTokenizerBase,
    batch_size: int,
    max_input_token_length: int,
    threshold: float = 0.025,
    shuffle: bool = False,
    num_workers: int = 4,
    batch_start_range: float = 1.0,
    batch_end_range: float = 6.0,
    steps: int = 50,
    **tokenizer_kwargs,
) -> DataLoader
```

| Parameter | Description |
|---|---|
| `texts` | Raw input strings of any length |
| `tokenizer` | Any HuggingFace tokenizer |
| `batch_size` | Minimum batch size, used for the first (hardest) batch. The classifier scales up from this for shorter sequences |
| `max_input_token_length` | Hard truncation limit per sequence |
| `threshold` | Max spike probability tolerated per candidate batch size (default 0.025 = 2.5%) |
| `shuffle` | Shuffle the order of pre-built batches (sequences within a batch remain length-similar) |
| `num_workers` | Parallel data loading workers |
| `batch_start_range` | Lower multiplier bound for candidate batch sizes relative to `batch_size` (default 1.0) |
| `batch_end_range` | Upper multiplier bound for candidate batch sizes relative to `batch_size` (default 6.0) |
| `steps` | Number of candidate batch sizes to evaluate (default 50) |

Returns a `DataLoader` yielding dicts with `input_ids`, `attention_mask`, `texts`, and any other keys from your tokenizer. The tokenizer outputs are PyTorch tensors.

## Retraining the Classifier

If you want to retrain the classifier for your specific hardware or models:

```bash
# 1. Generate training data (run multiple times with different models)
python train_classifier/generate_training_data.py

# 2. Train the classifier
python train_classifier/train_classifier.py
```

Copy the output `classifier.pkl` into `dynabatch/models/` to use it at runtime.
