Metadata-Version: 2.4
Name: simgen-vla
Version: 2.8.0
Summary: VLA: Zero-Error GPU Arithmetic for PyTorch. Exact results, every calculation.
Home-page: https://simgen.dev
Author: Clouthier Simulation Labs
Author-email: Clouthier Simulation Labs <kyle@simgen.dev>
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://simgen.dev
Project-URL: Documentation, https://simgen.dev/docs
Project-URL: Repository, https://github.com/clouthier-simulation-labs/simgen
Keywords: exact-arithmetic,GPU,precision,lossless,scientific-computing,machine-learning,deep-learning,simulation,finance,HPC,cuda,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Physics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: cupy-cuda12x>=12.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: mpmath; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# SimGen VLA

**Zero-Error GPU Arithmetic for PyTorch**

VLA eliminates floating-point accumulation errors in GPU computations. Every sum, every matrix multiply, every gradient update - mathematically exact.

## Installation

```bash
pip install simgen-vla
```

**Requirements:**
- Python 3.10+
- PyTorch 2.0+
- CUDA GPU, Pascal (sm_60) through Hopper (sm_90); see Supported GPUs below
- CuPy for CUDA 12.x (`cupy-cuda12x`, installed automatically)

## Quick Start

```python
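import torch
from simgen import vla

# Example inputs (placeholders; substitute your own tensors and model)
tensor = torch.randn(1_000_000, device='cuda')
a = torch.randn(256, 256, device='cuda')
b = torch.randn(256, 256, device='cuda')
logits = torch.randn(8, 10, device='cuda')
model = torch.nn.Linear(10, 10).cuda()

# Exact operations
result = vla.sum(tensor)       # Zero accumulation error
result = vla.matmul(a, b)      # Exact matrix multiply
result = vla.softmax(logits)   # Numerically stable

# Optimizer that doesn't drift
optimizer = vla.AdamW(model.parameters(), lr=1e-3)

# Enable globally
vla.enable()
torch.sum(tensor)  # Now uses VLA automatically
vla.disable()

# System info
vla.info()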
```

## Why VLA?

Standard floating-point arithmetic loses precision with every operation:

```python
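import torch
from simgen import vla

# The problem: in FP32, 1e8 + 1.0 rounds back to 1e8, so the 1.0 is lost
x = torch.tensor([1e8, 1.0, -1e8, 1e-8, 1e-8], device='cuda')
print(x.sum())  # ~0 (exact value depends on reduction order); the true sum is 1.00000002

# VLA gets it right
print(vla.sum(x))  # 1.00000002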
```

| Operation | Standard PyTorch Error | VLA Error |
|-----------|---------------|-----------|
| Sum 1M values | ~10⁻¹⁰ | **0** |
| MatMul 1024x1024 | ~10⁻⁷ | **< 10⁻¹⁵** |
| 1000 Adam steps | Drift | **Exact** |

### Real-World Impact

- **Training stability**: No gradient drift over long runs
- **Reproducibility**: Same bits on any GPU (RTX 4070 = A100 = H100)
- **Financial accuracy**: $0.0001 × 1M transactions = $100 (not lost)
- **Scientific precision**: ODE integration without energy drift
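
To make the financial bullet concrete, here is a small CPU-only sketch of a running FP32 total. It is an assumption-laden illustration, not a VLA benchmark: FP32 is emulated by rounding after each addition, the helper names are hypothetical, and a strictly sequential total is close to the worst case (parallel GPU reductions typically lose less).

```python
import struct
from fractions import Fraction

_F32 = struct.Struct("f")

def to_f32(value: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return _F32.unpack(_F32.pack(value))[0]

def accumulate_f32(amount: float, count: int) -> float:
    """Add `amount` to a running FP32 total `count` times."""
    step = to_f32(amount)
    total = 0.0
    for _ in range(count):
        total = to_f32(total + step)
    return total

# One million transactions of $0.0001 each.
fp32_total = accumulate_f32(0.0001, 1_000_000)   # drifts visibly away from 100
exact_total = Fraction("0.0001") * 1_000_000     # exactly 100

print(f"FP32 running total: {fp32_total:.4f}")
print(f"Exact total:        {float(exact_total):.4f}")
```

Once the running total is large, each $0.0001 increment is rounded to the nearest representable FP32 step, and those per-addition rounding errors add up to a visible shortfall.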

## Operations (55 Kernels)

### Reductions
`sum`, `mean`, `var`, `std`, `norm`, `dot`, `prod`, `cumsum`, `logsumexp`, `min`, `max`, `argmin`, `argmax`

### Matrix Operations
`matmul`, `mm`, `bmm`, `linear`, `einsum`

### Activations
`softmax`, `log_softmax`, `relu`, `gelu`, `silu`, `sigmoid`, `tanh`, `leaky_relu`

### Normalization
`layer_norm`, `rms_norm`, `batch_norm`, `group_norm`

### Loss Functions
`cross_entropy`, `mse_loss`

### Element-wise
`add`, `sub`, `mul`, `div`, `neg`, `abs`, `exp`, `log`, `sqrt`, `rsqrt`, `pow`, `clamp`

### Advanced
`scaled_dot_product_attention`, `conv2d`, `embedding`, `dropout`

### Optimizers
`AdamW`, `SGD` (FP64 state, no drift over thousands of steps)

## Usage Patterns

### 1. Direct Functions
```python
from simgen import vla

loss = vla.cross_entropy(logits, targets)
normalized = vla.layer_norm(x, weight, bias)
attn = vla.scaled_dot_product_attention(q, k, v)
```

### 2. Global Patching
```python
vla.enable()
# torch.sum, torch.matmul, and the other patched functions now use VLA
output = model(x)  # run your forward pass or training loop here
vla.disable()
```

### 3. Context Manager
```python
with vla.mode():
    output = model(x)
    loss = criterion(output, y)
```
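
Patterns 2 and 3 differ mainly in who is responsible for undoing the patch. The sketch below shows the general Python pattern of swapping a function and restoring it with `try`/`finally`. It is not VLA's implementation: `backend` is a stand-in for a library namespace such as `torch`, and `exact_sum` and `patched_sum` are hypothetical names.

```python
import contextlib
import math
import types

# Stand-in for a library namespace such as `torch` (hypothetical).
backend = types.SimpleNamespace(sum=sum)

def exact_sum(values):
    """Replacement reduction: math.fsum tracks exact partial sums."""
    return math.fsum(values)

@contextlib.contextmanager
def patched_sum(namespace):
    """Swap namespace.sum for exact_sum, restoring the original on exit."""
    original = namespace.sum
    namespace.sum = exact_sum
    try:
        yield
    finally:
        namespace.sum = original  # runs even if the body raises

data = [0.1] * 10
with patched_sum(backend):
    inside = backend.sum(data)   # routed through exact_sum
outside = backend.sum(data)      # original built-in sum again

print(inside)  # 1.0
```

With a manual enable/disable pair, an exception between the two calls leaves the patch active. The context-manager form avoids that.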

### 4. Exact Optimizer
```python
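# FP64 state prevents momentum/variance drift
optimizer = vla.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(1000):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass, then scalar loss
    loss.backward()
    optimizer.step()               # update uses FP64 optimizer state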
```
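
For readers unfamiliar with what "optimizer state" means here, the following is a textbook AdamW update for a single scalar parameter, written in plain Python. It is a sketch of the published algorithm, not of `vla.AdamW`. The function name and signature are invented for this example. Python floats are FP64, so the moment estimates `m` and `v` below are held in FP64 between steps.

```python
import math

def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One textbook AdamW update for a single scalar parameter."""
    param = param * (1.0 - lr * weight_decay)       # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad     # second moment (variance)
    m_hat = m / (1.0 - beta1 ** step)               # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Three steps with a constant gradient; m and v carry over between steps.
param, m, v = 1.0, 0.0, 0.0
for step in range(1, 4):
    param, m, v = adamw_step(param, grad=0.5, m=m, v=v, step=step)

print(param, m, v)
```

Because `m` and `v` are exponential moving averages updated at every step, rounding error in them compounds over long runs, which is the motivation for keeping them in higher precision.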

## Supported GPUs

| Architecture | GPUs | Status |
|--------------|------|--------|
| sm_60 (Pascal) | P100 | ✓ |
| sm_61 (Pascal) | GTX 1050, 1060, 1080 | ✓ |
| sm_70 (Volta) | V100 | ✓ |
| sm_75 (Turing) | T4, RTX 2080 | ✓ |
| sm_80 (Ampere) | A100 | ✓ |
| sm_86 (Ampere) | A10, RTX 3060/3070/3090 | ✓ |
| sm_89 (Ada) | RTX 4070/4080/4090 | ✓ |
| sm_90 (Hopper) | H100 | ✓ |
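
To check a device against the table, compare its compute capability with the listed `sm_` values. The helper below is a hypothetical convenience written for this README, not part of the `simgen` package. It encodes only the architectures listed above.

```python
# Compute capability (major, minor) -> architecture, as listed in the table above.
SUPPORTED_ARCHS = {
    (6, 0): "Pascal",
    (6, 1): "Pascal",
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

def describe_capability(capability):
    """Return a short message for a (major, minor) compute capability."""
    major, minor = capability
    name = SUPPORTED_ARCHS.get((major, minor))
    if name is None:
        return f"sm_{major}{minor}: not listed as supported"
    return f"sm_{major}{minor} ({name}): listed as supported"

print(describe_capability((8, 9)))
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed to this function.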

## Benchmarks

In the measurements below, VLA adds 1.2x to 1.4x overhead while providing exact results:

| Operation | Standard | VLA | Overhead |
|-----------|----------|-----|----------|
| Sum (1M) | 0.12ms | 0.15ms | 1.25x |
| MatMul (1024²) | 0.8ms | 1.1ms | 1.4x |
| Softmax (batch) | 0.05ms | 0.06ms | 1.2x |

*Benchmarks on RTX 4070. VLA uses FP64 accumulation with error tracking.*

## How It Works

VLA uses proprietary multi-precision accumulation to capture all rounding errors during computation. The result is mathematically exact to the precision of the input.

**Key innovations:**
- Proprietary precision-preserving arithmetic
- Multi-level error capture for reductions
- FP64 optimizer state for training stability
- Precompiled CUDA kernels for each GPU architecture
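
The kernels themselves are proprietary and are not described here. For intuition about what "capturing rounding errors" can mean, the sketch below shows a well-known public technique, Knuth's TwoSum error-free transformation, in plain Python (FP64). It is background only and makes no claim about VLA's method. The function names are chosen for this example.

```python
from fractions import Fraction

def two_sum(a: float, b: float):
    """Return (s, e) where s is the rounded sum and a + b == s + e exactly."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def compensated_sum(values):
    """Sum values while accumulating the rounding error of each addition."""
    total = 0.0
    error = 0.0
    for value in values:
        total, e = two_sum(total, value)
        error += e
    return total + error

values = [1e16, 1.0, -1e16]

naive = 0.0
for value in values:
    naive += value              # 1e16 + 1.0 rounds back to 1e16

print(naive)                    # 0.0
print(compensated_sum(values))  # 1.0

# Check the error-free property with exact rational arithmetic.
s, e = two_sum(0.1, 0.2)
assert Fraction(0.1) + Fraction(0.2) == Fraction(s) + Fraction(e)
```

This simple compensated sum recovers the lost 1.0 here, but the error terms are themselves added in floating point, so in general it reduces error rather than eliminating it.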

## Version History

- **v2.4.0** - Clean API (`vla.sum`, `vla.matmul`), 55 kernels, Windows support
- **v2.0.x** - Native CUDA kernels, VLAResult container
- **v1.x** - Initial release with Triton backend

## License

Proprietary. All rights reserved.
© 2025-2026 Clouthier Simulation Labs

**Contact:** kyle@simgen.dev
**Website:** https://simgen.dev
