Metadata-Version: 2.4
Name: mini-vllm-robotics
Version: 0.1.4
Summary: Portable LLM inference engine for robotics - Token-Routed MoE only
Author-email: Complexity-ML <contact@complexity-ml.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/Complexity-ML/mini-vllm
Project-URL: Repository, https://github.com/Complexity-ML/mini-vllm
Project-URL: Documentation, https://github.com/Complexity-ML/mini-vllm#readme
Keywords: llm,inference,robotics,moe,token-routed,vllm,transformers
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: safetensors>=0.4.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: uvicorn>=0.27.0
Provides-Extra: flash
Requires-Dist: flash-attn>=2.5.0; extra == "flash"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: mini-vllm-robotics[dev,flash]; extra == "all"

# mini-vLLM

**Portable LLM inference engine for robotics.**

A lightweight alternative to vLLM designed for embedded systems and robotics applications. It supports **Token-Routed MoE** (deterministic routing) only; traditional MoE with learned top-k gating is not supported.

## Why mini-vLLM?

| Feature | vLLM | mini-vLLM |
|---------|------|-----------|
| Continuous Batching | ✅ Custom CUDA | ✅ PyTorch |
| PagedAttention | ✅ Custom kernel | ✅ Simplified |
| KV Cache | ✅ Complex | ✅ Simple |
| Dependencies | ~50+ | **5** |
| Install | Complex | `pip install mini-vllm-robotics` |
| GPU support | NVIDIA only | **Any** (PyTorch) |
| MoE type | All | **Token-Routed only** |

## Installation

```bash
pip install mini-vllm-robotics

# With FlashAttention (recommended; quotes keep shells like zsh from expanding the brackets)
pip install "mini-vllm-robotics[flash]"
```

## Quick Start

### Python API

```python
from mini_vllm import LLM

# Load model
llm = LLM("Pacific-Prime/pacific-prime")

# Generate
outputs = llm.generate(["Hello, how are you?"])
print(outputs[0].text)

# Streaming (for robotics)
for token in llm.stream("Tell me about robots"):
    print(token, end="", flush=True)
```

### Server Mode

```bash
# Start server
mini-vllm serve Pacific-Prime/pacific-prime --port 8080

# Query (OpenAI-compatible)
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'
```
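
The same request can be made from Python. This is a minimal sketch using only the standard library, assuming the `/v1/completions` endpoint shown above and the standard OpenAI completions schema (`choices[0].text`) for the response; the helper names are illustrative, not part of the package API.

```python
import json
import urllib.request

API_URL = "http://localhost:8080/v1/completions"  # server started via `mini-vllm serve`

def build_completion_request(prompt: str, max_tokens: int = 50) -> dict:
    """Request body for the OpenAI-compatible endpoint (same fields as the curl example)."""
    return {"prompt": prompt, "max_tokens": max_tokens}

def complete(prompt: str, max_tokens: int = 50) -> str:
    """POST the request and return the generated text from choices[0].text."""
    body = json.dumps(build_completion_request(prompt, max_tokens)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```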

## Supported Models

### Token-Routed MoE (Deterministic)
- ✅ **Pacific-Prime / Complexity**: INL Dynamics + Token-Routed MLP

### Dense Models (Coming Soon)
- 🔜 Llama / Llama 2 / Llama 3
- 🔜 Qwen / Qwen 2
- 🔜 Phi / Phi 3

### NOT Supported (By Design)
- ❌ Mixtral (top-k MoE)
- ❌ DeepSeek MoE (top-k MoE)
- ❌ Any model with learned gating

## Why No Traditional MoE?

For robotics and embedded systems, **deterministic behavior** is critical.

**Supported:**
- **Dense models** (Llama, Phi, Qwen): No MoE = fully deterministic
- **Token-Routed MoE** (Pacific-Prime): `expert_id = token_id % num_experts` = deterministic

**Not supported:**
- **Learned MoE** (Mixtral, DeepSeek): softmax + top-k gating is learned and input-dependent, so expert assignment varies per token and can shift with fine-tuning

Why it matters for robotics:
1. **Reproducibility**: Same input = same output, always
2. **Predictable latency**: No variable routing overhead
3. **Simpler debugging**: Know exactly which expert processes each token
4. **Safety**: No unexpected behavior from learned routing
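
The modulo rule above can be sketched in a few lines (plain Python for clarity; the function name is illustrative, the routing rule is the one stated above):

```python
def route_tokens(token_ids: list[int], num_experts: int) -> list[int]:
    """Deterministic Token-Routed MoE assignment: expert_id = token_id % num_experts.

    The mapping is a pure function of the vocabulary ID, so the same token
    always reaches the same expert, independent of context or checkpoint.
    """
    return [tid % num_experts for tid in token_ids]

# With 4 experts, token 17 maps to expert 1 every time it appears
assignments = route_tokens([17, 4, 9, 17], num_experts=4)
```

Because routing never consults activations, per-token latency is constant and the expert handling any token can be determined offline, which is what makes the debugging and safety properties above possible.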

## Architecture

```
mini-vllm/
├── mini_vllm/
│   ├── engine/          # LLM engine, scheduler, worker
│   ├── core/            # Sequences, KV cache, block manager
│   ├── attention/       # FlashAttention, PagedAttention
│   ├── mlp/             # Dense MLP, Token-Routed MLP
│   ├── models/          # Model implementations
│   ├── server/          # OpenAI-compatible API
│   └── entrypoints/     # CLI and Python API
```

## Robotics Integration

```python
from mini_vllm import LLM

# Initialize once at robot startup
llm = LLM("Pacific-Prime/pacific-prime", device="cuda")

# In robot control loop
def on_user_speech(text: str):
    """Process user speech and respond."""
    for token in llm.stream(text):
        robot.speak(token)  # Stream to TTS
```

## Performance

Compared to HuggingFace Transformers:
- **2-3x faster** with FlashAttention
- **Lower memory** with paged KV cache
- **Batching support** for multiple requests

Compared to vLLM:
- **Simpler install** (5 dependencies vs 50+)
- **More portable** (any PyTorch-supported GPU)
- **Slightly slower** (pure PyTorch vs custom CUDA)

## Contributing

```bash
git clone https://github.com/Complexity-ML/mini-vllm
cd mini-vllm
pip install -e ".[dev]"
pytest
```

## License

Apache-2.0

## Credits

- Inspired by [vLLM](https://github.com/vllm-project/vllm)
- Uses [FlashAttention](https://github.com/Dao-AILab/flash-attention) when available
- Built for [Pacific-Prime](https://huggingface.co/Pacific-Prime/pacific-prime) models
