Metadata-Version: 2.4
Name: synkro
Version: 0.1.0
Summary: Generate training datasets from any document
Author: Murtaza Meerza
License-Expression: MIT
License-File: LICENSE
Keywords: dataset-generation,fine-tuning,llm,synthetic-data,training-data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: html2text>=2020.1
Requires-Dist: httpx>=0.25
Requires-Dist: instructor>=1.0
Requires-Dist: litellm>=1.40
Requires-Dist: mammoth>=1.6
Requires-Dist: marker-pdf>=0.2
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

# Synkro

**Generate training data from any document.**

```python
import synkro

dataset = synkro.generate("All expenses over $50 require manager approval.")
dataset.save()
```

That's it. You now have a JSONL file ready for fine-tuning.
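
To look at what was written, a JSONL file can be read with the standard library alone. The sketch below assumes only that each non-empty line is one JSON object; the helper name `read_jsonl` is hypothetical and not part of synkro, and the record fields depend on the dataset type you generated.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed object per non-empty line of a JSONL file.

    Hypothetical helper for inspection; not part of the synkro API.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Example usage (the filename is whatever you passed to dataset.save()):
# records = read_jsonl("training.jsonl")
# print(len(records), "records")
```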

## Installation

```bash
uv pip install synkro
```

Or with pip:

```bash
pip install synkro
```

## Quick Start

### Copy-paste this (it works)

```python
import synkro
from synkro.examples import EXPENSE_POLICY

# Generate 20 training traces (takes ~2 min)
dataset = synkro.generate(EXPENSE_POLICY)

# Save to file (auto-names it)
dataset.save()
```

### From your own text

```python
import synkro

dataset = synkro.generate("""
All expenses over $50 require manager approval.
Expenses over $500 require VP approval.
Receipts required for all purchases over $25.
""")

dataset.save("training.jsonl")
```

### From files

```python
import synkro
from synkro import Policy

# PDF, DOCX, TXT, MD all work
policy = Policy.from_file("handbook.pdf")
dataset = synkro.generate(policy)
dataset.save()
```

### From URLs

```python
import synkro
from synkro import Policy

policy = Policy.from_url("https://example.com/terms")
dataset = synkro.generate(policy)
dataset.save()
```

## CLI

```bash
# From file
synkro generate policy.pdf

# From text (single quotes stop the shell from expanding $50)
synkro generate 'All expenses over $50 need approval'

# With options
synkro generate policy.pdf --traces 50 --output training.jsonl

# Quick demo
synkro demo
```
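
If you drive the CLI from a Python script, passing the arguments as a list avoids shell quoting entirely. This is a minimal sketch that uses only the `generate` subcommand and the `--traces` and `--output` options shown above; the helper name `build_generate_command` is hypothetical, and it assumes the `synkro` executable is on your `PATH`.

```python
import subprocess


def build_generate_command(source, traces=None, output=None):
    """Build the argument list for `synkro generate` (hypothetical helper)."""
    cmd = ["synkro", "generate", source]
    if traces is not None:
        cmd += ["--traces", str(traces)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


# Example usage (runs the real CLI, so it needs synkro installed and an API key):
# subprocess.run(
#     build_generate_command("policy.pdf", traces=50, output="training.jsonl"),
#     check=True,
# )
```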

## Output Formats

```python
import synkro
from synkro import DatasetType

# `policy` is policy text or a Policy object, as in Quick Start

# SFT (default) - Chat messages for supervised fine-tuning
dataset = synkro.generate(policy)
dataset.save("sft.jsonl")

# QA - Question-answer pairs for RAG
dataset = synkro.generate(policy, dataset_type=DatasetType.QA)
dataset.save("qa.jsonl", format="qa")

# DPO - Preference pairs for direct preference optimization
dataset = synkro.generate(policy, dataset_type=DatasetType.DPO)
dataset.save("dpo.jsonl", format="dpo")
```

## Models

Works with any LLM, through a provider constant or a plain model string. Defaults to GPT-4o-mini.

```python
from synkro import Generator, OpenAI, Anthropic, Ollama

# API providers
Generator(generation_model=OpenAI.GPT_4O_MINI)
Generator(generation_model=Anthropic.CLAUDE_35_SONNET)

# Local (free, no API key)
Generator(generation_model=Ollama.LLAMA_31_8B)

# Any model string works
Generator(generation_model="groq/llama-3.3-70b-versatile")
```

## What It Does

1. **Analyzes** your document to understand its rules
2. **Generates** diverse scenarios testing comprehension
3. **Creates** expert responses with reasoning + citations
4. **Grades** each response for accuracy
5. **Refines** failures automatically (up to 3x)

Output: JSONL ready for fine-tuning.
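
The grade-and-refine steps (4 and 5) can be pictured as a bounded retry loop. The sketch below is an illustration under stated assumptions, not synkro's actual implementation: `respond`, `grade`, and `refine` are hypothetical callables standing in for LLM calls, and the cap of 3 mirrors the "up to 3x" above.

```python
from typing import Callable

MAX_REFINEMENTS = 3  # mirrors "up to 3x" in the list above


def generate_with_refinement(
    scenario: str,
    respond: Callable[[str], str],       # hypothetical: drafts a response
    grade: Callable[[str, str], bool],   # hypothetical: True if the response passes
    refine: Callable[[str, str], str],   # hypothetical: rewrites a failed response
    max_refinements: int = MAX_REFINEMENTS,
) -> tuple[str, bool]:
    """Return the final response and whether it passed grading."""
    response = respond(scenario)
    for _ in range(max_refinements):
        if grade(scenario, response):
            return response, True
        # Grading failed: ask for a refined response and try again.
        response = refine(scenario, response)
    # Grade the last refinement once more before giving up.
    return response, grade(scenario, response)
```

Capping the retries keeps cost bounded: a scenario that never passes triggers at most one draft plus three refinements.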

## Environment Variables

```bash
export OPENAI_API_KEY="sk-..."
# Or: ANTHROPIC_API_KEY, GEMINI_API_KEY, GROQ_API_KEY
```
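
Before a long run it can help to confirm that a key is actually visible to Python. A minimal sketch, using only the variable names listed above; the helper name `configured_providers` is hypothetical and not part of synkro.

```python
import os

# Variable names taken from the snippet above.
PROVIDER_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "GROQ_API_KEY",
)


def configured_providers(env=None):
    """Return the key names that are set and non-empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name)]


# Example usage:
# if not configured_providers():
#     print("No API key found; set one or use a local Ollama model.")
```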

## HuggingFace

```python
dataset.to_huggingface().push_to_hub("my-org/dataset")
```

## Built-in Examples

```python
import synkro
from synkro.examples import (
    EXPENSE_POLICY,
    HR_HANDBOOK,
    REFUND_POLICY,
    SUPPORT_GUIDELINES,
    SECURITY_POLICY,
)

# Use any of these to test instantly
dataset = synkro.generate(EXPENSE_POLICY)
```

## Requirements

- Python 3.10+
- API key for any LLM provider (or Ollama for free local models)

## License

MIT
