Metadata-Version: 2.4
Name: synkro
Version: 0.4.4
Summary: Generate training datasets from any document
Author: Murtaza Meerza
License-Expression: MIT
License-File: LICENSE
Keywords: dataset-generation,fine-tuning,llm,synthetic-data,training-data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: html2text>=2020.1
Requires-Dist: httpx>=0.25
Requires-Dist: litellm>=1.40
Requires-Dist: mammoth>=1.6
Requires-Dist: marker-pdf>=0.2
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

# Synkro

Turn policies, handbooks, and documentation into high-quality training data for fine-tuning LLMs.

## Features

- **Quality Evaluation** - Each response is graded and automatically refined if it fails
- **Multiple Formats** - SFT (chat), QA (question-answer), and Tool Calling
- **Tool Call Training** - Generate OpenAI function calling format for teaching models to use custom tools
- **Top LLM Providers** - OpenAI, Anthropic, and Google
- **File Support** - PDF, DOCX, TXT, Markdown, URLs
- **CLI Included** - Generate datasets from the command line

## Installation

```bash
pip install synkro
```

## Quick Start

```python
from synkro.pipelines import create_pipeline
from synkro.models.google import Google
from synkro.types import DatasetType

pipeline = create_pipeline(
    model=Google.GEMINI_25_FLASH,          # Fast generation
    grading_model=Google.GEMINI_25_PRO,    # Quality grading
    dataset_type=DatasetType.SFT,
)

dataset = pipeline.generate(
    "All expenses over $50 require manager approval.",
    traces=50,
)
dataset.save("training.jsonl")
```

### From Files

```python
from synkro.pipelines import create_pipeline
from synkro.core.policy import Policy

policy = Policy.from_file("handbook.pdf")  # PDF, DOCX, TXT, MD
pipeline = create_pipeline()
dataset = pipeline.generate(policy, traces=100)
dataset.save()
```

### From URLs

```python
from synkro.core.policy import Policy

policy = Policy.from_url("https://example.com/terms")
dataset = pipeline.generate(policy)
```

## Dataset Types

| Format | Output | Best For |
|--------|--------|----------|
| **SFT** | Chat messages | Fine-tuning chat models |
| **QA** | Question-answer pairs | RAG systems, knowledge bases |
| **TOOL_CALL** | Function calling format | Training models to use custom tools |

### SFT (Default)

```python
from synkro.types import DatasetType

pipeline = create_pipeline(dataset_type=DatasetType.SFT)
dataset = pipeline.generate(policy)
```

Output:
```json
{"messages": [
  {"role": "system", "content": "You are a policy expert..."},
  {"role": "user", "content": "What's the approval process for $350?"},
  {"role": "assistant", "content": "For a $350 expense, you need..."}
]}
```

### QA

```python
pipeline = create_pipeline(dataset_type=DatasetType.QA)
```

Output:
```json
{"question": "What's the approval process?", "answer": "You need...", "context": "..."}
```

### Tool Calling

Generate training data for teaching models when and how to use your custom tools:

```python
from synkro import create_pipeline, ToolDefinition, DatasetType

# Define your tools
web_search = ToolDefinition(
    name="web_search",
    description="Search the web for current information",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"}
        },
        "required": ["query"]
    },
    mock_responses=["NYC: 72°F, sunny", "BTC: $67,234"]
)

# Create pipeline with tools
pipeline = create_pipeline(
    dataset_type=DatasetType.TOOL_CALL,
    tools=[web_search],
)

# Generate from tool usage guidelines
dataset = pipeline.generate("""
Use web_search for real-time data like weather, prices.
Answer general questions directly without tools.
""", traces=20)

dataset.save("tool_training.jsonl", format="tool_call")
```

Output (OpenAI function calling format):
```json
{"messages": [
  {"role": "system", "content": "You have access to: web_search..."},
  {"role": "user", "content": "What's the weather in NYC?"},
  {"role": "assistant", "content": null, "tool_calls": [
    {"id": "call_abc", "type": "function", "function": {"name": "web_search", "arguments": "{\"query\": \"weather NYC\"}"}}
  ]},
  {"role": "tool", "tool_call_id": "call_abc", "content": "NYC: 72°F, sunny"},
  {"role": "assistant", "content": "The weather in NYC is 72°F and sunny."}
]}
```

## Evaluation & Grading

Every response is graded on policy compliance, citations, and reasoning. Failed responses are automatically refined (up to N iterations).

```python
from synkro.pipelines import create_pipeline
from synkro.models.openai import OpenAI

pipeline = create_pipeline(
    model=OpenAI.GPT_4O_MINI,       # Fast generation
    grading_model=OpenAI.GPT_4O,    # Quality grading
    max_iterations=3,               # Refinement attempts
)

dataset = pipeline.generate(policy, traces=100)

# Check quality
print(f"Pass rate: {dataset.passing_rate:.1%}")

# Filter to only passing traces
high_quality = dataset.filter(passed=True)
high_quality.save("training.jsonl")
```

## CLI

```bash
# From file
synkro generate policy.pdf --traces 50 --format sft

# From text
synkro generate "All expenses over $50 need approval" -n 20

# From URL
synkro generate https://example.com/policy -o training.jsonl
```

**Options:**
- `--traces, -n` - Number of traces (default: 20)
- `--format, -f` - Output format: sft, qa, or tool_call (default: sft)
- `--output, -o` - Output file path
- `--model, -m` - Model for generation