Metadata-Version: 2.4
Name: synthed
Version: 0.3.2
Summary: Synthetic data generator for development and testing
License: MIT
Requires-Python: >=3.11
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: pyarrow>=14.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: scipy>=1.11
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: duckdb>=0.9; extra == 'dev'
Requires-Dist: mypy>=1.5; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: ews
Requires-Dist: duckdb>=0.9; extra == 'ews'
Description-Content-Type: text/markdown

# Synthed

**Synthetic data generator for development and testing.**

Synthed scans local datasets, infers their structure and behavior, extracts a portable generation spec, and produces realistic synthetic data from that spec. It also includes a specialized EWS (Early Warning System) data generator for testing banking pipelines.

## Installation

```bash
pip install synthed
```

For development, install from a source checkout in editable mode with the `dev` extra:
```bash
pip install -e ".[dev]"
```

## Quickstart

```bash
# Run the full pipeline on sample data
synthed run-pipeline ./examples/sample_data --out ./artifacts --seed 42

# Or step by step:
synthed scan ./examples/sample_data
synthed build-spec ./examples/sample_data --out ./artifacts/spec
synthed generate --spec ./artifacts/spec/spec.yaml --out ./artifacts/generated --seed 42
synthed evaluate --real ./examples/sample_data --synthetic ./artifacts/generated --out ./artifacts/report

# Generate EWS test data (1000 firms, 4 periods)
synthed ews-generate --out ./ews_output --seed 42 --firms 1000
```
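The step-by-step commands can also be driven from a Python script, for example in a test fixture. The following is a minimal sketch that only shells out to the CLI invocations shown above; the helper names `pipeline_commands` and `run_pipeline` are hypothetical and not part of Synthed.

```python
import shlex
import subprocess
from pathlib import Path


def pipeline_commands(data_dir: str, out_dir: str, seed: int = 42) -> list[list[str]]:
    """Build the step-by-step Quickstart invocations as argument lists."""
    out = Path(out_dir)
    return [
        ["synthed", "scan", data_dir],
        ["synthed", "build-spec", data_dir, "--out", str(out / "spec")],
        ["synthed", "generate",
         "--spec", str(out / "spec" / "spec.yaml"),
         "--out", str(out / "generated"),
         "--seed", str(seed)],
        ["synthed", "evaluate",
         "--real", data_dir,
         "--synthetic", str(out / "generated"),
         "--out", str(out / "report")],
    ]


def run_pipeline(data_dir: str, out_dir: str, seed: int = 42) -> None:
    """Run each step in order and stop at the first failing command."""
    for cmd in pipeline_commands(data_dir, out_dir, seed):
        print("$", shlex.join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("./examples/sample_data", "./artifacts")` requires `synthed` to be installed and on `PATH`. Passing argument lists rather than a shell string avoids quoting problems with paths that contain spaces.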

## CLI Commands

| Command | Description |
|---------|-------------|
| `scan` | Discover datasets in a directory |
| `profile` | Profile datasets and output statistics |
| `build-spec` | Build a portable synthetic data spec |
| `generate` | Generate synthetic data from a spec |
| `evaluate` | Evaluate synthetic vs real data |
| `run-pipeline` | Run the full pipeline end-to-end |
| `ews-generate` | Generate EWS pipeline test data (KIK, YIS, EUS, risk, static) |

## How It Works

1. **Scan** — Discover CSV, Parquet, and SQLite files
2. **Profile** — Collect statistics, patterns, and distributions
3. **Infer** — Detect semantic types, primary/foreign keys, constraints
4. **Build Spec** — Produce a portable YAML spec describing the data
5. **Generate** — Create synthetic data from the spec using rule-based generators
6. **Evaluate** — Measure fidelity and quality metrics
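
To make the Profile and Infer steps more concrete, here is a rough sketch of the kind of checks they involve, written with plain pandas. It is an illustration under simple assumptions (a key is any fully populated, all-distinct column; a foreign key is any column whose values all appear in a parent key), not Synthed's actual implementation. The function names and sample tables are hypothetical.

```python
import pandas as pd


def profile_column(series: pd.Series) -> dict:
    """Collect a few basic statistics for one column."""
    non_null = series.dropna()
    stats = {
        "dtype": str(series.dtype),
        "null_fraction": float(series.isna().mean()),
        "n_unique": int(non_null.nunique()),
    }
    is_numeric = pd.api.types.is_numeric_dtype(series)
    if is_numeric and not pd.api.types.is_bool_dtype(series):
        stats["min"] = float(non_null.min())
        stats["max"] = float(non_null.max())
    return stats


def candidate_primary_keys(df: pd.DataFrame) -> list[str]:
    """Columns with no nulls and all-distinct values are key candidates."""
    return [
        col for col in df.columns
        if len(df) > 0 and df[col].notna().all() and df[col].is_unique
    ]


def candidate_foreign_keys(
    child: pd.DataFrame, parent: pd.DataFrame, parent_key: str
) -> list[str]:
    """Child columns whose non-null values all occur in the parent key."""
    parent_values = set(parent[parent_key])
    return [
        col for col in child.columns
        if child[col].notna().any() and set(child[col].dropna()) <= parent_values
    ]


# Hypothetical sample tables used only for illustration.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["a", "b", "a"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 3, 2],
    "amount": [5.0, 7.5, None, 2.0],
})

print(candidate_primary_keys(orders))
print(candidate_foreign_keys(orders, customers, "customer_id"))
print(profile_column(orders["amount"]))
```

Checks like these only produce candidates. On small tables a column can look unique or contained by coincidence, so a real inference step would likely also weigh column names, types, and sample size.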

## EWS Data Generator

The `ews-generate` command produces realistic test data for EWS (Early Warning System) banking pipelines:

- **1000 firms** (set with `--firms`), each following one of 8 archetype score trajectories (normal, fast rising, persistent high, regime change, etc.)
- **7 output file types**: KIK CSV (cp1254), YIS/EUS prediction CSVs, risk Parquet, static TXT, EUS target/scope CSVs, training Parquet
- **4 quarterly periods** (configurable)
- **Edge case injection** for robustness testing (NaN scores, empty names, whitespace MUTAs)
- **Deterministic output**: the same `--seed` reproduces the same data
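
The idea of seeded archetype trajectories can be sketched with NumPy as follows. This is not the generator's code: it covers only four of the archetypes named above, and the score range of 0 to 1, the base levels, and the noise scale are assumptions made for the example.

```python
import numpy as np

# Four of the archetypes named above; the shapes below are illustrative guesses.
ARCHETYPES = ("normal", "fast_rising", "persistent_high", "regime_change")


def score_trajectories(
    n_firms: int, n_periods: int, seed: int
) -> tuple[np.ndarray, np.ndarray]:
    """Return (archetype index per firm, scores of shape (n_firms, n_periods))."""
    rng = np.random.default_rng(seed)  # all randomness flows from this one seed
    archetype = rng.integers(0, len(ARCHETYPES), size=n_firms)
    t = np.linspace(0.0, 1.0, n_periods)  # normalized time axis

    base = np.empty((n_firms, n_periods))
    base[archetype == 0] = 0.2                          # normal: flat and low
    base[archetype == 1] = 0.2 + 0.6 * t                # fast rising: linear ramp
    base[archetype == 2] = 0.8                          # persistent high: flat and high
    base[archetype == 3] = np.where(t < 0.5, 0.2, 0.8)  # regime change: step up

    noise = rng.normal(0.0, 0.03, size=base.shape)
    return archetype, np.clip(base + noise, 0.0, 1.0)


archetype, scores = score_trajectories(n_firms=1000, n_periods=4, seed=42)
print(scores.shape)
```

Because every random draw comes from a single `default_rng(seed)` generator, calling the function twice with the same arguments returns identical arrays, which is the property the seed parameter is meant to provide.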

## Architecture

```
src/synthed/
├── cli/          # CLI application (Typer)
├── config/       # Configuration models (Pydantic)
├── io/           # File discovery and readers
├── profiling/    # Column/table profiling
├── inference/    # Semantic type and relation inference
├── constraints/  # Hard/soft constraint extraction
├── spec/         # Spec models and builder
├── generation/   # Synthetic data generators
├── evaluation/   # Fidelity and quality metrics
├── privacy/      # Data quality checks
├── export/       # Output writers (CSV, Parquet, SQLite, YAML, JSON)
├── reporting/    # Report rendering
├── ews/          # EWS pipeline data generator
└── utils/        # Logging, types
```

## Supported Formats

- Input: CSV, Parquet, SQLite
- Output: CSV, Parquet, SQLite, TXT (cp1254)
- Spec: YAML
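
Files written in cp1254 (Windows Turkish) must be read with the same encoding, otherwise Turkish letters come out garbled. The sketch below round-trips a small CSV using only the standard library; the column names and values are made up for the example and do not reflect the actual EWS file layout.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical columns and values, chosen only to exercise Turkish characters.
rows = [
    {"firm_id": "1", "firm_name": "Şişli Çelik İnşaat"},
    {"firm_id": "2", "firm_name": "Ağrı Gıda"},
]

path = Path(tempfile.mkdtemp()) / "sample_cp1254.csv"

# Write the file with the cp1254 codec.
with path.open("w", encoding="cp1254", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["firm_id", "firm_name"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back with the same codec; the text survives intact.
with path.open("r", encoding="cp1254", newline="") as f:
    round_tripped = list(csv.DictReader(f))

# Decoding the same bytes with a different codec garbles the Turkish letters.
misread = path.read_bytes().decode("latin-1")

print(round_tripped == rows)
print("Şişli" in misread)
```

With pandas, the equivalent is to pass `encoding="cp1254"` to `pd.read_csv`.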

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/
mypy src/
```

## License

MIT
