Metadata-Version: 2.4
Name: synthed
Version: 0.1.0
Summary: Privacy-first synthetic data blueprint generator
License: MIT
Requires-Python: >=3.11
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: pyarrow>=14.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: scipy>=1.11
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: mypy>=1.5; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

# Synthed

**Privacy-first synthetic data blueprint generator.**

Synthed scans local datasets, infers their structure and behavior, extracts a portable generation spec, and produces synthetic data that preserves structural realism without reproducing original records.

> **Disclaimer:** This tool reduces privacy risk but does not automatically establish legal anonymization. Users are responsible for compliance with applicable data protection regulations.

## Quickstart

```bash
# Install
pip install -e ".[dev]"

# Run the full pipeline on sample data
synthed run-pipeline ./examples/sample_data --out ./artifacts --seed 42

# Or step by step:
synthed scan ./examples/sample_data
synthed build-spec ./examples/sample_data --out ./artifacts/spec
synthed generate --spec ./artifacts/spec/spec.yaml --out ./artifacts/generated --seed 42
synthed evaluate --real ./examples/sample_data --synthetic ./artifacts/generated --out ./artifacts/report
```

## CLI Commands

| Command | Description |
|---------|-------------|
| `scan` | Discover datasets in a directory |
| `profile` | Profile datasets and output statistics |
| `build-spec` | Build a portable synthetic data spec |
| `generate` | Generate synthetic data from a spec |
| `evaluate` | Evaluate synthetic vs real data |
| `run-pipeline` | Run the full pipeline end-to-end |

## How It Works

1. **Scan** — Discover CSV, Parquet, and SQLite files
2. **Profile** — Collect statistics, patterns, and distributions
3. **Infer** — Detect semantic types, primary/foreign keys, constraints
4. **Build Spec** — Produce a portable YAML spec describing the data
5. **Generate** — Create synthetic data from the spec using rule-based generators
6. **Evaluate** — Measure fidelity, utility, and privacy risk
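
The Profile, Build Spec, and Generate steps above can be illustrated with a minimal sketch. It is not Synthed's implementation: the function names `profile_column` and `generate_column` and the spec dictionary layout are hypothetical, and the numeric generator assumes a clipped normal distribution purely for illustration.

```python
import numpy as np
import pandas as pd


def profile_column(series: pd.Series) -> dict:
    """Summarize one column into a record-free spec entry (hypothetical layout)."""
    if pd.api.types.is_numeric_dtype(series):
        return {
            "kind": "numeric",
            "min": float(series.min()),
            "max": float(series.max()),
            "mean": float(series.mean()),
            "std": float(series.std(ddof=0)),
        }
    # Treat everything else as categorical and keep only value frequencies.
    freqs = series.value_counts(normalize=True)
    return {
        "kind": "categorical",
        "values": [str(v) for v in freqs.index],
        "probs": [float(p) for p in freqs.values],
    }


def generate_column(spec: dict, n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw n synthetic values from a spec entry using simple rules."""
    if spec["kind"] == "numeric":
        draws = rng.normal(spec["mean"], spec["std"], size=n)
        # Clip so synthetic values stay inside the observed range.
        return np.clip(draws, spec["min"], spec["max"])
    return rng.choice(spec["values"], size=n, p=spec["probs"])


real = pd.DataFrame({"age": [23, 35, 41, 58], "plan": ["free", "pro", "free", "free"]})
spec = {name: profile_column(real[name]) for name in real.columns}

rng = np.random.default_rng(42)  # fixed seed, mirroring the CLI's --seed option
synthetic = pd.DataFrame(
    {name: generate_column(entry, 100, rng) for name, entry in spec.items()}
)
```

Only the summary statistics in `spec` reach the generator, which is the property that makes a spec portable: the original rows are not needed at generation time.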

## Architecture

```
src/synthed/
├── cli/          # CLI application (Typer)
├── config/       # Configuration models (Pydantic)
├── io/           # File discovery and readers
├── profiling/    # Column/table profiling
├── inference/    # Semantic type and relation inference
├── constraints/  # Hard/soft constraint extraction
├── spec/         # Spec models and builder
├── generation/   # Synthetic data generators
├── evaluation/   # Fidelity and quality metrics
├── privacy/      # Privacy risk checks
├── export/       # Output writers (CSV, Parquet, SQLite, YAML, JSON)
├── reporting/    # Report rendering
└── utils/        # Logging, types
```

## Privacy Controls

- **Exact-row overlap detection** — Detects synthetic rows that exactly match real rows
- **High-risk identifier regeneration** — IDs, keys, and PII columns are regenerated
- **Rare-group leakage detection** — Flags groups below a configurable size threshold
- **Small dataset warnings** — Extra caution for datasets under 100 rows
- **Privacy report** — Generated on every run
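
Two of the checks above, exact-row overlap and rare-group detection, can be sketched with pandas. This is an assumption about how such checks might look, not the code in `privacy/`; the function names and the default `min_size` of 5 are hypothetical.

```python
import pandas as pd


def exact_row_overlap(real: pd.DataFrame, synthetic: pd.DataFrame) -> int:
    """Count synthetic rows that are identical to at least one real row."""
    shared = [c for c in synthetic.columns if c in real.columns]
    # Deduplicate real rows so each synthetic row can match at most once.
    real_rows = real[shared].drop_duplicates()
    merged = synthetic[shared].merge(real_rows, on=shared, how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())


def rare_groups(df: pd.DataFrame, keys: list[str], min_size: int = 5) -> pd.Series:
    """Return sizes of key combinations that have fewer than min_size rows."""
    sizes = df.groupby(keys).size()
    return sizes[sizes < min_size]


real = pd.DataFrame({"age": [23, 35, 41], "plan": ["free", "pro", "free"]})
synthetic = pd.DataFrame({"age": [23, 30, 41], "plan": ["free", "pro", "pro"]})

overlap = exact_row_overlap(real, synthetic)      # only (23, "free") is copied
flagged = rare_groups(real, ["plan"], min_size=2)  # "pro" appears once
```

A non-zero overlap count does not always mean leakage, since low-cardinality tables can reproduce a real row by chance, so a check like this is best read together with the table's size and column count.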

## Supported Formats

- Input: CSV, Parquet, SQLite
- Output: CSV, Parquet, SQLite
- Spec: YAML
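
As a rough illustration of how input files in these formats might be discovered, the sketch below groups files by extension using only the standard library. The extension mapping (including `.db` for SQLite) and the name `discover_datasets` are assumptions, not Synthed's actual behavior.

```python
from pathlib import Path

# Assumed extension-to-format mapping; the extensions Synthed accepts may differ.
FORMATS = {".csv": "csv", ".parquet": "parquet", ".sqlite": "sqlite", ".db": "sqlite"}


def discover_datasets(root: Path) -> dict[str, list[Path]]:
    """Group files under root by detected format, ignoring unknown extensions."""
    found: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        fmt = FORMATS.get(path.suffix.lower())
        if path.is_file() and fmt is not None:
            found.setdefault(fmt, []).append(path)
    return found
```

Calling `discover_datasets(Path("./examples/sample_data"))` would return a dictionary keyed by format name, with files in unsupported formats left out.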

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/
mypy src/
```

## Known Limitations

- V1 supports tabular/relational data only (no images, PDFs, free text)
- Statistical generators are rule-based, not ML-based
- Does not provide legal anonymization guarantees
- FK inference is heuristic-based (name matching + value overlap)
- Large datasets may require sampling for profiling
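
To make the FK heuristic concrete, here is a minimal sketch that combines name matching with value overlap. The naming rule (`<table>_<key>` with an optional trailing "s" removed), the 0.95 threshold, and the function `fk_candidates` are illustrative assumptions rather than the rules Synthed applies.

```python
import pandas as pd


def fk_candidates(
    child: pd.DataFrame,
    parent: pd.DataFrame,
    parent_name: str,
    parent_key: str = "id",
    min_overlap: float = 0.95,
) -> list[tuple[str, float]]:
    """Return (child_column, overlap) pairs that look like foreign keys."""
    parent_values = set(parent[parent_key].dropna())
    # Name matching: accept "customer_id" or "customers_id" for table "customers".
    name = parent_name.lower()
    expected = {f"{name}_{parent_key}", f"{name.removesuffix('s')}_{parent_key}"}
    hits = []
    for col in child.columns:
        if col.lower() not in expected:
            continue
        values = child[col].dropna()
        if values.empty:
            continue
        # Value overlap: share of child values present in the parent key column.
        overlap = float(values.isin(parent_values).mean())
        if overlap >= min_overlap:
            hits.append((col, overlap))
    return hits


customers = pd.DataFrame({"id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13], "customer_id": [1, 1, 3, 2]})
candidates = fk_candidates(orders, customers, "customers")
```

Because both signals are heuristic, a column with an unconventional name is missed, and a coincidentally named column with overlapping values can produce a false positive. Inferred relations are therefore worth reviewing in the generated spec.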

## License

MIT
