Metadata-Version: 2.1
Name: at1-compression
Version: 0.2.0
Summary: AT-1 (Atom Teleporter): structure-aware lossless compression for logs, telemetry, tabular, JSON, OSM/geo, genomic VCF, and EEG; ingest-from-URL (at1 fetch) and an honest compressibility audit
Author: Felix Kramer
License: BUSL-1.1
Project-URL: Homepage, https://github.com/FelixKramer/Atom-Teleporter-AT-1-
Keywords: compression,lossless,columnar,vcf,osm,ndjson,logs
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: System :: Archiving :: Compression
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == "mcp"
Provides-Extra: zstd
Requires-Dist: zstandard>=0.20; extra == "zstd"

# AT-1 — Structure-Aware Lossless Compression

[![CI](https://github.com/FelixKramer/Atom-Teleporter-AT-1-/actions/workflows/ci.yml/badge.svg)](https://github.com/FelixKramer/Atom-Teleporter-AT-1-/actions/workflows/ci.yml)

**AT-1** is a production-grade, verified-lossless compression system that exploits the
*semantic structure* of data — columnar fields, delta sequences, sparse matrices,
document skeletons — instead of treating input as a flat byte stream. Two hard guarantees
distinguish it from general-purpose compressors: an **encode-time verification gate**
byte-compares reconstructed output against the original (no output unless identical), and
a **non-inferiority fallback** that always emits whichever is smaller — structural
encoding or raw baseline — so AT-1 can never be worse than LZMA/xz on any input.

Beyond ratio, AT-1's differentiator is **query-while-compressed**: the queryable codec
answers predicate/projection queries reading <1% of a file while the *same* file still
reconstructs the original byte-for-byte (unlike Parquet/ORC, which discard the original).

**Start here:**
[Product status (non-technical)](docs/site/status.html) ·
[Landing page](docs/site/index.html) ·
[Savings calculator](docs/site/roi.html) ·
[Cross-domain proof](docs/WEDGE.md) ·
[Databases](docs/DATABASES.md) ·
[Queryable one-pager](docs/QUERYABLE.md) ·
[SDK](docs/SDK.md) ·
[Platform spec](docs/PLATFORM_SPEC.md) ·
[Go-to-market](docs/GO_TO_MARKET.md) ·
[Pricing](docs/PRICING.md) ·
[Licensing](docs/LICENSE-BSL.md) ·
[Roadmap](ROADMAP.md) ·
[Partner pack](partner_pack/README.md)

---

## Benchmark highlights (real data, byte-for-byte lossless)

| Vertical | Dataset | AT-1 | vs xz-9 | vs next-best standard |
|---|---|---:|---:|---|
| **SSH logs** | 73 MB OpenSSH auth log | 1.57 MB | **2.3×** | 2.3× vs xz |
| **Sensor telemetry** | UCI power (133 MB) | 6.18 MB | **1.9×** | 1.9× vs xz |
| **Weather CSV (quoted)** | NOAA ISD station | 508 KB | **1.24×** | beats brotli-11 |
| **Tabular / drive-stats** | Backblaze SMART data | 245 KB | **1.10×** | 1.06× vs trained OpenZL |
| **Financial tick** | Binance BTCUSDT 72 MB | 2.75 MB | **2.09×** | 4.16× vs gzip |
| **Genomics VCF** | 1000-Genomes chr22 48 MB | 231 KB | **1.48×** | **2.48× vs native BCF** |
| **GeoJSON** | Natural Earth countries | 129 KB | **1.12×** | exact whitespace preserved |
| **Map data (OSM)** | Luxembourg 772 MB | 35.7 MB | **1.46×** | 2.47× vs gzip, 32%+ vs PBF |
| **Network telemetry** | Zeek conn.log 60 MB | 13.5 MB | **1.06×** | — |

---

## Supported domains (codec IDs)

| Codec | Domain | Key technique |
|---|---|---|
| `ssh` (0) | OpenSSH / syslog logs | Syslog-prefix columnar split + optional auto-templating |
| `json` (1) | NDJSON event streams | Path-shred tape + cross-field derivation (URL templates, etc.) |
| `osm` (2) | OpenStreetMap XML | Element delta coding (coords ×1e7, epoch-delta timestamps, tag-key vocab) |
| `log` (3) | Generic text logs | Auto-templating: digit-bearing tokens → variables, skeleton deduplicated |
| `columnar` (4) | CSV / TSV / delimited | INT-Δ, fixed-point DEC, NUMEXC, T_SHUF byte-plane, DERIVED, DICT, RFC-4180 re-quoting |
| `vcf` (5) | Genomic VCF | Sparse genotype: only non-reference (sample-idx delta, vocab-id) pairs stored |
| `jsondoc` (6) | Pretty-printed / GeoJSON | Byte-skeleton + per-path value columns; exact whitespace preserved |
| `qcolumnar` (7) / `qjson` (8) | Queryable CSV / NDJSON | Row-group blocks + zone-map footer (+ optional `--bloom` point-lookup filters); queryable in place, byte-exact |
| `dicom` (9) | Medical imaging | Metadata/pixel split; byte-exact on the `.dcm` |
| `embed` (10) | Float arrays / embeddings (`.npy`) | Byte-plane split; lossless ~1.2× where general tools get ~1.08× |
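
The `columnar` row lists INT-Δ, and `at1_core.py` is described as holding varint and zigzag primitives. The sketch below shows how such a chain is commonly built: delta, then zigzag, then a LEB128-style varint. It is an illustration under those assumptions, not the AT-1 wire layout (which `AT1_FORMAT_SPEC.md` defines), and every function name here is hypothetical.

```python
from typing import Iterable, List, Tuple


def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)


def put_varint(out: bytearray, value: int) -> None:
    """Append value with 7 payload bits per byte; high bit means 'more follows'."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def get_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one varint at pos; return (value, next_pos). Rejects truncation."""
    shift = 0
    value = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_int_column(values: Iterable[int]) -> bytes:
    """Store each value as the zigzag varint of its difference from the previous one."""
    out = bytearray()
    prev = 0
    for v in values:
        put_varint(out, zigzag(v - prev))
        prev = v
    return bytes(out)


def decode_int_column(buf: bytes) -> List[int]:
    """Rebuild the original integers by accumulating the decoded deltas."""
    values: List[int] = []
    pos = 0
    prev = 0
    while pos < len(buf):
        z, pos = get_varint(buf, pos)
        prev += unzigzag(z)
        values.append(prev)
    return values
```

For a slowly changing column such as `[1700000000, 1700000001, 1700000003, 1700000002]`, only the first value costs several bytes; each later delta fits in one byte, so the whole column takes 8 bytes before any entropy coder runs.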

---

## Quick start

```bash
pip install .                  # installs the `at1` CLI (xz/lzma is stdlib)
pip install '.[zstd]'          # + optional fast zstd backend

at1 compress auto   data.csv    out.at1   # pick the codec automatically (fingerprint + verified bake-off)
at1 compress auto   data.csv    out.at1 --optimize query   # prefer the queryable codec
at1 compress columnar data.csv  out.at1
at1 compress vcf    genome.vcf  out.at1
at1 compress osm    region.osm  out.at1
at1 compress json   events.ndjson out.at1
at1 compress jsondoc map.geojson out.at1
at1 compress dicom  scan.dcm    out.at1  # metadata/pixel split, byte-exact on the .dcm file
at1 compress ssh    auth.log    out.at1  --stream --chunk-lines 500000
at1 decompress out.at1 data.out
at1 verify columnar data.csv           # compress → decompress → cmp; exits 0 when byte-identical
```

Backend: `--backend xz` (default, max ratio) or `--backend zstd` (throughput).
Streaming (`--stream`): supported for `log`, `columnar`, `ssh`, `vcf`, `json` — bounded
memory regardless of input size.
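
The bounded-memory claim follows from processing a fixed number of lines at a time. The sketch below illustrates only that idea, with plain `lzma` standing in for the structural codec and a hypothetical length-prefixed frame layout that is not the AT-1 container format.

```python
import lzma
import struct
from itertools import islice
from typing import BinaryIO


def compress_stream(src: BinaryIO, dst: BinaryIO, chunk_lines: int = 500_000) -> int:
    """Compress src in chunks of chunk_lines lines; return the number of frames.

    Peak memory is bounded by one chunk, not by the input size.
    """
    frames = 0
    while True:
        chunk = b"".join(islice(src, chunk_lines))  # next chunk_lines lines
        if not chunk:
            break
        payload = lzma.compress(chunk)
        dst.write(struct.pack("<I", len(payload)))  # hypothetical 4-byte length prefix
        dst.write(payload)
        frames += 1
    return frames


def decompress_stream(src: BinaryIO, dst: BinaryIO) -> None:
    """Reverse compress_stream(), one frame at a time."""
    while True:
        header = src.read(4)
        if not header:
            break
        if len(header) != 4:
            raise ValueError("truncated frame header")
        (size,) = struct.unpack("<I", header)
        payload = src.read(size)
        if len(payload) != size:
            raise ValueError("truncated frame payload")
        dst.write(lzma.decompress(payload))
```

Smaller chunks lower peak memory but give the compressor less context per frame, which is the usual trade-off behind a flag like `--chunk-lines`.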

### Ingest, audit, and query-then-extract

```bash
at1 fetch https://logs.example/app.log app.at1   # stream a URL straight to a verified .at1
                                                  #   (line codecs: raw plaintext never lands on disk)
at1 audit data.csv                                # honest compressibility report (STRONG/CAPABILITY/PASS)

# query, then PULL THE RESULT OUT as a new verified, re-queryable .at1 — source never rehydrated:
at1 query t.at1 --where price:44000:44100 --extract hot.at1     # columnar / logs / events
at1 sql   t.at1 "SELECT a,b WHERE ts >= 42000"  --extract sub.at1
at1 query g.at1 --region chr22:16050000-16100000 --extract region.at1   # genomic sub-cohort
#   tables: AppendableTable.extract(out, where=...)   bundles: at1_bundle.subset(in, out, [...])
```

### Queryable, verified media (no pixel re-encode)

```bash
at1 media build camera.mp4 camera.at1vid          # wrap any codec's frames + a per-frame index
at1 media query camera.at1vid motion 1.2          # time / motion / scene-cut / similar
at1 media clip  camera.at1vid event.at1vid 21,22,23,24,25   # pull the matching frames OUT, verified
at1 media get   camera.at1vid 42 frame42.png      # one frame, reading only its bytes
at1 media verify event.at1vid                     # byte-exact; a 1-byte tamper is detected + located
at1 media redact camera.at1vid clean.at1vid 0,1   # blank PII, re-stamp chain-of-custody
```

Readers in JS / Rust / Go: `bindings/media/`. See `CHANGELOG.md` for the per-release feature list.

---

## Native C pipeline

The seven core codecs (IDs 0–6) each have a C **encoder** and a C **decoder** at full
ratio parity with the Python reference.

```
c_encoder/   at1_encode.exe          # columnar (+ --stream, --threads N for MT-xz)
             at1_log_encode.exe       # log       (+ --stream)
             at1_ssh_encode.exe       # ssh       (+ --stream)
             at1_vcf_encode.exe       # vcf
             at1_json_encode.exe      # NDJSON
             at1_jsondoc_encode.exe   # jsondoc / GeoJSON
             at1_osm_encode.exe       # OSM XML

c_decoder/   at1_decode.exe          # every codec (0–12) + RAW + streaming in one binary
```

Build: `cd c_encoder && CC=gcc bash build_and_test.sh` (requires liblzma + libzstd).

**Multithreaded xz (`--threads N`; `0` = all cores):** all C encoders support liblzma's
`lzma_stream_encoder_mt`, intended for multi-GB inputs at xz-9. Measured: a 4.5× speedup
at 12 threads on a 400 MB log (preset 6). See `ENCODE_SPEED.md` for the full speed/ratio frontier.

**Decoder security:** bounds-checked against all attacker-controlled lengths and
indices; malformed `.at1` input exits cleanly (code 2) instead of segfaulting.
Fuzz-tested: 0 crashes in 10,000+ iterations of both a byte-mutation and a
decompressed-content-mutation fuzzer. See `c_decoder/fuzz.py` and `fuzz_streams.py`.
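
As a rough Python analogue of the byte-mutation approach described above, the sketch below fuzzes a toy length-prefixed decoder. The record format, `decode_records`, and `fuzz` are all hypothetical and unrelated to the real `.at1` layout or to the contents of `c_decoder/fuzz.py`; the point is only the contract: every malformed input must be rejected cleanly, never crash.

```python
import random
from typing import Dict, List


class MalformedInput(ValueError):
    """Raised for any input the decoder refuses; the 'clean exit' path."""


def encode_records(records: List[bytes]) -> bytes:
    """Toy format: 1-byte record count, then (1-byte length, payload) pairs."""
    out = bytearray([len(records)])
    for rec in records:
        out.append(len(rec))
        out += rec
    return bytes(out)


def decode_records(buf: bytes) -> List[bytes]:
    """Decode the toy format, checking every attacker-controlled length."""
    if not buf:
        raise MalformedInput("empty input")
    count, pos = buf[0], 1
    records = []
    for _ in range(count):
        if pos >= len(buf):
            raise MalformedInput("missing length byte")
        size = buf[pos]
        pos += 1
        if pos + size > len(buf):
            raise MalformedInput("record length runs past end of input")
        records.append(buf[pos:pos + size])
        pos += size
    if pos != len(buf):
        raise MalformedInput("trailing bytes after last record")
    return records


def fuzz(seed_input: bytes, iterations: int = 10_000, rng_seed: int = 0) -> Dict[str, int]:
    """Overwrite 1 to 3 random bytes per iteration and classify the outcome."""
    rng = random.Random(rng_seed)  # fixed seed keeps runs reproducible
    stats = {"accepted": 0, "rejected": 0, "crashed": 0}
    for _ in range(iterations):
        data = bytearray(seed_input)
        for _ in range(rng.randint(1, 3)):
            data[rng.randrange(len(data))] = rng.randrange(256)
        try:
            decode_records(bytes(data))
            stats["accepted"] += 1
        except MalformedInput:
            stats["rejected"] += 1
        except Exception:  # anything else counts as a decoder bug
            stats["crashed"] += 1
    return stats
```

A run passes when `crashed` stays at zero; `accepted` and `rejected` are both legitimate outcomes for a mutated input.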

---

## Query in place — from the database you already use

The `qcolumnar` codec is **queryable**: a footer of per-block min/max zone maps lets a query skip
row-groups it can rule out and decode only the columns it touches, while the *same* file still
reconstructs the original **byte-for-byte**. One engine-agnostic decode core
(`duckdb_at1/at1_block.c`, ~260 lines, fuzzed 0 crashes, verified byte-identical to the reference)
feeds every engine — directly in C, or through Apache Arrow:

| Engine | Adapter | Status |
|---|---|---|
| **DuckDB** | native **C** extension (`duckdb_at1/capi/`) + C++ ext (CI) + `at1_duckdb.py` | **built + verified** (C, loads into DuckDB 1.5.3) |
| **SQLite** | native C virtual table (`sqlite_at1/`) | **built + verified live** |
| **PostgreSQL** | native C foreign data wrapper (`postgres_at1/`) | **built + verified live** (real PG16, via Docker) |
| **ClickHouse**, **Spark** | Apache Arrow (`at1_arrow.py`) | **verified on the real engines** |
| **Trino / Presto / Flink** | Postgres/JDBC → AT-1 FDW (`connectors/federation/`) | **verified live** (return identical results) |
| **Polars / pandas / Dask** | Apache Arrow | **verified** (`test_arrow_native.py`) |

Eleven engines query AT-1 live across native-C, Arrow, and Postgres/JDBC federation.
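
The zone-map idea behind `qcolumnar` can be sketched in a few lines: keep a min/max per row group, and decompress a group only when its range overlaps the predicate. This is a simplified single-column illustration with hypothetical names (`Block`, `build_blocks`, `range_query`); it is not the on-disk layout or the logic of `at1_block.c`.

```python
import lzma
from array import array
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Block:
    lo: int          # zone map: smallest value in the row group
    hi: int          # zone map: largest value in the row group
    payload: bytes   # compressed column values for this row group


def build_blocks(values: Sequence[int], rows_per_block: int = 100) -> List[Block]:
    """Split a column into row groups and record a min/max zone map for each."""
    blocks = []
    for start in range(0, len(values), rows_per_block):
        chunk = list(values[start:start + rows_per_block])
        packed = lzma.compress(array("q", chunk).tobytes())
        blocks.append(Block(min(chunk), max(chunk), packed))
    return blocks


def range_query(blocks: Sequence[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (matching values, number of blocks actually decompressed)."""
    hits: List[int] = []
    decoded = 0
    for block in blocks:
        if block.hi < lo or block.lo > hi:
            continue  # the zone map rules this block out; its payload is never read
        decoded += 1
        column = array("q")
        column.frombytes(lzma.decompress(block.payload))
        hits.extend(v for v in column if lo <= v <= hi)
    return hits, decoded
```

On a sorted or clustered column, a narrow predicate touches very few blocks; on a shuffled column most zone maps overlap and little can be skipped.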

**New (verified this cycle):**

- An **S3 read-path gateway** (`s3_gateway.py`: SigV4, ListObjectsV2, RFC 7233 Range),
  verified by **DuckDB, Spark, and Trino** doing their own S3 signing.
- **`?select` pushdown** over that gateway: decodes only the touched blocks (4.9% of bytes
  on the demo table, no Parquet materialization).
- **`at1remote.py`**: query a `.at1` in any Range-capable bucket with *nothing deployed*
  (2.5 KB fetched to answer a 1000-row predicate).
- An **84 KB WASM decoder** (`demo/wasm/`, zstd-profile archives) with a browser demo at
  `docs/site/try/`.
- `--keep-queryable`, `--block-backend zstd` (fast-decode profile), and
  `parquet_adapter --cluster-by` (−41% on shuffled events, opt-in).

Real-data experiment log: `docs/EXPERIMENTS_2026-06-09.md`.

**And since then (each line names its verification):**

- The **C decoder now covers every codec, 0–12**, including the multi-template `qjson2`
  and the multi-file **bundle** (`at1-bundle`: many files → one `.at1`, per-entry gates,
  single-entry extraction; C extracts bundles to a directory with per-entry SHA-256
  re-verification). Go/Rust/Node bindings re-vendored, zero drift, `go test` green.
- **Eight installed CLI commands** (`pip install .`):
  - `at1`
  - `at1-doctor` (measured savings scanner)
  - `at1-watch` (auto-tiering with a hash-chained ledger)
  - `at1-live` (stream ingestion, queryable while landing; verified mid-stream by a
    second process)
  - `at1-attest` (contents + history + bytes attestation, with live-verified
    **RFC 3161 trusted timestamps**)
  - `at1-sql` (pushdown SQL)
  - `at1-bundle`
  - `at1-desktop` (local compress/query app)
- The appendable table has **compaction, a hash-chained audit trail, and time-travel
  queries** (`scan_as_of`).
- The gateway has **read-path metering** (bills exactly `bytes_read`) and a live
  **savings ticker** (`/_at1/ticker`).
- `at1_page.py` emits a **single self-contained HTML file that IS a queryable database**
  (email it; opens on a phone, nothing installed).
- `at1_pdf.py` makes **PDF text searchable inside a bundle** (text-layer PDFs;
  scanned/OCR honestly out of scope).
- Standard corpora measured: **ties xz-9 within 42 B/file on Canterbury, Calgary, enwik8**
  (`docs/CORPUS_BENCHMARKS.md`).
- Honest gap that remains: the **C encoder has no queryable codecs** (qcolumnar/qjson
  encode is Python-only; C decode is complete).
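
The hash-chained ledger and audit trail mentioned above rest on a simple property: each entry's hash covers the previous entry's hash, so editing any past entry invalidates everything after it. The sketch below shows that property with SHA-256 over canonical JSON. The entry fields and function names are assumptions made for illustration, not the record format used by `at1-watch` or the appendable table.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the event."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: List[Dict], event: Dict) -> Dict:
    """Append an event, chaining it to the hash of the current last entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    ledger.append(entry)
    return entry


def verify_chain(ledger: List[Dict]) -> int:
    """Return -1 if the chain is intact, else the index of the first bad entry."""
    prev_hash = GENESIS
    for index, entry in enumerate(ledger):
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return index
        prev_hash = entry["hash"]
    return -1
```

Because verification recomputes every hash from the start, it reports where a tamper happened as well as whether one happened.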

```bash
at1 compress qcolumnar trades.csv trades.at1   # queryable, byte-exact
python at1_duckdb.py trades.at1 "SELECT count(*), avg(price) WHERE agg_id BETWEEN 100 AND 200"
python -c "import at1_arrow; print(at1_arrow.to_polars('trades.at1'))"   # any Arrow engine, zero new code
```

Full matrix + reproduce commands: **`docs/ADAPTERS.md`**. The query path is *additive* — it never
re-encodes; the original input is always exactly recoverable from the same file (unlike Parquet/ORC).

---

## Repository structure

```
at1.py                      unified CLI + container (pack/unpack, RAW fallback, streaming)
at1_core.py                 varint / zigzag / vocab primitives
lossless_columnar.py        columnar codec
lossless_ssh.py             SSH/syslog codec
lossless_json_v3.py         NDJSON codec
lossless_osm_v2.py          OSM XML codec
lossless_log.py             generic log codec
lossless_vcf.py             VCF genotype codec
lossless_jsondoc.py         whole-document JSON/GeoJSON codec

c_encoder/                  7 native C encoders + build_and_test.sh + bench.py
c_decoder/                  unified C decoder + Makefile + fuzz test suite (10 vectors)

at1reader.py / at1_duckdb.py / at1_arrow.py   query SDK: scan, DuckDB adapter, universal Arrow bridge
duckdb_at1/                 native C block-decode core (at1_block.c) + DuckDB extension + tests/fuzz
sqlite_at1/                 native SQLite virtual-table adapter (built + verified live)
postgres_at1/               PostgreSQL foreign data wrapper (+ Dockerfile to build/run)
connectors/                 real multi-engine demo: make_demo.py + verify_real.py (ClickHouse, Spark)
bindings/                   Go / Rust / Node native decoder bindings (check_vendor.py keeps them in sync)

AT1_FORMAT_SPEC.md          wire-format specification (all 7 codec IDs, container layouts)
docs/ADAPTERS.md            the engine matrix — one decode core, every database
BENCHMARKS_OPENZL_AND_SPEED.md  head-to-head vs OpenZL, speed frontier
ENCODE_SPEED.md             C encoder speed/ratio table + MT-xz guidance
partner_pack/               partner deliverables (status, wins, GTM, market size, patent addendum)
partner_pack/PRODUCTION_STATUS.md  honest status: what is verified, what is not
PACKAGING.md                pip install docs; live vs legacy module surface
PATENT_APPLICATION.md/.pdf  technical disclosure draft (43 claims, 13 figures)
demo/                       live product demo (real compression + verification + multi-engine query)
test_roundtrip.py           end-to-end regression suite (15 cases, all byte-identical)
```

---

## Key design properties

- **Verified lossless by construction** — the encoder reconstructs and byte-compares
  before emitting; non-conforming records fall through to verbatim storage.
- **Non-inferiority guarantee** — always emits `min(structural, raw-xz)`; worst case
  ties LZMA, never regresses.
- **Per-stream pluggable backends** — xz (max ratio) or zstd (throughput) selectable
  per run; each stream tagged independently in the container.
- **Format spec** — `AT1_FORMAT_SPEC.md` documents every byte; the decoder is
  implementable from the spec alone.
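
The first two properties combine into one decision at encode time, sketched below. This is a minimal illustration: the one-byte tag, `pack`, `unpack`, and the deliberately lossy demo codec are hypothetical, and the real container (per-stream tags, verbatim records, backends) is the one described in `AT1_FORMAT_SPEC.md`.

```python
import lzma
from typing import Callable

RAW, STRUCTURAL = 0, 1  # hypothetical one-byte container tags

Codec = Callable[[bytes], bytes]


def pack(data: bytes, encode: Codec, decode: Codec) -> bytes:
    """Emit the structural encoding only if it verifies AND is smaller than raw xz."""
    raw = bytes([RAW]) + lzma.compress(data, preset=9)
    try:
        candidate = encode(data)
        verified = decode(candidate) == data  # verification gate: byte-compare
    except Exception:
        verified = False  # a failing codec is treated like a failed round trip
    if verified:
        structural = bytes([STRUCTURAL]) + candidate
        if len(structural) < len(raw):  # non-inferiority: keep the smaller one
            return structural
    return raw


def unpack(blob: bytes, decode: Codec) -> bytes:
    """Dispatch on the tag written by pack()."""
    if not blob:
        raise ValueError("empty container")
    tag, body = blob[0], blob[1:]
    if tag == RAW:
        return lzma.decompress(body)
    if tag == STRUCTURAL:
        return decode(body)
    raise ValueError("unknown container tag %d" % tag)


def lossy_encode(data: bytes) -> bytes:
    """Demo codec with a bug: it drops trailing whitespace before compressing."""
    return lzma.compress(data.rstrip())
```

Calling `pack(b"a,b\n", lossy_encode, lzma.decompress)` returns the raw-tagged form, because the round trip loses the trailing newline and the gate rejects it. In this sketch the output is never larger than tagged raw xz, which is the sense in which the worst case ties the baseline.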
