Metadata-Version: 2.4
Name: striq
Version: 0.1.1
Summary: Python bindings for STRIQ — lossy time-series compression with algebraic queries on compressed data
Author: Nahum Ochoa
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/NahumResearch/py-striq
Project-URL: Repository, https://github.com/NahumResearch/py-striq
Project-URL: PyPI, https://pypi.org/project/striq/
Project-URL: C Library, https://github.com/NahumResearch/striq
Keywords: time-series,compression,IoT,sensor,query,pla,chebyshev
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: C
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Database
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cffi>=1.17
Requires-Dist: numpy>=1.26
Provides-Extra: pandas
Requires-Dist: pandas>=2.1; extra == "pandas"
Provides-Extra: all
Requires-Dist: pandas>=2.1; extra == "all"
Dynamic: license-file

# striq

**Compress time series. Query without decompressing. From Python.**

![License](https://img.shields.io/badge/license-Apache%202.0-blue)
![PyPI](https://img.shields.io/pypi/v/striq)
![Python](https://img.shields.io/pypi/pyversions/striq)
![CI](https://img.shields.io/github/actions/workflow/status/NahumResearch/py-striq/ci.yml?branch=main&label=CI)
![Language](https://img.shields.io/badge/language-Python%20%2B%20C11-orange)

Python bindings for [libstriq](https://github.com/NahumResearch/striq), a C11 library that compresses floating-point sensor data with user-controlled error bounds and executes aggregate queries **directly on compressed data** in sub-microsecond time.

> Your IoT gateway generates 1 GB/day of sensor readings. With `striq` you compress it to ~150 MB from Python and still compute last month's average temperature in 0.4 us, without decompressing a single byte.
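
To make the idea of an aggregate computed "directly on compressed data" concrete, here is a toy sketch in pure Python. It is not libstriq's actual codec or file layout: the greedy segment fitting, the `Segment` tuple, and both function names are assumptions made for illustration. It approximates a series with line segments that stay within `epsilon` of every point, then computes the mean from one closed-form sum per segment instead of touching each row.

```python
from typing import List, Tuple

# (start_index, length, first_value, slope): a toy layout, not the .striq format
Segment = Tuple[int, int, float, float]


def pla_compress(values: List[float], epsilon: float) -> List[Segment]:
    """Greedy piecewise-linear approximation with max error <= epsilon."""
    segments: List[Segment] = []
    start = 0
    n = len(values)
    while start < n:
        end = start  # last index covered by the current segment
        slope = 0.0
        # Try to extend the segment one point at a time.
        while end + 1 < n:
            cand = end + 1
            cand_slope = (values[cand] - values[start]) / (cand - start)
            # The line runs from the first point to the candidate point;
            # every point in between must stay within epsilon of it.
            ok = all(
                abs(values[start] + cand_slope * (i - start) - values[i]) <= epsilon
                for i in range(start + 1, cand)
            )
            if not ok:
                break
            end, slope = cand, cand_slope
        segments.append((start, end - start + 1, values[start], slope))
        start = end + 1
    return segments


def mean_from_segments(segments: List[Segment]) -> float:
    """Mean of the approximated series, one closed-form sum per segment."""
    total = 0.0
    count = 0
    for _, length, first, slope in segments:
        # sum over k = 0..length-1 of (first + slope * k)
        total += length * first + slope * length * (length - 1) / 2
        count += length
    return total / count


# A rising ramp followed by a falling ramp: 100 readings.
readings = [20.0 + 0.1 * i for i in range(50)] + [25.0 - 0.05 * i for i in range(50)]
segments = pla_compress(readings, epsilon=0.01)
approx_mean = mean_from_segments(segments)
exact_mean = sum(readings) / len(readings)
print(len(readings), "rows ->", len(segments), "segments")
print(abs(approx_mean - exact_mean) <= 0.01)
```

Because every reconstructed point is within `epsilon` of the original, the mean of the approximation is also within `epsilon` of the true mean, which is the kind of bound the `+/-` in a query result expresses.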

---

## At a Glance

Jena Climate dataset, 420 551 rows, column `p (mbar)`, epsilon = 0.01:

| Metric | STRIQ | Gorilla | LZ4 | Zstd |
|---|---|---|---|---|
| Compressed size | **0.9 MB** | 2.4 MB | 1.3 MB | 0.8 MB |
| Compression ratio | **3.43x** | 1.32x | 2.38x | 4.02x |
| `mean()` latency | **0.4 us** | 34,267 us* | 979 us* | 3,427 us* |
| Encode throughput | 615 MB/s | 92 MB/s | 564 MB/s | 331 MB/s |

Raw column: 3.2 MB (420 551 x float64). *Gorilla, LZ4, and Zstd require full decompression before any query.

Typical climate/weather columns land in the 2.5x-3.5x range. High-autocorrelation signals (near-constant readings) reach 7x. See the [C library benchmarks](https://github.com/NahumResearch/striq#detailed-benchmarks) for the full dataset breakdown.
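
The sizes in the table above can be cross-checked with a few lines of arithmetic. This sketch assumes the table's "MB" are binary megabytes (2^20 bytes), which is what the published numbers are consistent with. The ratios are copied from the table.

```python
ROWS = 420_551
BYTES_PER_FLOAT64 = 8

raw_bytes = ROWS * BYTES_PER_FLOAT64   # 3_364_408 bytes
raw_mib = raw_bytes / 2**20            # about 3.21, shown as 3.2 MB in the table

# Compression ratios copied from the table above.
ratios = {"STRIQ": 3.43, "Gorilla": 1.32, "LZ4": 2.38, "Zstd": 4.02}

for codec, ratio in ratios.items():
    # compressed size = raw size / compression ratio
    print(f"{codec:8s} {raw_mib / ratio:.1f} MB")
```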

---

## Install

### From PyPI

```bash
pip install striq            # core (numpy)
pip install striq[all]       # + pandas
```

### From source (development)

```bash
git clone https://github.com/NahumResearch/py-striq.git
cd py-striq
git submodule update --init   # pulls the C library into vendor/striq/
pip install -e ".[all]"
```

You can also point to a local checkout of the C library:

```bash
STRIQ_C_ROOT=/path/to/striq pip install -e ".[all]"
```

---

## Quick Start

### Write from a DataFrame

```python
import striq
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "temp": np.random.normal(23, 2, 10_000),
    "humidity": np.random.normal(65, 5, 10_000),
}, index=pd.date_range("2025-01-01", periods=10_000, freq="s"))

striq.from_dataframe(df, "sensor.striq", epsilon=0.5)
```

### Query

```python
with striq.open("sensor.striq") as r:
    print(r.mean("temp"))       # Result(23.01 +/- 0.50, 10000 rows, 100% algebraic)
    print(r.min("temp"))
    print(r.max("temp"))
    print(r.count())

    # Flexible time inputs
    r.mean("temp", since="2025-01-01 02:00")
    r.mean("temp", last="1h")

    # Conditional aggregates
    r.mean_where("temp", "> 25")

    # Scan rows into a DataFrame
    df = r.scan(columns=["temp", "humidity"], max_rows=5000)
```

### Downsample + matplotlib

```python
import matplotlib.pyplot as plt

with striq.open("sensor.striq") as r:
    ts, vals = r.downsample("temp", n=200)
    plt.plot(ts.astype("datetime64[ns]"), vals)
    plt.title("Temperature (200-point downsample)")
    plt.show()
```
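
For intuition about what a PLA-evaluated downsample involves, the following pure-Python sketch evaluates a list of line segments at `n` equidistant timestamps. The segment layout `(t_start, t_end, v_start, v_end)` and the function name `downsample_segments` are hypothetical and are not part of the `striq` API. The library's own implementation may differ.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical segment layout: (t_start, t_end, v_start, v_end), sorted by time.
Segment = Tuple[int, int, float, float]


def downsample_segments(segments: List[Segment], n: int) -> Tuple[List[float], List[float]]:
    """Evaluate piecewise-linear segments at n equidistant timestamps."""
    t0, t1 = segments[0][0], segments[-1][1]
    starts = [seg[0] for seg in segments]
    timestamps: List[float] = []
    values: List[float] = []
    for k in range(n):
        # Spread the n sample points evenly over [t0, t1].
        t = t0 + (t1 - t0) * k / (n - 1) if n > 1 else float(t0)
        # Pick the last segment whose start is <= t.
        ts_start, ts_end, v_start, v_end = segments[bisect_right(starts, t) - 1]
        if ts_end == ts_start:
            v = v_start
        else:
            # Clamp so that a gap after a segment holds its last value.
            frac = min(1.0, (t - ts_start) / (ts_end - ts_start))
            v = v_start + (v_end - v_start) * frac
        timestamps.append(t)
        values.append(v)
    return timestamps, values


segments = [(0, 10, 20.0, 25.0), (10, 30, 25.0, 21.0)]
ts, vals = downsample_segments(segments, n=4)
print(ts)    # [0.0, 10.0, 20.0, 30.0]
print(vals)  # [20.0, 25.0, 23.0, 21.0]
```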

### In-Memory Streaming (Store)

```python
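import time
import striq

# Example input: (timestamp in ns, value) pairs; replace with your own source
start_ns = time.time_ns()
stream = [(start_ns + i * 1_000_000_000, 23.0 + 0.1 * i) for i in range(10)]

with striq.Store({"temp": striq.Float64}, epsilon=0.5) as s:
    for ts_ns, value in stream:
        s.push(ts_ns, [value])   # one row: timestamp + one value per column
    print(s.mean("temp"))

# With cold file persistence
with striq.Store({"temp": striq.Float64},
                  epsilon=0.5,
                  cold_path="archive.striq") as s:
    s.push(time.time_ns(), [23.5])
    s.sync()                     # flush warm blocks to the cold file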

### Inspect and Verify

```python
info = striq.inspect("sensor.striq")
print(info.total_rows, info.compression_ratio)

result = striq.verify("sensor.striq")
print(result.ok)
```

---

## When to Use

**Good fit:**
- IoT / industrial sensor data (temperature, pressure, vibration, RPM)
- Time series with moderate-to-high autocorrelation (most physical sensors)
- Queryable compressed archives without a decompression step
- Edge devices with limited storage: Raspberry Pi, gateways, embedded Linux
- Cold storage that still needs to answer aggregate queries
- Python data pipelines that need compact time-series storage with pandas/numpy

**Not the right tool:**
- General-purpose file compression (use zstd)
- String or categorical data (STRIQ is numeric only)
- Lossless-only requirements with zero tolerance for approximation error
- Data without a timestamp axis (STRIQ assumes time-ordered input)
- Real-time OLAP workloads (use ClickHouse, TimescaleDB, DuckDB)
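
Since time-ordered input is assumed, it can be worth checking the order before writing. A minimal sketch in pure Python follows. The helper names are hypothetical, and it assumes that non-decreasing timestamps are acceptable; whether duplicate timestamps are allowed is not stated here.

```python
from typing import List, Sequence, Tuple


def is_time_ordered(timestamps: Sequence[int]) -> bool:
    """True when timestamps never decrease."""
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))


def sort_by_time(
    timestamps: Sequence[int], values: Sequence[float]
) -> Tuple[List[int], List[float]]:
    """Return both sequences reordered by timestamp (stable for ties)."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    return [timestamps[i] for i in order], [values[i] for i in order]


# Out-of-order readings, timestamps in nanoseconds.
ts_ns = [3_000_000_000, 1_000_000_000, 2_000_000_000]
temps = [23.4, 23.1, 23.2]

if not is_time_ordered(ts_ns):
    ts_ns, temps = sort_by_time(ts_ns, temps)

print(ts_ns)  # [1000000000, 2000000000, 3000000000]
print(temps)  # [23.1, 23.2, 23.4]
```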

---

## Capabilities

| Operation | Description | Complexity |
|---|---|---|
| `mean`, `sum`, `min`, `max`, `count`, `variance` | Algebraic aggregates over any time range | O(blocks) |
| `mean_where(col, "> 25")` | Conditional aggregate with predicate shorthand | O(segments) |
| `value_at(ts)` | Point lookup at a single timestamp | O(1) block |
| `scan(columns)` | Extract rows for a time range as DataFrame | O(rows) |
| `downsample(col, n=200)` | N equidistant PLA-evaluated points as numpy arrays | O(segments) |
| `inspect(path)` | File metadata (rows, columns, epsilon, codecs) | O(1) |
| `verify(path)` | CRC-32C integrity check of every block | O(file) |
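
As background for the `verify` row, this is a bitwise reference implementation of CRC-32C (the Castagnoli polynomial) in pure Python. It only illustrates the checksum itself: how `striq` frames blocks and where it stores their checksums is not shown, and the "stored" value below is a stand-in for whatever a file would record.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


block = b"123456789"
stored = crc32c(block)               # stand-in for a checksum recorded at write time
corrupted = b"123456780"             # same block with one byte changed

print(hex(stored))                   # 0xe3069283, the standard CRC-32C check value
print(crc32c(block) == stored)       # True
print(crc32c(corrupted) == stored)   # False
```

A table-driven or hardware-accelerated version would be much faster. The bitwise form is used here only because it is short and easy to check.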

---

## API Reference

### Top-level Functions

| Function | Description |
|---|---|
| `striq.Writer(path, columns, epsilon)` | Create a .striq file |
| `striq.open(path)` / `striq.Reader(path)` | Open for querying |
| `striq.Store(columns, epsilon, cold_path)` | In-memory + optional cold file |
| `striq.from_dataframe(df, path)` | DataFrame to .striq |
| `striq.from_csv(csv, path)` | CSV to .striq |
| `striq.to_dataframe(path)` | .striq to DataFrame |
| `striq.inspect(path)` | File metadata |
| `striq.verify(path)` | CRC integrity check |

### Reader Query Methods

| Method | Returns |
|---|---|
| `r.mean(col)` | `Result` |
| `r.sum(col)` | `Result` |
| `r.min(col)` | `Result` |
| `r.max(col)` | `Result` |
| `r.variance(col)` | `Result` |
| `r.count()` | `int` |
| `r.mean_where(col, "> 25")` | `Result` |
| `r.downsample(col, n=200)` | `(timestamps, values)` numpy arrays |
| `r.value_at(ts)` | `(values, errors)` numpy arrays |
| `r.scan(columns)` | `pd.DataFrame` |

All query methods accept flexible time arguments: `ts_from`, `ts_to`, `since` (datetime/ISO/int), `last` (`"1h"`, `"7d"`).
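
To illustrate how a `last` shorthand could map onto a `ts_from`/`ts_to` window, here is a hypothetical parser. It is not the library's implementation: the unit table (only `"h"` and `"d"` appear above), the function names, and the choice to anchor the window at the newest timestamp rather than the wall clock are all assumptions.

```python
import re

# Assumed unit table; only "h" and "d" are shown in the examples above.
_UNIT_NS = {
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3_600 * 1_000_000_000,
    "d": 86_400 * 1_000_000_000,
}


def parse_last(spec: str) -> int:
    """Convert a shorthand such as "1h" or "7d" to a span in nanoseconds."""
    match = re.fullmatch(r"(\d+)([smhd])", spec.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {spec!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_NS[unit]


def window_for_last(spec: str, newest_ts_ns: int) -> tuple[int, int]:
    """(ts_from, ts_to) window ending at the newest timestamp."""
    return newest_ts_ns - parse_last(spec), newest_ts_ns


print(parse_last("1h"))   # 3600000000000
print(parse_last("7d"))   # 604800000000000
```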

### Store Methods

| Method | Description |
|---|---|
| `s.push(ts_ns, values)` | Append one row |
| `s.push_rows(timestamps, values)` | Batch append from numpy |
| `s.push_dataframe(df)` | Batch append from pandas |
| `s.mean(col)` / `s.min(col)` / `s.max(col)` | Aggregate over warm + cold |
| `s.count()` | Row count |
| `s.sync()` | Flush warm blocks to cold file |

---

## Requirements

- Python >= 3.11
- C11 compiler (clang, gcc, or MSVC)
- numpy >= 1.26
- Optional: pandas >= 2.1

Supported platforms: Linux, macOS (Apple Silicon + Intel), Windows.

---

## Links

- **Python bindings**: <https://github.com/NahumResearch/py-striq>
- **C library**: <https://github.com/NahumResearch/striq>

## License

Apache 2.0 — Nahum Ochoa. See [LICENSE](LICENSE).
