Metadata-Version: 2.4
Name: sparktimise
Version: 1.0.1
Summary: A Python wrapper that analyses DataFrames and applies optimisation techniques to maximise PySpark session performance.
Author: Keilan Evans
License: MIT
Keywords: pyspark,spark,optimisation,dataframe,RAP
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: pyarrow>=10.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: boto3>=1.26
Requires-Dist: botocore>=1.29
Requires-Dist: tomli>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Dynamic: license-file

# sparktimise

A PySpark optimisation library that inspects DataFrames and applies targeted performance improvements with minimal user code changes.

![Python](https://img.shields.io/badge/python-3.10%2B-blue)
![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000)

## What The Package Does

sparktimise provides two ways to optimise PySpark jobs:

1. A decorator workflow through @optimise for automatic orchestration.
2. A functional workflow through analyse_* and optimise_* functions for explicit control.

| Capability | Description | Primary API |
|---|---|---|
| Schema optimisation | Detects downcast opportunities and safe string-to-type casts | optimise_schema |
| Partition optimisation | Estimates optimal shuffle partitions and low-cardinality partition candidates | optimise_partitions |
| Skew mitigation | Detects skewed keys and applies salting columns | optimise_skew |
| Cache strategy | Recommends and applies StorageLevel persistence | optimise_cache |
| Spark session tuning | Recommends and applies Spark SQL/session settings | optimise_auto_tuning / optimise_spark_session |
| Broadcast analysis | Profiles table sizes for join strategy advice and optional hints | analyse_broadcast / apply_broadcast_hints |
| Reporting | Summarises pipeline steps and metadata in text or dict form | OptimisationReport |
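
To make the schema optimisation row more concrete, the sketch below shows one way a downcast opportunity could be detected for an integer column: compare the observed minimum and maximum against the range of each Spark SQL integer type and pick the narrowest one that fits. The helper `smallest_integer_type` and the `INTEGER_RANGES` table are hypothetical names written for this README; optimise_schema may apply different rules.

```python
# Illustrative sketch only; not sparktimise's actual implementation.
# Inclusive value ranges of Spark SQL's signed integer types, narrowest first.
INTEGER_RANGES = [
    ("tinyint", -2**7, 2**7 - 1),
    ("smallint", -2**15, 2**15 - 1),
    ("int", -2**31, 2**31 - 1),
    ("bigint", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value: int, max_value: int) -> str:
    """Return the narrowest integer type that can hold [min_value, max_value]."""
    for name, low, high in INTEGER_RANGES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")


# A bigint column whose values span 0..40_000 fits in int
# (smallint stops at 32_767).
print(smallest_integer_type(0, 40_000))  # int
print(smallest_integer_type(-5, 100))    # tinyint
```

In practice the minimum and maximum would come from a profiling pass over the column, possibly on a sample, which is why a conservative implementation might leave some headroom before narrowing a type.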

## Installation

```bash
pip install sparktimise
```

For local development:

```bash
pip install -e ".[dev]"
pip install pyspark
```

## Quick Start

### Decorator-first usage

```python
from pyspark.sql import SparkSession
from sparktimise.decorators import optimise

spark = SparkSession.builder.appName("orders-job").getOrCreate()

@optimise(
    schema=True,
    partitions=True,
    skew=["customer_id"],
    cache=True,
    auto_tuning=True,
    print_report=True,
)
def load_orders():
    return spark.read.parquet("s3a://my-bucket/orders/")

orders_df = load_orders()
```
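
The `skew=["customer_id"]` argument above requests skew analysis and salting for that column. As a rough, Spark-free illustration of the general idea, the sketch below flags keys that account for a large share of rows and appends a salt so those rows spread over several groups. The function names, the 0.5 threshold, and the round-robin salt are assumptions chosen to keep the example deterministic; they are not taken from sparktimise, and real salting often uses a random salt.

```python
# Illustrative sketch only; plain Python lists stand in for a DataFrame column.
from collections import Counter


def find_skewed_keys(keys, threshold=0.5):
    """Return the keys whose share of all rows exceeds `threshold`."""
    counts = Counter(keys)
    total = len(keys)
    return {key for key, count in counts.items() if count / total > threshold}


def salt_keys(keys, skewed, buckets=4):
    """Append a round-robin salt to skewed keys; leave other keys unchanged."""
    seen = Counter()
    salted = []
    for key in keys:
        if key in skewed:
            salted.append(f"{key}_{seen[key] % buckets}")
            seen[key] += 1
        else:
            salted.append(str(key))
    return salted


customer_ids = ["c1"] * 8 + ["c2", "c3"]
skewed = find_skewed_keys(customer_ids, threshold=0.5)
salted = salt_keys(customer_ids, skewed, buckets=4)

print(skewed)                         # {'c1'}
print(max(Counter(salted).values()))  # 2 (largest group was 8 rows before salting)
```

The trade-off is that any join or aggregation on the salted key needs a matching step on the other side (or a second aggregation) to recombine the salted groups.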

### Functional usage

```python
from sparktimise.optimisation import optimise_schema, optimise_partitions

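# df is an existing Spark DataFrame (for example, the result of spark.read.parquet).
step1 = optimise_schema(df, sample_fraction=0.05)
# 134_217_728 bytes = 128 MiB per partition.
step2 = optimise_partitions(step1.df, target_partition_bytes=134_217_728)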

optimised_df = step2.df
print(step1.transformations_applied)
print(step2.transformations_applied)
```
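
When several functional steps are chained, the hand-off of `.df` from one result to the next can be wrapped in a small helper. The sketch below is a hypothetical pattern, not part of sparktimise: `StepResult` is a stand-in exposing only the two attributes used above, `run_steps` is an invented name, and it assumes `transformations_applied` is a list. Plain lists replace Spark DataFrames so the example runs without a cluster.

```python
# Illustrative sketch only; StepResult and run_steps are hypothetical.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepResult:
    """Stand-in for an optimiser result with the attributes used above."""
    df: Any
    transformations_applied: list = field(default_factory=list)


def run_steps(df: Any, steps: list[Callable[[Any], StepResult]]):
    """Feed each step the previous step's df and collect what was applied."""
    applied = []
    for step in steps:
        result = step(df)
        df = result.df
        applied.extend(result.transformations_applied)
    return df, applied


# Dummy steps that operate on a list instead of a Spark DataFrame.
def drop_negatives(rows):
    return StepResult([r for r in rows if r >= 0], ["drop_negatives"])


def double(rows):
    return StepResult([r * 2 for r in rows], ["double"])


final, applied = run_steps([1, -2, 3], [drop_negatives, double])
print(final)    # [2, 6]
print(applied)  # ['drop_negatives', 'double']
```

With the real optimisers, each step could be supplied as `functools.partial(optimise_schema, sample_fraction=0.05)` or similar, provided the result objects behave as assumed here.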

### Pandas return values

If a decorated function returns a pandas DataFrame, pass a SparkSession to @optimise through the spark parameter. Without a session, the result cannot be converted to a Spark DataFrame.

```python
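import pandas as pd
from pyspark.sql import SparkSession
from sparktimise.decorators import optimise

# Reuse the active session if one exists, otherwise create one.
spark = SparkSession.builder.getOrCreate()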

@optimise(schema=True, spark=spark)
def build_lookup() -> pd.DataFrame:
    return pd.DataFrame({"id": [1, 2, 3]})
```

## Configuration And Runtime Controls

Most runtime behaviour is configured through the parameters of the @optimise decorator.

| Parameter | Type | Default | Effect |
|---|---|---|---|
| schema | bool | True | Enables schema analysis and casting pipeline step |
| auto_tuning | bool | False | Applies Spark session recommendations |
| partitions | bool | True | Enables shuffle partition analysis and tuning |
| skew | list[str] \| None | None | Enables skew analysis/salting for listed columns |
| cache | bool | False | Enables persistence strategy step |
| sample_fraction | float \| None | None | Sampling fraction used by schema optimiser |
| parse_temporal | bool | True | Allows string columns to be inferred and cast as date/timestamp |
| max_string_columns | int \| None | None | Caps number of string columns profiled |
| target_partition_bytes | int \| None | None | Target partition size hint for partition optimiser |
| cache_storage_level | str \| None | None | Explicit cache level, for example MEMORY_AND_DISK |
| print_report | bool | False | Logs a formatted optimisation report |
| spark | SparkSession \| None | None | Session used for orchestration and pandas coercion |

For full configuration details, including file-backed config loading and Spark recommendation settings, see [docs/configuration.md](docs/configuration.md).
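
As a worked example of what a value such as target_partition_bytes implies, the sketch below divides an estimated dataset size by the target partition size and rounds up. This is the general arithmetic only; the function name `estimate_shuffle_partitions` and the `min_partitions` floor are assumptions for illustration, and the partition optimiser may weigh other factors.

```python
# Illustrative sketch only; not sparktimise's actual estimation logic.
import math

MIB = 1024 * 1024


def estimate_shuffle_partitions(
    total_bytes: int,
    target_partition_bytes: int = 128 * MIB,  # 134_217_728 bytes
    min_partitions: int = 1,
) -> int:
    """Return ceil(total_bytes / target_partition_bytes), never below the floor."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))


ten_gib = 10 * 1024**3
print(estimate_shuffle_partitions(ten_gib))            # 80
print(estimate_shuffle_partitions(ten_gib, 64 * MIB))  # 160
print(estimate_shuffle_partitions(1))                  # 1
```

A number obtained this way is the kind of value that would typically be applied to Spark's `spark.sql.shuffle.partitions` setting.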

## Architecture And Process Flow

sparktimise follows a hybrid pattern:

1. Functional core: analyser and optimiser functions.
2. Imperative shell: @optimise orchestration.
3. OOP boundaries: adapters, config, and reporting.

Detailed architecture and sequence diagrams are documented in [docs/architecture.md](docs/architecture.md).
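
For readers unfamiliar with the functional core / imperative shell split, the following generic sketch shows the shape of the pattern: small pure functions do the work, and a decorator decides which of them to run on a function's return value. Every name here (`strip_whitespace`, `drop_empty`, `pipeline`) is invented for illustration and operates on plain lists; it is not how @optimise is implemented, and the OOP boundary layer is omitted for brevity.

```python
# Illustrative sketch only; a generic pattern, not sparktimise internals.
from functools import wraps


# Functional core: pure functions that take data and return new data.
def strip_whitespace(rows):
    return [row.strip() for row in rows]


def drop_empty(rows):
    return [row for row in rows if row]


# Imperative shell: a decorator that orchestrates the core steps by flag.
def pipeline(strip=True, drop=True):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            if strip:
                rows = strip_whitespace(rows)
            if drop:
                rows = drop_empty(rows)
            return rows
        return wrapper
    return decorator


@pipeline(strip=True, drop=True)
def load_names():
    return [" ada ", "", "grace"]


print(load_names())  # ['ada', 'grace']
```

Keeping the steps pure means each one can be tested or called on its own, which mirrors the choice between the decorator workflow and the functional workflow described earlier.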

## Documentation Map

| Document | Purpose |
|---|---|
| [docs/README.md](docs/README.md) | Documentation index |
| [docs/usage.md](docs/usage.md) | End-to-end usage patterns and examples |
| [docs/configuration.md](docs/configuration.md) | Configuration variables and file formats |
| [docs/architecture.md](docs/architecture.md) | Internal design and process flow |
| [docs/troubleshooting.md](docs/troubleshooting.md) | Common setup/runtime/CI issues and fixes |
| [CHANGELOG.md](CHANGELOG.md) | Versioned release and change history |

## Development Setup

### Requirements

| Dependency | Purpose |
|---|---|
| Python 3.10+ | Runtime and tooling |
| Java (JRE/JDK) | Required for Spark tests |
| PySpark | Required at runtime for Spark operations; installed separately because it is not declared as a package dependency |

### Local setup

```bash
python -m pip install -U pip
python -m pip install -e ".[dev]"
python -m pip install pyspark
```

### Quality checks

```bash
ruff check src/sparktimise/ tests/
ruff format --check src/sparktimise/ tests/
mypy src/sparktimise/
```

### Tests

```bash
# Unit tests (default)
python -m pytest tests/unit/ --tb=short -q

# Integration tests (requires Java + Spark)
python -m pytest tests/integration/ --run-spark --spark-smoke-timeout 60 --tb=short -q
```

### Build package artifacts

```bash
python -m pip install build
python -m build --sdist --wheel
```

Artifacts are created under dist/.

## License

MIT
