Metadata-Version: 2.4
Name: scd-matching-plugin
Version: 0.1.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Dist: polars==1.31.*
Requires-Dist: pyarrow>=10.0.0
License-File: LICENSE
Summary: High-performance Rust-based plugin for Polars that performs epidemiologically sound case-control matching with proper risk-set sampling methodology
Keywords: epidemiology,case-control,matching,polars,rust,scd
Author: Tobias Kragholm
License: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# SCD Polars Matching Plugin

A high-performance Rust-based plugin for Polars that performs epidemiologically sound case-control matching with proper risk-set sampling methodology to avoid immortal time bias.

## Overview

This plugin implements time-to-event methodology for matching severe chronic disease (SCD) cases with controls, processing cases chronologically and ensuring controls are eligible at the time of each case's diagnosis.

## Installation

```bash
pip install scd-matching-plugin
```

Or build from source:
```bash
maturin develop
```

## Usage

```python
from matching_plugin import complete_scd_matching_workflow

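# mfr_df, lpr_df, and vital_df are Polars DataFrames in the layouts
# described under "Input Data Formats".
# Run the complete matching workflow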
matched_df = complete_scd_matching_workflow(
    mfr_data=mfr_df,
    lpr_data=lpr_df,
    vital_data=vital_df,  # Optional
    matching_ratio=5,
    birth_date_window_days=30,
    parent_birth_date_window_days=365,
    match_parent_birth_dates=True,
    match_parity=True
)
```

## Input Data Formats

### MFR Data (Birth Registry)
Required columns:
- `PNR`: Person identifier (string)
- `FOEDSELSDATO`: Birth date (date)
- `CPR_MODER`: Mother's identifier (string)
- `CPR_FADER`: Father's identifier (string)
- `MODER_FOEDSELSDATO`: Mother's birth date (date)
- `FADER_FOEDSELSDATO`: Father's birth date (date)
- `PARITET`: Birth order/parity (integer)

**Example MFR Data:**
```
┌─────────────┬──────────────┬─────────────┬─────────────┬────────────────────┬────────────────────┬─────────┐
│ PNR         ┆ FOEDSELSDATO ┆ CPR_MODER   ┆ CPR_FADER   ┆ MODER_FOEDSELSDATO ┆ FADER_FOEDSELSDATO ┆ PARITET │
├─────────────┼──────────────┼─────────────┼─────────────┼────────────────────┼────────────────────┼─────────┤
│ person_0001 ┆ 1995-01-15   ┆ mother_0001 ┆ father_0001 ┆ 1970-03-22         ┆ 1968-07-10         ┆ 1       │
│ person_0002 ┆ 1995-02-20   ┆ mother_0002 ┆ father_0002 ┆ 1972-11-15         ┆ 1969-05-03         ┆ 2       │
│ person_0003 ┆ 1995-03-10   ┆ mother_0003 ┆ father_0003 ┆ 1973-08-07         ┆ 1971-12-25         ┆ 1       │
└─────────────┴──────────────┴─────────────┴─────────────┴────────────────────┴────────────────────┴─────────┘
```

### LPR Data (Patient Registry)
Columns (required unless marked optional):
- `PNR`: Person identifier (string)
- `SCD_STATUS`: Disease status ("SCD", "SCD_LATE", "NO_SCD")
- `SCD_DATE`: Diagnosis date (date, null for non-cases)
- `ICD_CODE`: Diagnosis code (string, optional)

**Example LPR Data:**
```
┌─────────────┬────────────┬────────────┬──────────┐
│ PNR         ┆ SCD_STATUS ┆ SCD_DATE   ┆ ICD_CODE │
├─────────────┼────────────┼────────────┼──────────┤
│ person_0001 ┆ SCD        ┆ 1997-06-15 ┆ D57.1    │
│ person_0002 ┆ NO_SCD     ┆ null       ┆ null     │
│ person_0003 ┆ SCD_LATE   ┆ 2001-03-22 ┆ D57.0    │
│ person_0004 ┆ NO_SCD     ┆ null       ┆ null     │
└─────────────┴────────────┴────────────┴──────────┘
```
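
The column rules above can be checked before matching. The following is a minimal plain-Python sketch, not part of the plugin: `lpr_problems` is a hypothetical helper, and the rule that every case must carry an `SCD_DATE` is an assumption inferred from "null for non-cases".

```python
from datetime import date

# Rows mirror the example LPR table: (PNR, SCD_STATUS, SCD_DATE).
lpr_rows = [
    ("person_0001", "SCD", date(1997, 6, 15)),
    ("person_0002", "NO_SCD", None),
    ("person_0003", "SCD_LATE", date(2001, 3, 22)),
    ("person_0004", "NO_SCD", None),
]

VALID_STATUSES = {"SCD", "SCD_LATE", "NO_SCD"}


def lpr_problems(rows):
    """Return (PNR, message) pairs for rows that break the column rules."""
    problems = []
    for pnr, status, scd_date in rows:
        if status not in VALID_STATUSES:
            problems.append((pnr, f"unknown SCD_STATUS {status!r}"))
        elif status == "NO_SCD" and scd_date is not None:
            problems.append((pnr, "non-case has an SCD_DATE"))
        elif status != "NO_SCD" and scd_date is None:
            # Assumption: cases always need a diagnosis date.
            problems.append((pnr, "case is missing SCD_DATE"))
    return problems


print(lpr_problems(lpr_rows))  # [] for the example table
```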

### Vital Events Data (Optional)
Required columns when vital data is provided:
- `PNR`: Person identifier (string)
- `EVENT`: Event type ("DEATH", "EMIGRATION")
- `EVENT_DATE`: Event date (date)
- `ROLE`: Individual role ("CHILD", "PARENT")

**Example Vital Events Data:**
```
┌─────────────┬────────────┬────────────┬────────┐
│ PNR         ┆ EVENT      ┆ EVENT_DATE ┆ ROLE   │
├─────────────┼────────────┼────────────┼────────┤
│ person_0001 ┆ EMIGRATION ┆ 1999-12-01 ┆ CHILD  │
│ mother_0002 ┆ DEATH      ┆ 1998-07-15 ┆ PARENT │
│ person_0004 ┆ DEATH      ┆ 2000-03-10 ┆ CHILD  │
│ father_0001 ┆ EMIGRATION ┆ 1997-11-20 ┆ PARENT │
└─────────────┴────────────┴────────────┴────────┘
```

### Data Relationships
- **MFR and LPR**: Must be joined on `PNR` to combine birth registry and patient data
- **Vital Events**: Optional supplementary data that tracks death/emigration events
- **Parent Links**: `CPR_MODER` and `CPR_FADER` in MFR link to parent `PNR` values in vital events
- **Temporal Logic**: All dates must be proper date types for chronological processing
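
To illustrate the parent links, here is a small plain-Python sketch that collects the vital events of a child and both linked parents, using the example rows above. `family_events` is a hypothetical helper for illustration only. How the plugin itself uses parent events is not shown here.

```python
from datetime import date

# PNR -> (CPR_MODER, CPR_FADER), taken from the example MFR table.
mfr_parents = {
    "person_0001": ("mother_0001", "father_0001"),
    "person_0002": ("mother_0002", "father_0002"),
    "person_0003": ("mother_0003", "father_0003"),
}

# (PNR, EVENT, EVENT_DATE, ROLE), taken from the example vital events table.
vital_rows = [
    ("person_0001", "EMIGRATION", date(1999, 12, 1), "CHILD"),
    ("mother_0002", "DEATH", date(1998, 7, 15), "PARENT"),
    ("person_0004", "DEATH", date(2000, 3, 10), "CHILD"),
    ("father_0001", "EMIGRATION", date(1997, 11, 20), "PARENT"),
]


def family_events(child_pnr):
    """Return vital events for a child and its parents, oldest first."""
    mother, father = mfr_parents[child_pnr]
    members = {child_pnr, mother, father}
    events = [row for row in vital_rows if row[0] in members]
    return sorted(events, key=lambda row: row[2])


for pnr, event, event_date, role in family_events("person_0001"):
    print(pnr, event, event_date, role)
# father_0001 EMIGRATION 1997-11-20 PARENT
# person_0001 EMIGRATION 1999-12-01 CHILD
```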

## Output Format

The function returns a Polars DataFrame with the following columns:

- `MATCH_INDEX`: Unique identifier for each case-control group (integer)
- `PNR`: Person identifier (string)
- `ROLE`: Individual role in the match ("case" or "control")
- `INDEX_DATE`: SCD diagnosis date from the case (date)

### Example Output
```
┌─────────────┬─────────────┬─────────┬────────────┐
│ MATCH_INDEX ┆ PNR         ┆ ROLE    ┆ INDEX_DATE │
├─────────────┼─────────────┼─────────┼────────────┤
│ 1           ┆ person_0001 ┆ case    ┆ 1997-01-01 │
│ 1           ┆ person_0002 ┆ control ┆ 1997-01-01 │
│ 1           ┆ person_0003 ┆ control ┆ 1997-01-01 │
│ 2           ┆ person_0004 ┆ case    ┆ 1997-06-15 │
│ 2           ┆ person_0005 ┆ control ┆ 1997-06-15 │
└─────────────┴─────────────┴─────────┴────────────┘
```
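
A quick way to inspect the result is to count controls per `MATCH_INDEX`. The sketch below works on plain tuples copied from the example output rather than on a Polars DataFrame. `controls_per_match` is a hypothetical helper, and it assumes a matched set may hold fewer controls than `matching_ratio`, as the example output suggests.

```python
from datetime import date

# (MATCH_INDEX, PNR, ROLE, INDEX_DATE), copied from the example output.
output_rows = [
    (1, "person_0001", "case", date(1997, 1, 1)),
    (1, "person_0002", "control", date(1997, 1, 1)),
    (1, "person_0003", "control", date(1997, 1, 1)),
    (2, "person_0004", "case", date(1997, 6, 15)),
    (2, "person_0005", "control", date(1997, 6, 15)),
]


def controls_per_match(rows):
    """Map each MATCH_INDEX to its number of controls (0 if none)."""
    counts = {match_index: 0 for match_index, _, role, _ in rows if role == "case"}
    for match_index, _, role, _ in rows:
        if role == "control":
            counts[match_index] = counts.get(match_index, 0) + 1
    return counts


counts = controls_per_match(output_rows)
print(counts)  # {1: 2, 2: 1}

# Matched sets that fall short of a requested ratio of 5 controls per case.
short_sets = [m for m, n in counts.items() if n < 5]
print(short_sets)  # [1, 2]
```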

## Key Features

### Risk-Set Sampling
- **Chronological Processing**: Cases are processed in order of diagnosis date
- **Temporal Validity**: Controls must be eligible (alive, present, and undiagnosed) on the case's diagnosis date
- **No Immortal Time Bias**: People who are diagnosed with SCD later remain eligible as controls for earlier cases, up to their own diagnosis date
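
The idea can be sketched in plain Python. This is an illustration of risk-set sampling in general, not the plugin's Rust implementation: `Person`, `eligible_at`, and `risk_set_match` are hypothetical names, the matching criteria are left out, and the sketch assumes a control may be reused across matched sets. The data combines the example LPR and vital events rows.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    pnr: str
    scd_date: Optional[date] = None   # diagnosis date, None if never diagnosed
    exit_date: Optional[date] = None  # death or emigration, None if not recorded


def eligible_at(person, index_date):
    """True if the person is undiagnosed and still present on index_date."""
    undiagnosed = person.scd_date is None or person.scd_date > index_date
    present = person.exit_date is None or person.exit_date > index_date
    return undiagnosed and present


def risk_set_match(people, matching_ratio, seed=0):
    """Process cases by diagnosis date and sample controls from each risk set."""
    rng = random.Random(seed)
    cases = sorted((p for p in people if p.scd_date is not None),
                   key=lambda p: p.scd_date)
    matches = {}
    for case in cases:
        risk_set = [p for p in people
                    if p.pnr != case.pnr and eligible_at(p, case.scd_date)]
        k = min(matching_ratio, len(risk_set))
        matches[case.pnr] = sorted(p.pnr for p in rng.sample(risk_set, k))
    return matches


people = [
    Person("person_0001", scd_date=date(1997, 6, 15), exit_date=date(1999, 12, 1)),
    Person("person_0002"),
    Person("person_0003", scd_date=date(2001, 3, 22)),
    Person("person_0004", exit_date=date(2000, 3, 10)),
]

print(risk_set_match(people, matching_ratio=5))
# {'person_0001': ['person_0002', 'person_0003', 'person_0004'],
#  'person_0003': ['person_0002']}
```

Here `person_0003` is a valid control for `person_0001` in 1997 because the later diagnosis in 2001 has not happened yet, while `person_0004` drops out of the 2001 risk set after dying in 2000.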

### Matching Criteria
- **Birth Date Window**: Match controls within specified days of case birth date
- **Parent Birth Dates**: Optional matching on parental birth dates with configurable windows
- **Parity Matching**: Optional matching on birth order
- **Vital Status**: Optional incorporation of death/emigration events
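
As a worked example of the birth date window and parity checks, the sketch below applies them to the three rows of the example MFR table. `within_window` and `is_candidate` are hypothetical helpers, and treating the window as inclusive is an assumption.

```python
from datetime import date


def within_window(a, b, window_days):
    """True if two dates are at most window_days apart (inclusive, assumed)."""
    return abs((a - b).days) <= window_days


def is_candidate(case, control, birth_date_window_days=30, match_parity=True):
    """Apply the birth date window and, optionally, the parity check."""
    if not within_window(case["FOEDSELSDATO"], control["FOEDSELSDATO"],
                         birth_date_window_days):
        return False
    if match_parity and case["PARITET"] != control["PARITET"]:
        return False
    return True


p1 = {"PNR": "person_0001", "FOEDSELSDATO": date(1995, 1, 15), "PARITET": 1}
p2 = {"PNR": "person_0002", "FOEDSELSDATO": date(1995, 2, 20), "PARITET": 2}
p3 = {"PNR": "person_0003", "FOEDSELSDATO": date(1995, 3, 10), "PARITET": 1}

# person_0002 is born 36 days after person_0001, person_0003 is born 54 days after.
print(is_candidate(p1, p2))                             # False: outside 30 days
print(is_candidate(p1, p3, birth_date_window_days=60))  # True: 54 days, same parity
print(is_candidate(p1, p2, birth_date_window_days=60))  # False: parity differs
print(is_candidate(p1, p2, birth_date_window_days=60, match_parity=False))  # True
```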

### Performance
- **Rust Implementation**: High-performance core algorithms
- **Polars Integration**: Seamless integration with Polars DataFrames
- **Memory Efficient**: Optimized for large datasets

## Parameters

- `matching_ratio`: Number of controls per case (default: 5)
- `birth_date_window_days`: Maximum birth date difference in days (default: 30)
- `parent_birth_date_window_days`: Maximum parent birth date difference in days (default: 365)
- `match_parent_birth_dates`: Enable parent birth date matching (default: True)
- `match_mother_birth_date_only`: Match only maternal birth dates (default: False)
- `require_both_parents`: Require both parents for matching (default: False)
- `match_parity`: Enable parity matching (default: True)
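
The parent-related flags could interact roughly as in the sketch below. This is one possible reading of the parameter descriptions, not the plugin's confirmed behavior: `parents_match` is a hypothetical helper, and the handling of missing parent dates (skipped unless `require_both` is set) is an assumption. The dates come from the example MFR table.

```python
from datetime import date


def parents_match(case, control, window_days=365,
                  mother_only=False, require_both=False):
    """Compare parental birth dates under one assumed reading of the flags."""
    fields = ["MODER_FOEDSELSDATO"]
    if not mother_only:
        fields.append("FADER_FOEDSELSDATO")
    for field in fields:
        a, b = case.get(field), control.get(field)
        if a is None or b is None:
            if require_both:
                return False  # assumption: a missing date blocks the match
            continue          # assumption: otherwise the check is skipped
        if abs((a - b).days) > window_days:
            return False
    return True


p2 = {"MODER_FOEDSELSDATO": date(1972, 11, 15),
      "FADER_FOEDSELSDATO": date(1969, 5, 3)}
p3 = {"MODER_FOEDSELSDATO": date(1973, 8, 7),
      "FADER_FOEDSELSDATO": date(1971, 12, 25)}

# Mothers are born less than a year apart, fathers more than two years apart.
print(parents_match(p2, p3))                    # False
print(parents_match(p2, p3, mother_only=True))  # True
```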

## License

This project is licensed under the MIT License.

