Metadata-Version: 2.4
Name: steer-learn
Version: 0.1.0
Summary: scikit-learn extension for Markov models
Author-email: Miguel <miguel.melo@driverevel.com>
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: scikit-learn
Requires-Dist: numpy

# steer_learn

`steer_learn` is a lightweight **scikit-learn-compatible extension** for transition-based and sequence-aware learning.
It provides estimators for **absorbing Markov chains** and **hidden Markov models (HMMs)**, plus small preprocessing utilities for path-like sequential data.

The package is designed to feel familiar to scikit-learn users:

- estimators inherit from `BaseEstimator`
- predictive models follow the `fit` / `predict` / `predict_proba` API
- transformers implement `fit` / `transform`
- components can be composed in sklearn-style workflows wherever their input/output contracts match the task

## Why `steer_learn`?

Many real-world systems are naturally described as **state transitions**:

- navigation funnels
- user journey analysis
- workflow completion paths
- process mining
- discrete state machines
- sequence classification with previous-state context

`steer_learn` focuses on these settings with a simple API and minimal dependencies.

## Features

### Estimators

- **`MarkovAbsorbingModel`**  
  Learns transition probabilities for a Markov chain with absorbing states and predicts the most likely absorbing outcome.

- **`HiddenMarkovModel`**  
  A discrete-emission hidden Markov model classifier that combines transition probabilities with per-dimension categorical emission probabilities.

- **`GaussianHiddenMarkovModel`**  
  A continuous-emission hidden Markov model classifier that uses multivariate Gaussian emissions.

### Transformers

- **`PathSplitter`**  
  Splits string paths into tokenized state sequences.

- **`PathFlanker`**  
  Prefixes each sequence with a start token and appends the aligned target value from `y`, making it a **supervised sequence-preparation transformer**.

- **`TransitionRoller`**  
  Converts sequences into pairwise transitions.

## Installation

Install with `pip` (for local development, use `pip install -e .` from a checkout of the repository instead):

```bash
pip install steer_learn
```

## Quick start

### 1. Absorbing Markov chain classification

Use `MarkovAbsorbingModel` when each sample represents a transition from one state to the next, encoded as one-hot vectors.

```python
import numpy as np
from steer_learn import MarkovAbsorbingModel

# 3 states: A, B, C
# X = source state, y = target state
X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
])

y = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

clf = MarkovAbsorbingModel()
clf.fit(X, y)

# Predict the final absorbing destination
clf.predict(np.array([
    [1, 0, 0],
    [0, 1, 0],
]))

# Probability of each absorbing state
clf.predict_proba(np.array([
    [1, 0, 0],
    [0, 1, 0],
]))
```

### 2. Discrete hidden Markov model

Use `HiddenMarkovModel` when:

- the **first column** of `X` is the previous state
- the remaining columns are **discrete observed symbols**
- `y` is the current hidden state label

```python
import numpy as np
from steer_learn import HiddenMarkovModel

# X columns: [previous_state, symbol_1, symbol_2]
X = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [1, 0, 2],
    [1, 0, 1],
])

y = np.array([0, 0, 1, 1])

clf = HiddenMarkovModel()
clf.fit(X, y)

pred = clf.predict(X)
proba = clf.predict_proba(X)
```

### 3. Gaussian hidden Markov model

Use `GaussianHiddenMarkovModel` when observations are continuous.

```python
import numpy as np
from steer_learn import GaussianHiddenMarkovModel

# X columns: [previous_state, feature_1, feature_2]
X = np.array([
    [0, 0.2, 1.1],
    [0, 0.1, 0.9],
    [1, 2.2, 3.0],
    [1, 2.0, 2.8],
])

y = np.array([0, 0, 1, 1])

clf = GaussianHiddenMarkovModel()
clf.fit(X, y)

pred = clf.predict(X)
proba = clf.predict_proba(X)
```

## API design

`steer_learn` follows the core conventions recommended for scikit-learn-compatible extensions:

- **Estimator interface**: predictive models implement `fit`, `predict`, and, where available, `predict_proba`.
- **Transformer interface**: preprocessing components implement `fit` and `transform`.
- **Composable objects**: estimators inherit from `BaseEstimator`, which provides parameter inspection utilities such as `get_params` and `set_params`.
- **Return `self` from `fit`**: all fitted estimators return the estimator instance.
- **NumPy-first inputs**: examples use NumPy arrays and sklearn-style tabular inputs.

This makes the project easier to understand for users already familiar with the scikit-learn ecosystem and simplifies future integration with tooling such as pipelines, search utilities, and model evaluation helpers.

## Available objects

### Top-level imports

```python
from steer_learn import (
    MarkovAbsorbingModel,
    HiddenMarkovModel,
    GaussianHiddenMarkovModel,
)
```

### Transformer utilities

```python
from steer_learn.transformer import (
    PathSplitter,
    PathFlanker,
    TransitionRoller,
)
```

## Input conventions

### `MarkovAbsorbingModel`

- `X`: 2D array of one-hot encoded **source** states
- `y`: 2D array of one-hot encoded **target** states
- `predict(X)`: returns the most likely absorbing state index
- `predict_proba(X)`: returns absorbing-state probabilities
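
The absorbing-state probabilities described above correspond to the standard absorbing-chain calculation. The following is a minimal NumPy sketch of that textbook computation, not necessarily how `MarkovAbsorbingModel` is implemented. The helper name `absorption_probabilities` is hypothetical, and treating any state with no observed outgoing transitions as absorbing is an assumption made for illustration.

```python
import numpy as np


def absorption_probabilities(X, y):
    """Textbook absorbing-chain computation from one-hot transition pairs.

    Hypothetical helper for illustration; not part of steer_learn.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # counts[i, j] = number of observed transitions i -> j
    counts = X.T @ y
    row_sums = counts.sum(axis=1)

    # Assumption: a state with no observed outgoing transitions is absorbing.
    absorbing = np.flatnonzero(row_sums == 0)
    transient = np.flatnonzero(row_sums > 0)

    # Row-normalise the transient rows into transition probabilities.
    P = counts[transient] / row_sums[transient, None]
    Q = P[:, transient]  # transient -> transient block
    R = P[:, absorbing]  # transient -> absorbing block

    # B = (I - Q)^-1 R: probability of ending in each absorbing state.
    B = np.linalg.solve(np.eye(len(transient)) - Q, R)
    return transient, absorbing, B


# Same toy data as the quick start: A -> B and B -> C.
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]])
y = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
transient, absorbing, B = absorption_probabilities(X, y)
```

Each row of `B` corresponds to a transient state and each column to an absorbing state, so taking the `argmax` along a row gives the most likely terminal outcome for that starting state.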

### `HiddenMarkovModel`

- `X[:, 0]`: previous state index
- `X[:, 1:]`: discrete emission symbols
- `y`: target state index
- `predict_proba(X)`: returns unnormalized state scores from transition × emission terms
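
Because the scores are unnormalized, their rows do not have to sum to 1. The sketch below illustrates one way a transition × emission score could be formed and then row-normalized. It uses plain maximum-likelihood counts without smoothing, and all variable names (`trans`, `emis`, `scores`) are hypothetical; the library's internals may differ.

```python
import numpy as np

# Same toy data as the quick start: columns are [previous_state, symbol_1, symbol_2].
X = np.array([[0, 1, 2], [0, 1, 1], [1, 0, 2], [1, 0, 1]])
y = np.array([0, 0, 1, 1])

n_states = int(max(X[:, 0].max(), y.max())) + 1
n_symbols = int(X[:, 1:].max()) + 1
n_dims = X.shape[1] - 1

# trans[p, s] estimates P(state s | previous state p) from counts.
# Assumption: every previous state appears at least once (no zero rows).
trans = np.zeros((n_states, n_states))
np.add.at(trans, (X[:, 0], y), 1)
trans /= trans.sum(axis=1, keepdims=True)

# emis[d, s, k] estimates P(symbol k in dimension d | state s).
emis = np.zeros((n_dims, n_states, n_symbols))
for d in range(n_dims):
    np.add.at(emis[d], (y, X[:, d + 1]), 1)
    emis[d] /= emis[d].sum(axis=1, keepdims=True)

# Unnormalised score per sample and state: transition term times emission terms.
scores = trans[X[:, 0]]
for d in range(n_dims):
    scores = scores * emis[d][:, X[:, d + 1]].T

# Row-normalise when values that sum to 1 are needed.
proba = scores / scores.sum(axis=1, keepdims=True)
pred = scores.argmax(axis=1)
```

Normalizing does not change the `argmax`, so predictions are the same either way; it only matters when the scores are compared across samples or fed into metrics that expect probabilities.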

### `GaussianHiddenMarkovModel`

- `X[:, 0]`: previous state index
- `X[:, 1:]`: continuous-valued observation vector
- `y`: target state index
- `predict_proba(X)`: returns Gaussian emission density × transition scores
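
As a rough illustration of a "Gaussian emission density × transition" score, the sketch below fits one mean and covariance per state and multiplies the resulting density by a count-based transition probability. This is an assumption-laden sketch rather than the library's implementation: the helper `gaussian_pdf`, the ridge term added to the covariance, and the variable names are all hypothetical.

```python
import numpy as np

# Same toy data as the quick start: columns are [previous_state, feature_1, feature_2].
X = np.array([[0, 0.2, 1.1], [0, 0.1, 0.9], [1, 2.2, 3.0], [1, 2.0, 2.8]])
y = np.array([0, 0, 1, 1])

prev = X[:, 0].astype(int)
obs = X[:, 1:]
n_states = int(max(prev.max(), y.max())) + 1
n_features = obs.shape[1]

# trans[p, s] estimates P(state s | previous state p) from counts.
trans = np.zeros((n_states, n_states))
np.add.at(trans, (prev, y), 1)
trans /= trans.sum(axis=1, keepdims=True)


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x."""
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T
    exponent = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(cov))
    return np.exp(exponent) / norm


scores = np.zeros((len(X), n_states))
for s in range(n_states):
    members = obs[y == s]
    mean = members.mean(axis=0)
    # Assumption: a small ridge keeps the covariance invertible on tiny samples.
    cov = np.cov(members, rowvar=False) + 1e-3 * np.eye(n_features)
    scores[:, s] = trans[prev, s] * gaussian_pdf(obs, mean, cov)

pred = scores.argmax(axis=1)
```

Densities can exceed 1, so these scores should be read as relative evidence for each state rather than as probabilities.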

### `PathSplitter`

- `X`: iterable of path strings
- output: tokenized sequences split on `sep`
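
To make the expected output concrete, here is a plain-Python sketch of the same idea. `split_paths` is a hypothetical stand-in, and trimming whitespace around each token is an assumption; `PathSplitter` itself may or may not do this.

```python
def split_paths(paths, sep="->"):
    """Split each path string on `sep` and trim whitespace around tokens.

    Hypothetical stand-in for illustration; not part of steer_learn.
    """
    return [[token.strip() for token in path.split(sep)] for path in paths]


tokens = split_paths(["Landing -> Signup", "Landing -> Pricing"])
# tokens == [["Landing", "Signup"], ["Landing", "Pricing"]]
```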

### `PathFlanker`

- parameter: `start_state="<START>"`
- `X`: iterable of tokenized sequences
- `y`: iterable of aligned target labels
- output: each sample becomes `[start_state] + list(x) + [target]`

Example:

```python
from steer_learn.transformer import PathFlanker

X = [["Landing", "Signup"], ["Landing", "Pricing"]]
y = ["Activated", "Churned"]

flanker = PathFlanker(start_state="<START>")
X_flanked = flanker.fit_transform(X, y)
# ["<START>", "Landing", "Signup", "Activated"]
# ["<START>", "Landing", "Pricing", "Churned"]
```

### `TransitionRoller`

- `X`: iterable of sequences
- output: 2-column transition pairs extracted from consecutive positions
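
The pairing of consecutive positions can be sketched in plain Python as follows. `roll_transitions` is a hypothetical helper, and returning a flat list of `(source, target)` tuples is an assumption; the exact container `TransitionRoller` returns may differ.

```python
def roll_transitions(sequences):
    """Return (source, target) pairs for every consecutive pair of states.

    Hypothetical stand-in for illustration; not part of steer_learn.
    """
    pairs = []
    for seq in sequences:
        seq = list(seq)
        # zip of the sequence against itself shifted by one position
        pairs.extend(zip(seq[:-1], seq[1:]))
    return pairs


pairs = roll_transitions([["<START>", "Landing", "Signup", "Activated"]])
# pairs == [("<START>", "Landing"), ("Landing", "Signup"), ("Signup", "Activated")]
```

A sequence of length `n` yields `n - 1` pairs, so a single-token sequence contributes nothing.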


## Use cases

`steer_learn` is a good fit for problems such as:

- predicting terminal states in a process
- estimating next-state probabilities
- modeling user or system trajectories
- sequence-aware classification with previous-state context
- transforming string-based paths into transition datasets
- preparing supervised transition sequences where the final label is appended to the observed path

## Notes on `PathFlanker`

`PathFlanker` behaves differently from a typical unsupervised preprocessing transformer:

- it uses `y` during `transform`, not just during `fit`
- it appends the aligned target label to the end of each sequence
- it is best understood as a **training-data preparation step** for labeled sequence problems

This is still compatible with the general scikit-learn estimator style, but it is less conventional than a pure `transform(X)` preprocessing step. In particular, scikit-learn's `Pipeline` calls `transform(X)` without `y` at prediction time, so `PathFlanker` is most useful in custom preprocessing workflows, dataset construction code, or supervised sequence pipelines where `y` is intentionally available at transform time.

## Example workflow for path data

A typical workflow for journey or state-path data may look like this:

```python
from steer_learn.transformer import PathSplitter, PathFlanker, TransitionRoller

paths = [
    "Landing -> Signup",
    "Landing -> Pricing",
]
labels = ["Activated", "Churned"]

splitter = PathSplitter(sep="->")
flanker = PathFlanker(start_state="<START>")
roller = TransitionRoller()

X = splitter.fit_transform(paths)
X = flanker.fit_transform(X, labels)
transitions = roller.transform(X)
```

This produces transitions over sequences that begin with `<START>` and end with the aligned target label, which is useful when the target should be modeled as the final state in the path.
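
To feed such transitions into an estimator that expects one-hot source and target states, the state names need a fixed index. The sketch below shows one way to do that with NumPy. It assumes the transitions are available as `(source, target)` pairs, which may not match the exact output format of `TransitionRoller`, and the names `states`, `index`, `X_onehot`, and `y_onehot` are illustrative.

```python
import numpy as np

# Assumption: transitions arrive as (source, target) pairs.
transitions = [
    ("<START>", "Landing"), ("Landing", "Signup"), ("Signup", "Activated"),
    ("<START>", "Landing"), ("Landing", "Pricing"), ("Pricing", "Churned"),
]

# Fix a stable state order so column indices are reproducible.
states = sorted({state for pair in transitions for state in pair})
index = {state: i for i, state in enumerate(states)}

# Each row of the identity matrix is the one-hot vector for one state.
eye = np.eye(len(states), dtype=int)
X_onehot = eye[[index[src] for src, _ in transitions]]
y_onehot = eye[[index[dst] for _, dst in transitions]]
```

The resulting arrays follow the one-hot source/target layout described for `MarkovAbsorbingModel` under "Input conventions". Keeping `states` around makes it possible to map predicted column indices back to state names.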

## Notes and current scope

This project is intentionally compact and focused. It currently emphasizes:

- educational clarity
- sklearn-style APIs
- transition-based modeling primitives
- simple NumPy-backed implementations


## Contributing

Contributions are welcome. A good contribution should preserve the project's core design goals:

- keep the API intuitive for scikit-learn users
- document estimator inputs and outputs clearly
- prefer readable, testable implementations
- maintain consistent naming and import patterns


---

If you use `steer_learn` in research, analytics, or production experiments, consider documenting the exact state encoding, label semantics, and sequence assumptions used in your preprocessing pipeline.
