Metadata-Version: 2.4
Name: maldi-tof-classifier
Version: 0.1.0
Summary: The maldi_tof_classifier package offers a CLI and Python 3 API for machine learning based classification of MALDI TOF spectra as measured by a Shimadzu 8300 MALDI-TOF mass spectrometer.
License: MIT License
        
        Copyright (c) 2026 Oliver Felix Matthias Klein
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: imbalanced-learn>=0.0
Requires-Dist: joblib>=1.5.3
Requires-Dist: keras>=3.12.1
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: numpy>=2.2.6
Requires-Dist: pandas>=2.3.3
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: scipy>=1.15.3
Requires-Dist: seaborn>=0.13.2
Requires-Dist: tensorflow>=2.21.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: typer>=0.24.1
Requires-Dist: xgboost>=3.1.3
Dynamic: license-file

# maldi_tof_classifier

The **maldi_tof_classifier** package provides functionality for:
- Reading MALDI-TOF spectra
- Preprocessing spectral data
- Machine learning–based classification

It is designed for spectra generated by a **Shimadzu 8300 MALDI-TOF mass spectrometer**.

**License:** MIT

---

## Installation

    pip install maldi-tof-classifier

---

## Source Code

The source code of the project is available at: [GitHub: maldi-tof-classifier](https://github.com/ofmk94/maldi-tof-classifier)

---

## Overview

This README consists of two parts:

1. **CLI Tool Usage**
2. **Python API & Typical Workflows**

---

# Part 1 — CLI Tool Usage

# MALDI-TOF Classifier

**Version: 0.1.0**

The **MALDI-TOF Classifier** is a Typer-based CLI tool for classifying MALDI-TOF spectra as measured by a Shimadzu 8300 mass spectrometer.

The tool trains a pipeline model consisting of the following components:

- **Classifier**: a classification model instance
- **Scaler**: an instance of a data scaling model, e.g., `StandardScaler` *(optional)*
- **Mapper**: an instance of a spectral preprocessing model *(optional)*
- **PCA**: an instance of a PCA transformation model *(optional)*

---

## Requirements & Installation

The tool requires **Python ≥ 3.10**.

It is available as a public PyPI package and can be installed via:

    pip install maldi-tof-classifier

---

## Required Directory Structure

To use the tool, the following folder and file structure must be present:

    data_train/
        class_1/
            measurement1_class1.txt
        class_2/
            measurement1_class2.txt
        ...

    data_predict/
        measurement1.txt
        measurement2.txt

    cli_files/
        config.yaml

- `data_train`: contains subfolders for each class (`class_1`, `class_2`, etc.)
  - Each subfolder must directly contain the measurement text files
- `data_predict`: contains the spectra to be classified
- `cli_files`: must contain at least the file `config.yaml`

---

## Configuration

Model training parameters can be defined in `config.yaml`.

### Available settings

- **test_size**
  Fraction of training data used as test set.
  Float between 0.0 and 1.0.
  Default: `0.25`

- **classifier_cls**
  Instantiable classifier class.

  Available models:

  - From *scikit-learn*:
    `LogisticRegression`, `LinearDiscriminantAnalysis`, `QuadraticDiscriminantAnalysis`, `SVC`, `RandomForestClassifier`
  - From *xgboost*:
    `XGBClassifier`

  Default: `RandomForestClassifier`

- **classifier_params**
  Model-specific parameters.
  See official documentation of *scikit-learn* or *xgboost*.
  Default: `null`

- **scaler_cls**
  Optional scaling of spectral data.

  Available scalers (from *scikit-learn*):
  - `StandardScaler`
  - `MinMaxScaler`

  Default: `null`

- **mapper_cls**
  Instantiable mapper class for preprocessing input spectra:

  - `null`: raw data is used directly
  - `BinMapper`: maps measurement points into aggregated bins
  - `PeakMapper`: clusters peaks into consensus clusters and maps spectra onto them

  Default: `PeakMapper`

- **mapper_params** *(advanced use case)*
  Dictionary of keyword arguments passed to the mapper.

  **For `PeakMapper`:**

  - `aggr_mode`
    Mode of reducing values corresponding to a cluster.
    Possible values: `max`, `mean`, `sum`
    Default: `max`

  - `peak_width`
    Assumed maximum width (in indices) of a peak in the spectrum.
    Integer. Default: `10`

  - `peak_height`
    Assumed minimum height (intensity) of a peak.
    Float. Default: `0.5`

  - `peak_prominence`
    Minimum prominence for a signal to be identified as a peak.
    Float. Default: `0.1`

  - `merge_dist`
    Distance threshold (in Da) for merging peaks assumed to represent the same signal.
    Float. Default: `0.5`

  - `freq_thresh`
    Minimum relative frequency for a peak cluster to be considered a consensus cluster.
    Float. Default: `0.5`

  **For `BinMapper`:**

  - `mz_cutoff`
    Cutoff value for the m/z axis (in Da).
    Float. Default: `null`

  - `binning_distance`
    Size of a bin in Da.
    Float. Default: `0.5`

  - `aggr_mode`
    Mode of reducing values within a bin.
    Possible values: `max`, `mean`, `sum`
    Default: `max`

- **use_pca**
  Enable PCA transformation.
  Default: `null`

- **pca_n_components**
  Number of PCA components (only used if PCA is enabled).
  Integer. Default: `null`

---

All parameters are optional.
As long as a (possibly empty) `cli_files/config.yaml` exists, the tool can be used with default parameters.
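
For orientation, a `config.yaml` overriding a few of the defaults above might look like the following. The key names are the setting names documented above; the exact value syntax (e.g. class names given as plain strings) is an assumption, so consult the sample config in the example-data repository:

```yaml
test_size: 0.2
classifier_cls: RandomForestClassifier
classifier_params:
  n_estimators: 200
scaler_cls: StandardScaler
mapper_cls: PeakMapper
mapper_params:
  aggr_mode: max
  merge_dist: 0.5
use_pca: true
pca_n_components: 20
```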

---

## Example Data

Example data, including a sample `config.yaml`, is available at:

[GitHub: maldi-tof-classifier-data](https://github.com/ofmk94/maldi-tof-classifier-data)

---

## CLI Commands

### `mtc train`

- Trains the pipeline model using:
  - data from `data_train`
  - parameters defined in `cli_files/config.yaml`

- Saves the trained pipeline to:

      cli_files/pipeline.joblib

- Writes performance metrics to:

      cli_files/performance_scores.txt

Metrics include:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion matrix

---

### `mtc predict`

- Predicts classes for files in `data_predict`

- Writes results to:

      cli_files/predictions.csv

Output includes:
- predicted class
- corresponding filename
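
For illustration, the resulting CSV pairs each input file with its predicted class; the exact column names and ordering below are an assumption:

```
filename,predicted_class
measurement1.txt,class_1
measurement2.txt,class_2
```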

---

# Part 2 — Python API & Workflow

Example files generated by a Shimadzu 8300 mass spectrometer can be found at [GitHub: maldi-tof-classifier-data](https://github.com/ofmk94/maldi-tof-classifier-data).

---

## Step 1 — Extract Raw Spectra

### Training Data Mode

Training data must be structured as:


    data_train/
        class_1/
            measurement1_class1.txt
        class_2/
            measurement1_class2.txt
        ...

- Folder names correspond to class labels
- Spot names (sample locations) are extracted via regular expressions from filenames
  → If extraction fails: `"n/a"`
- The m/z axis (in Da) is extracted from the first valid measurement file

Code example:

    from pathlib import Path
    from maldi_tof_classifier.core import RawSpectraExtractor

    TRAIN_DIR = Path(".") / "data_train"

    extractor = RawSpectraExtractor()

    spectra, class_labels, spots, mz_axis = extractor.extract_train_data(TRAIN_DIR)

---

### Prediction Mode

Prediction data must be structured as:

    data_predict/
        measurement1.txt
        measurement2.txt


Only spectra and filenames are extracted (used later for output referencing):

    from pathlib import Path
    from maldi_tof_classifier.core import RawSpectraExtractor

    PRED_DIR = Path(".") / "data_predict"

    extractor = RawSpectraExtractor()

    spectra, filenames = extractor.extract_predict_data(PRED_DIR)

---

## Step 2 — Train/Test Split & Label Encoding

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        spectra, class_labels, test_size=0.25
    )

Encode labels:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()

    y_train = le.fit_transform(y_train)
    y_test = le.transform(y_test)

---

## Step 3 — Handle Class Imbalance (Optional)

    from imblearn.over_sampling import SMOTE

    smote = SMOTE()

    X_train, y_train = smote.fit_resample(X_train, y_train)

---

## Step 4 — Scaling (Optional)

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

---

## Step 5 — Spectral Preprocessing (Optional)

### BinMapper

Maps spectra into bins along the m/z axis.

- `mz_axis`: extracted earlier
- `mz_cutoff`: float or None (default: None)
- `binning_distance`: float (default: 0.5 Da)
- `aggr_mode`: `max`, `mean`, `sum` (default: `max`)

Code example:

    from maldi_tof_classifier.core.mapper import BinMapper

    mapper = BinMapper(mz_axis=mz_axis)

    X_train = mapper.fit_transform(X_train)
    X_test = mapper.transform(X_test)

---

### PeakMapper

Extracts peaks and builds consensus peak regions across spectra.

**Parameters:**

- `mz_axis`: m/z axis
- `aggr_mode`: `max`, `mean`, `sum` (default: `max`)
- `peak_width`: int (default: 10)
- `peak_height`: float (default: 0.5)
- `peak_prominence`: float (default: 0.1)
- `merge_dist`: float (default: 0.5)
- `freq_thresh`: float (default: 0.5)

Code example:

    from maldi_tof_classifier.core.mapper import PeakMapper

    mapper = PeakMapper(mz_axis=mz_axis)

    X_train = mapper.fit_transform(X_train)
    X_test = mapper.transform(X_test)

---

## Step 6 — Dimensionality Reduction (Optional)

    from sklearn.decomposition import PCA

    pca = PCA(n_components=20)

    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)

---

## Step 7 — Classification

### RandomForestClassifier

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier()

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

---

### XGBClassifier

    from xgboost import XGBClassifier

    classifier = XGBClassifier()

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)
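
Since the labels were integer-encoded in Step 2, predictions come back as integers; `LabelEncoder.inverse_transform` recovers the original class names. A self-contained sketch with a toy encoder (in practice, reuse the `le` fitted in Step 2):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# stand-in for the encoder fitted in Step 2
le = LabelEncoder()
le.fit(["class_1", "class_2"])

# stand-in for encoded predictions from the classifier
y_pred = np.array([1, 0, 1])

# map encoded predictions back to the original class names
y_pred_labels = le.inverse_transform(y_pred)
print(list(y_pred_labels))  # ['class_2', 'class_1', 'class_2']
```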

---

## Neural Network Models

Available in `maldi_tof_classifier.nn`:

- `CNN1DClassifier`
- `LSTMClassifier`

Both are factory functions that return compiled, ready-to-train Keras models.

---

### Train / Validation / Test Split

    from sklearn.model_selection import train_test_split

    X_train, X_val_test, y_train, y_val_test = train_test_split(
        spectra, class_labels, test_size=0.3
    )

    X_val, X_test, y_val, y_test = train_test_split(
        X_val_test, y_val_test, test_size=0.333
    )
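
The two-step split above yields roughly a 70/20/10 train/validation/test partition: 30% of the data is held out first, and about one third of that becomes the test set. A quick check of the resulting set sizes:

```python
n = 1000                            # total number of spectra
n_val_test = round(n * 0.3)         # first split holds out 300 samples
n_test = round(n_val_test * 0.333)  # second split: about 100 for testing
n_val = n_val_test - n_test         # about 200 for validation
n_train = n - n_val_test            # 700 remain for training

print(n_train, n_val, n_test)  # 700 200 100
```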

---

### One-Hot Encoding

    from tensorflow.keras.utils import to_categorical

    # labels must already be integer-encoded (see Step 2)
    n_classes = y_train.max() + 1

    y_train = to_categorical(y_train, n_classes)
    y_val = to_categorical(y_val, n_classes)
    y_test = to_categorical(y_test, n_classes)

---

### Example: CNN1DClassifier

    from maldi_tof_classifier.nn import CNN1DClassifier

    model = CNN1DClassifier(X_train, y_train)

    model.fit(
        X_train,
        y_train,
        epochs=20,
        validation_data=(X_val, y_val)
    )

    y_pred = model.predict(X_test)
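
Note that `predict` on a Keras model returns per-class probabilities rather than class indices; `np.argmax` along the class axis recovers the predicted classes (the same applies to the one-hot `y_test` when computing metrics). A minimal illustration:

```python
import numpy as np

# illustrative softmax output for 3 samples and 2 classes
y_pred_proba = np.array([[0.9, 0.1],
                         [0.2, 0.8],
                         [0.4, 0.6]])

# index of the most probable class per sample
y_pred = np.argmax(y_pred_proba, axis=1)
print(y_pred)  # [0 1 1]
```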

---

## Pipeline API

Steps 4–7 can be combined into a pipeline using:

    from maldi_tof_classifier.core import generate_pipeline

---

### Pipeline Components

- Scaler (optional)
- Mapper (optional)
- PCA (optional)
- Classifier (required)

---
### Parameters

- `classifier_cls`
  Instantiable class of the classifier used in the pipeline.
  Must implement `fit(X, y)` and `predict(X)` following Scikit-learn conventions.

- `classifier_params`
  Parameters used to initialize the classifier.
  Passed as `classifier_cls(**classifier_params)`.

- `scaler_cls`
  Instantiable class of the scaler (e.g. `StandardScaler`, `MinMaxScaler`) used in the pipeline.
  Must implement `fit(X, y=None)` and `transform(X)` following Scikit-learn conventions.
  If `None`, no scaling is applied.

- `mapper_cls`
  Instantiable class of the mapper (e.g. `BinMapper`, `PeakMapper`) used in the pipeline.
  Must implement `fit(X, y=None)` and `transform(X)` following Scikit-learn conventions.
  If `None`, no mapping is applied.

- `mapper_params`
  Parameters used to initialize the mapper.
  Passed as `mapper_cls(**mapper_params)`.
  Must be provided if `mapper_cls` is not `None`.
  When using `BinMapper` or `PeakMapper`, `mz_axis` is required.

- `use_pca`
  Specifies whether PCA dimensionality reduction is applied in the pipeline.

- `pca_n_components`
  Number of components used to initialize the PCA object.
  `int | None`, default: `None`.
  Only used if `use_pca=True`.

---

### Example

    from sklearn.preprocessing import StandardScaler
    from maldi_tof_classifier.core.mapper import PeakMapper
    from sklearn.ensemble import RandomForestClassifier

    from maldi_tof_classifier.core import generate_pipeline

    mapper_params = {
        "mz_axis": mz_axis
    }

    pipeline = generate_pipeline(
        classifier_cls=RandomForestClassifier,
        scaler_cls=StandardScaler,
        mapper_cls=PeakMapper,
        mapper_params=mapper_params,
        use_pca=True,
        pca_n_components=20
    )

    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test)
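
A fitted pipeline can be persisted with `joblib`, the same format the CLI writes to `cli_files/pipeline.joblib`, and reloaded later for prediction. A minimal round-trip sketch (a plain dict stands in for the fitted pipeline object):

```python
import joblib

# any picklable object round-trips through joblib; the fitted
# pipeline from the example above is saved the same way
fitted = {"pca_n_components": 20}

joblib.dump(fitted, "pipeline.joblib")
restored = joblib.load("pipeline.joblib")
print(restored == fitted)  # True
```

With the real pipeline, `joblib.dump(pipeline, path)` after `fit` and `pipeline = joblib.load(path)` before `predict` are sufficient.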

---

## Author

Oliver Klein \
oliver.klein@stud.hcw.ac.at \
oliverfmklein@gmail.com

---

## License

This project is licensed under the MIT License.

Copyright (c) 2026 Oliver Felix Matthias Klein (GitHub username: ofmk94)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Disclaimer

This README was generated from the original German raw text, translated into English, stylistically revised, and converted into Markdown format using ChatGPT (Model 5.3, March 2026).

No liability is assumed for the provided software or its contents.

---

_Last edited: March 29th, 2026_



