Metadata-Version: 2.4
Name: tidymut
Version: 0.2.0
Summary: An efficient framework for tidying and standardizing protein mutation data.
Author-email: Yuxiang Tang <845351766@qq.com>
License: BSD 3-Clause License
        
        Copyright (c) 2025, Yuxiang Tang.
        
        Redistribution and use in source and binary forms, with or without
        modification, are permitted provided that the following conditions are met:
        
        1. Redistributions of source code must retain the above copyright notice, this
           list of conditions and the following disclaimer.
        
        2. Redistributions in binary form must reproduce the above copyright notice,
           this list of conditions and the following disclaimer in the documentation
           and/or other materials provided with the distribution.
        
        3. Neither the name of the copyright holder nor the names of its
           contributors may be used to endorse or promote products derived from
           this software without specific prior written permission.
        
        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Project-URL: Repository, https://github.com/xulab-research/TidyMut
Keywords: protein,mutation,tidy,framework,pipeline
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: joblib>=1.5.0
Requires-Dist: numpy>=2.1.0
Requires-Dist: pandas>=2.1.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: python-dateutil>=2.8.2
Requires-Dist: tzdata>=2022.7
Requires-Dist: requests>=2.30
Provides-Extra: test
Requires-Dist: pytest>=8.0.0; extra == "test"
Requires-Dist: pytest-cov>=6.0.0; extra == "test"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: sphinx>=8.0.0; extra == "dev"
Requires-Dist: sphinx-autobuild>=2024.10.0; extra == "dev"
Requires-Dist: sphinx_rtd_theme>=3.0.0; extra == "dev"
Requires-Dist: twine>=6.0.0; extra == "dev"
Requires-Dist: gitpython; extra == "dev"
Requires-Dist: pygithub; extra == "dev"
Requires-Dist: jinja2; extra == "dev"
Dynamic: license-file

# TidyMut

A comprehensive Python package for processing and analyzing biological sequence data with advanced mutation analysis capabilities.

## Overview

TidyMut is designed for bioinformaticians, computational biologists, and researchers working with genetic sequence data. The package streamlines the complex process of cleaning, processing, and analyzing DNA and protein sequences, with specialized tools for mutation analysis and large-scale dataset handling.

### Key Capabilities

- **Sequence Data Processing**: Comprehensive support for DNA and protein sequence operations including complementation, transcription, translation, and validation
- **Advanced Mutation Analysis**: Specialized tools for detecting, analyzing, and characterizing genetic mutations with statistical insights
- **Intelligent Data Cleaning**: Automated preprocessing pipelines that handle common data quality issues in biological datasets
- **Flexible Pipeline Architecture**: Modular design allowing custom workflow creation for specific research needs
- **High-Performance Processing**: Optimized for handling large-scale sequence datasets efficiently

## Installation

### Requirements
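- Python 3.10+
- joblib >= 1.5.0
- numpy >= 2.1.0
- pandas >= 2.1.0
- tqdm >= 4.60.0
- python-dateutil >= 2.8.2
- tzdata >= 2022.7
- requests >= 2.30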

### Install via pip
```bash
pip install tidymut
```

### Development Installation
```bash
git clone https://github.com/xulab-research/TidyMut.git tidymut
cd tidymut
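pip install -e .
# Optional: also install the development extras (tests, docs, release tooling)
# pip install -e ".[dev]"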
```
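
To confirm that the installation succeeded, the package version can be read back from the installed metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of TidyMut.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("tidymut")
print(f"tidymut version: {found}" if found else "tidymut is not installed")
```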

## Quick Start

### Processing cDNAProteolysis Dataset

Here's a complete example demonstrating TidyMut's capabilities with the cDNAProteolysis mutation dataset:

```python
from tidymut import download_cdna_proteolysis_source_file
from tidymut import cdna_proteolysis_cleaner


# Download the source file; "dir_path" and "file_name" are placeholders for your own values
cdna_proteolysis_filepath = download_cdna_proteolysis_source_file("dir_path", "file_name")["filename"]

# Create the cDNAProteolysis cleaning pipeline using TidyMut's default pipeline
cdna_proteolysis_cleaning_pipeline = cdna_proteolysis_cleaner.create_cdna_proteolysis_cleaner(
    cdna_proteolysis_filepath
)

# Clean and process the dataset
cdna_proteolysis_cleaning_pipeline, cdna_proteolysis_dataset = \
    cdna_proteolysis_cleaner.clean_cdna_proteolysis_dataset(cdna_proteolysis_cleaning_pipeline)

# Save the processed dataset
cdna_proteolysis_dataset.save("output/cleaned_cdna_proteolysis_data")
```

### Basic Sequence Operations

```python
from tidymut.sequence import DNASequence, ProteinSequence

# DNA sequence analysis
dna = DNASequence("ATGCGATCGTAGC")
print(f"Complement: {dna.complement()}")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Translation: {dna.translate()}")
```
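
For readers who want to see what complement and reverse complement compute, the following is a minimal pure-Python sketch. It does not use TidyMut: the function names are hypothetical and only mirror the method names in the example above, and TidyMut's own implementation may differ (for example in how it treats ambiguous bases).

```python
# Illustration only: plain-Python equivalents, not TidyMut's implementation.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Swap each base for its Watson-Crick partner (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement the sequence and read it in the opposite direction."""
    return complement(seq)[::-1]


print(complement("ATGCGATCGTAGC"))          # TACGCTAGCATCG
print(reverse_complement("ATGCGATCGTAGC"))  # GCTACGATCGCAT
```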

## Core Features

### Sequence Data Manipulation
- **Sequence Validation**: Automatic detection and correction of common sequence errors
- **Format Conversion**: Seamless conversion between different sequence formats
- **Batch Processing**: Efficient handling of large sequence collections

### Mutation Analysis
- **Mutation Detection**: Automated identification of point mutations, insertions, and deletions
- **Statistical Analysis**: Comprehensive mutation frequency and distribution statistics
- **Visualization Tools**: Built-in plotting functions for mutation landscapes

### Data Cleaning & Preprocessing
- **Standardization**: Consistent sequence formatting and annotation
- **Duplicate Removal**: Intelligent handling of redundant sequences

### Pipeline Architecture
- **Modular Design**: Mix and match processing components
- **Parallel Processing**: Multi-core support for large datasets
- **Progress Tracking**: Real-time processing status and logging
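
The modular design can be pictured as a chain of plain functions applied to a dataset one after another. The sketch below is a simplified stand-in written for illustration; the `MiniPipeline` class, its attributes, and the two step functions are hypothetical and are not TidyMut's `Pipeline` API.

```python
from typing import Any, Callable, List


class MiniPipeline:
    """Hypothetical, simplified chainable pipeline (not TidyMut's Pipeline)."""

    def __init__(self, data: Any, name: str) -> None:
        self.data = data
        self.name = name
        self.steps: List[str] = []  # names of the steps run so far

    def then(self, func: Callable[..., Any], **kwargs: Any) -> "MiniPipeline":
        """Apply `func` to the current data and return self for chaining."""
        self.data = func(self.data, **kwargs)
        self.steps.append(func.__name__)
        return self


def drop_missing(rows: list, key: str) -> list:
    """Keep only rows whose `key` value is not the "-" placeholder."""
    return [row for row in rows if row[key] != "-"]


def to_float(rows: list, key: str) -> list:
    """Convert the `key` value of every row to float."""
    return [{**row, key: float(row[key])} for row in rows]


rows = [{"ddG": "1.5"}, {"ddG": "-"}, {"ddG": "-0.3"}]
result = MiniPipeline(rows, "demo").then(drop_missing, key="ddG").then(to_float, key="ddG")
print(result.data)   # [{'ddG': 1.5}, {'ddG': -0.3}]
print(result.steps)  # ['drop_missing', 'to_float']
```

Each step only needs to accept the current data as its first argument, which is what lets components be mixed and reordered freely.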

## Examples and Use Cases

### Custom Processing Pipeline
```python
import pandas as pd

from tidymut.cleaners.basic_cleaners import (
    extract_and_rename_columns,
    filter_and_clean_data,
    convert_data_types,
    validate_mutations,
    infer_wildtype_sequences,
    convert_to_mutation_dataset_format,
)
from tidymut.core.dataset import MutationDataset
from tidymut.core.pipeline import create_pipeline

dataset = pd.read_csv("path/to/Tsuboyama2023_Dataset2_Dataset3_20230416.csv")

pipeline = create_pipeline(dataset, "cdna_proteolysis_cleaner")
clean_result = (
    pipeline.then(
        extract_and_rename_columns,
        column_mapping={
            "WT_name": "name",
            "aa_seq": "mut_seq",
            "mut_type": "mut_info",
            "ddG_ML": "ddG",
        },
    )
    .then(filter_and_clean_data, filters={"ddG": lambda x: x != "-"})
    .then(convert_data_types, type_conversions={"ddG": "float"})
    .then(
        validate_mutations,
        mutation_column="mut_info",
        mutation_sep="_",
        is_zero_based=False,
        num_workers=16,
    )
    .then(
        infer_wildtype_sequences,
        label_columns=["ddG"],
        handle_multiple_wt="error",
        is_zero_based=True,
        num_workers=16,
    )
    .then(
        convert_to_mutation_dataset_format,
        name_column="name",
        mutation_column="mut_info",
        mutated_sequence_column="mut_seq",
        score_column="ddG",
        is_zero_based=True,
    )
)
cdna_proteolysis_dataset_df, cdna_proteolysis_ref_seq = clean_result.data
cdna_proteolysis_dataset = MutationDataset.from_dataframe(
    cdna_proteolysis_dataset_df, cdna_proteolysis_ref_seq
)

# Get execution summary
execution_info = pipeline.get_execution_summary()

# Access artifacts
artifacts = pipeline.artifacts

# Save pipeline state
pipeline.save_structured_data("cdna_proteolysis_cleaner_pipeline.pkl")
```
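
The `mutation_sep` and `is_zero_based` arguments in the pipeline above describe how mutation strings are written. To illustrate what those two settings mean, here is a plain-Python sketch that splits a multi-mutation string and checks each mutation against a wild-type sequence. It assumes the common `<wild-type residue><position><mutant residue>` notation; the helper `check_mutations` and the example sequence are hypothetical, and this is not how `validate_mutations` is implemented.

```python
import re

# Assumed notation: wild-type residue, position, mutant residue (e.g. "K2R").
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def check_mutations(wt_seq: str, mut_info: str, sep: str = "_",
                    is_zero_based: bool = False) -> list:
    """Return (index, wt, mut) tuples with 0-based indices, or raise ValueError."""
    parsed = []
    for token in mut_info.split(sep):
        match = _MUTATION.match(token)
        if match is None:
            raise ValueError(f"Unrecognized mutation: {token!r}")
        wt, pos, mut = match.group(1), int(match.group(2)), match.group(3)
        index = pos if is_zero_based else pos - 1  # normalize to 0-based
        if not 0 <= index < len(wt_seq) or wt_seq[index] != wt:
            raise ValueError(f"{token!r} does not match the wild-type sequence")
        parsed.append((index, wt, mut))
    return parsed


print(check_mutations("MKTAY", "K2R_A4G"))                      # [(1, 'K', 'R'), (3, 'A', 'G')]
print(check_mutations("MKTAY", "K1R_A3G", is_zero_based=True))  # [(1, 'K', 'R'), (3, 'A', 'G')]
```

Reading a one-based string as zero-based (or the reverse) shifts every position by one, so the wild-type residue check usually fails immediately, which is a quick way to spot a wrong indexing setting.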

## Citation

If you use TidyMut in your research, please cite:

```bibtex
@software{tidymut,
  title={TidyMut: A Python Package for Biological Sequence Data Processing},
  author={Yuxiang Tang and Contributors},
  year={2025},
  url={https://github.com/xulab-research/TidyMut}
}
```

## License

This project is licensed under the BSD 3-Clause License - see the [LICENSE](LICENSE) file for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/xulab-research/tidymut/issues)
- **Discussions**: [GitHub Discussions](https://github.com/xulab-research/tidymut/discussions)
