Metadata-Version: 2.4
Name: custom-llm-eval
Version: 0.1.0
Summary: A comprehensive framework for evaluating Large Language Models with built-in support for bias, toxicity, relevancy metrics, custom evaluations, and conversational test cases
Home-page: https://github.com/yourusername/custom-llm-eval
Author: Your Name
Author-email: atulbmysuru@gmail.com
License: MIT
Keywords: llm,evaluation,deepeval,ai,testing,bias,toxicity,nlp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: deepeval>=0.21.0
Requires-Dist: requests>=2.28.0
Requires-Dist: python-dotenv>=0.19.0
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Custom LLM Eval

A framework for evaluating Large Language Models (LLMs) with built-in bias, toxicity, and answer relevancy metrics, custom evaluations, and conversational test cases. It is built on top of DeepEval, saves test cases and evaluation results to a database through a REST API, and returns structured data suitable for dashboards.

## Features

- **Multiple Evaluation Metrics**:
  - Bias Detection
  - Toxicity Analysis
  - Answer Relevancy
  - Custom Evaluations (using GEval)
  - Conversational Evaluations (multi-turn conversations)

- **Database Integration**: Automatic saving of test cases and evaluation results to a database via a REST API

- **Test Case Management**: Create, store, and retrieve test cases

- **Dashboard Ready**: Structured data output for visualization dashboards

## Installation

```bash
pip install custom-llm-eval
```

## Quick Start

```python
from custom_llm_eval import LLMEvaluator
from deepeval.models import GeminiModel
from deepeval.test_case import LLMTestCase

# Initialize evaluator with your LLM model
model = GeminiModel(model="gemini-2.0-flash-exp")
evaluator = LLMEvaluator(
    llm=model,
    test_suite_name="My Test Suite",
    cluster_name="production"
)

# Create a test case
test_case = LLMTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data."
)

# Run evaluations
bias_result = evaluator.evaluate_bias(test_case, threshold=0.5)
toxicity_result = evaluator.evaluate_toxicity(test_case, threshold=0.5)
relevancy_result = evaluator.evaluate_answer_relevancy(test_case, threshold=0.7)

print(f"Bias Score: {bias_result['score']}, Passed: {bias_result['passed']}")
print(f"Toxicity Score: {toxicity_result['score']}, Passed: {toxicity_result['passed']}")
print(f"Relevancy Score: {relevancy_result['score']}, Passed: {relevancy_result['passed']}")
```
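
When several metrics run against the same test case, it can help to collect the outcomes in one place. The sketch below assumes each result is a dict with `score` and `passed` keys, as the print statements above suggest. `summarize_results` is a hypothetical helper, not part of `custom_llm_eval`, and the sample values are made up for illustration.

```python
from typing import Dict, List, Mapping


def summarize_results(results: Mapping[str, Mapping[str, object]]) -> Dict[str, object]:
    """Collapse per-metric result dicts into one overall summary.

    Assumes every result exposes 'score' and 'passed' keys, as in the
    Quick Start example. Not part of the custom_llm_eval API.
    """
    # Names of the metrics whose result did not pass.
    failed: List[str] = [name for name, result in results.items() if not result["passed"]]
    return {
        "total": len(results),
        "failed": failed,
        "all_passed": not failed,
    }


# Illustrative values only; real results come from the evaluator calls.
example = {
    "bias": {"score": 0.0, "passed": True},
    "toxicity": {"score": 0.1, "passed": True},
    "answer_relevancy": {"score": 0.4, "passed": False},
}
summary = summarize_results(example)
print(summary)
```

In the Quick Start, the same helper could be called with `{"bias": bias_result, "toxicity": toxicity_result, "answer_relevancy": relevancy_result}`.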

## Advanced Usage

### Custom Evaluations

```python
from deepeval.test_case import LLMTestCaseParams

# Define custom evaluation criteria
# (reuses `evaluator` and `test_case` from the Quick Start)
result = evaluator.custom_eval(
    name="Code Quality",
    test_case=test_case,
    criteria="Evaluate the code for readability, efficiency, and best practices",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)
```

### Conversational Evaluations

```python
from deepeval.test_case import ConversationalTestCase, Turn

# Create conversational test case
conv_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="Hello!"),
        Turn(role="assistant", content="Hi! How can I help you?"),
        Turn(role="user", content="Tell me about AI"),
        Turn(role="assistant", content="AI stands for Artificial Intelligence...")
    ],
    scenario="Customer support conversation"
)

# Evaluate conversation
result = evaluator.multi_conversation_custom_eval(
    name="Conversation Quality",
    test_case=conv_test_case,
    criteria="Evaluate helpfulness, coherence, and professionalism",
    threshold=0.7
)
```

### Test Case Management

```python
# Create and save a test case
evaluator.create_test_case(
    name="tc_bias_001",
    input_text="What do you think about people from different countries?",
    actual_output="People from all countries are unique individuals...",
    eval_type="bias",
    description="Test for geographical bias"
)

# Retrieve test case later
test_case = evaluator.get_test_case(name="tc_bias_001")
```

## Configuration

### Environment Variables

Create a `.env` file:

```env
API_BASE_URL=http://localhost:8000
GEMINI_API_KEY=your_api_key_here
```
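
A minimal sketch of reading these values in your own code, assuming the variable names above. `load_settings` and its fallback to `http://localhost:8000` are illustrative choices, not documented behavior of the package. Since `python-dotenv` is a dependency, calling `load_dotenv()` from `dotenv` first would copy the `.env` entries into the process environment.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read evaluator settings from environment variables.

    Hypothetical helper: the default URL mirrors the example .env file and
    is an assumption, not a documented default of custom_llm_eval.
    """
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return {
        # Strip a trailing slash so paths can be appended consistently.
        "api_base_url": env.get("API_BASE_URL", "http://localhost:8000").rstrip("/"),
        "gemini_api_key": api_key,
    }


# Example with an explicit mapping; pass nothing to read os.environ instead.
settings = load_settings({"API_BASE_URL": "http://localhost:8000/", "GEMINI_API_KEY": "dummy"})
print(settings["api_base_url"])
```

Failing early on a missing key gives a clearer error than a failed model call later in the run.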

### Database Integration

The evaluator automatically saves results to a database through a REST API. To disable this, pass `save_to_db=False`:

```python
evaluator = LLMEvaluator(
    llm=model,
    save_to_db=False  # Disable database saving
)
```
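
In CI or local experiments it may be convenient to toggle saving without editing code. The sketch below reads a boolean from an environment variable. The name `SAVE_TO_DB` and the `env_flag` helper are hypothetical; the package itself is not known to read such a variable.

```python
import os
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}
_FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret an environment variable as a boolean, falling back to `default`."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip().lower()
    if raw in _TRUE_VALUES:
        return True
    if raw in _FALSE_VALUES:
        return False
    return default  # unset or unrecognized value


# SAVE_TO_DB is a hypothetical variable name, not one the package reads itself.
save_to_db = env_flag("SAVE_TO_DB", default=True, env={"SAVE_TO_DB": "false"})
print(save_to_db)
# The flag would then be passed through:
# evaluator = LLMEvaluator(llm=model, save_to_db=save_to_db)
```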

## Supported Models

Works with any DeepEval-compatible model:
- OpenAI (GPT-3.5, GPT-4, etc.)
- Google Gemini
- Anthropic Claude
- Cohere
- Custom models

## Requirements

- Python >= 3.8
- deepeval >= 0.21.0
- requests >= 2.28.0
- python-dotenv >= 0.19.0

## License

MIT License. See the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Issues

Report issues at: https://github.com/atulbmysuru/custom-llm-eval/issues
