Metadata-Version: 2.4
Name: traceurs-v1
Version: 1.0.0
Summary: Traceurs v1 Incident OS - Production incident replay and root cause analysis for AI agents
Author: Botmartz IT Solutions
License-Expression: MIT
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.100.0
Requires-Dist: uvicorn>=0.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: anthropic>=0.7.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: httpx>=0.25.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: export
Requires-Dist: reportlab>=4.0.0; extra == "export"
Requires-Dist: weasyprint>=59.0; extra == "export"
Dynamic: license-file

# Traceurs v1 Incident OS

Production incident replay and root cause analysis for AI agents.

When an AI agent fails in production, Traceurs reconstructs the full run across prompts, tool calls, policy decisions, human overrides, outputs, and costs so your team can replay the incident, find root cause fast, and ship the fix with evidence.

Built and maintained by Botmartz IT Solutions.

## Why Traceurs

Production AI incidents are difficult to reconstruct because the failure path is usually spread across model calls, tool execution, policy checks, human overrides, and downstream side effects.

Traceurs gives you one focused workflow for that moment:

- ingest the run
- replay the exact sequence
- identify the failure point
- export evidence for debugging and postmortems

This v1 release is intentionally narrow. It is designed to own the incident moment first, then expand carefully.

## What You Get

- Incident ingestion API for production AI runs
- Replay timeline for span-by-span inspection
- Root-cause analysis with deterministic fallback when no model key is configured
- JSON and PDF export for postmortems and evidence sharing
- Local replay UI for self-hosted evaluation and debugging
- Python packaging for installation and release to PyPI
- GitHub Actions for test, build, and release automation

## Quick Start

### Installation

```bash
# Clone and navigate
git clone <repo>
cd v1-incident-os

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Install as an editable package
pip install -e .

# Set up environment
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
```

### Run the Server

```bash
# Development (hot reload)
make dev

# Production
make start

# Installed package command
traceurs-v1-api

# API docs
open http://localhost:8000/docs  # macOS; on other systems, open the URL in a browser
```

### Run the Replay UI

```bash
cd frontend/replay-ui
npm install
npm run dev

# Open http://localhost:3000
```

The UI expects the backend API at `http://localhost:8000` by default. Override it with `VITE_TRACEURS_API_BASE_URL` if needed.

### Run Tests

```bash
# All tests
make test

# Watch mode
make test-watch

# With coverage
make test-coverage
```

## API Overview

### Core Endpoints

**POST /v1/runs** — Ingest agent incident
```bash
curl -X POST http://localhost:8000/v1/runs \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "trace-001",
    "agent_id": "research-bot",
    "agent_name": "ResearchBot",
    "model": "gpt-4o",
    "status": "success",
    "run_started_at": 1715000000000,
    "run_ended_at": 1715000005000,
    "spans": [...]
  }'
```
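
The same request can be sent from Python. The sketch below uses only the standard library; `build_run_payload` and `post_run` are hypothetical helper names, not part of the Traceurs package, and it assumes the endpoint replies with a JSON body.

```python
import json
import urllib.request


def build_run_payload(trace_id, agent_id, agent_name, model, status,
                      started_ms, ended_ms, spans):
    """Assemble a run body using the field names from the curl example."""
    return {
        "trace_id": trace_id,
        "agent_id": agent_id,
        "agent_name": agent_name,
        "model": model,
        "status": status,
        "run_started_at": started_ms,  # epoch milliseconds, as in the example
        "run_ended_at": ended_ms,
        "spans": list(spans),
    }


def post_run(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/runs (assumes the response body is JSON)."""
    request = urllib.request.Request(
        f"{base_url}/v1/runs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running locally, `post_run(build_run_payload(...))` should be equivalent to the curl call above.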

**GET /v1/runs/{trace_id}/timeline** — Retrieve incident timeline
```bash
curl http://localhost:8000/v1/runs/trace-001/timeline
```

**GET /v1/runs/{trace_id}/export.json** — Export as JSON
```bash
curl http://localhost:8000/v1/runs/trace-001/export.json > incident.json
```

**GET /v1/runs/{trace_id}/export.pdf** — Export as PDF
```bash
curl http://localhost:8000/v1/runs/trace-001/export.pdf > incident.pdf
```

**GET /v1/runs** — List all incidents
```bash
curl "http://localhost:8000/v1/runs?agent_id=research-bot&limit=50"
```
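
When building this URL in code, encoding the query parameters avoids quoting problems. A minimal sketch, assuming only the `agent_id` and `limit` filters shown above; `list_runs_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def list_runs_url(base_url="http://localhost:8000", **filters):
    """Build the GET /v1/runs URL, skipping filters that are None."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{base_url}/v1/runs" + (f"?{query}" if query else "")
```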

**GET /health** — Health check
```bash
curl http://localhost:8000/health
```

## Span Types

Traceurs captures multiple span types to reconstruct the complete incident:

### TokenSpan
A model invocation with token counts, latency, and cost.
```python
{
  "trace_id": "trace-001",
  "span_id": "span-1",
  "span_type": "token",
  "model": "gpt-4o",
  "input_tokens": 1200,
  "output_tokens": 800,
  "total_tokens": 2000,
  "ttft_ms": 450,
  "total_duration_ms": 2300,
  "finish_reason": "tool_calls",
  "cost_usd": 0.045
}
```
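
Before ingesting a token span, it can help to check that its numbers agree with each other. The sketch below is an illustrative client-side check, not validation that Traceurs is known to perform; `check_token_span` is a hypothetical name.

```python
def check_token_span(span):
    """Return a list of consistency problems found in a token span dict."""
    problems = []
    if span["input_tokens"] + span["output_tokens"] != span["total_tokens"]:
        problems.append("total_tokens does not equal input_tokens + output_tokens")
    if span["ttft_ms"] > span["total_duration_ms"]:
        problems.append("ttft_ms exceeds total_duration_ms")
    return problems
```

An empty list means the span passed both checks.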

### ToolCallSpan
Agent tool invocation.
```python
{
  "trace_id": "trace-001",
  "span_id": "span-2",
  "parent_span_id": "span-1",
  "span_type": "tool_call",
  "tool_name": "search_web",
  "tool_input": {"query": "AI safety"},
  "tool_output": {"results": [...]},
  "duration_ms": 800
}
```
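
The `parent_span_id` field links a tool call to the span that triggered it. A minimal sketch of how those links could be turned into a nested view, assuming spans without `parent_span_id` are roots and the links contain no cycles; the function names are hypothetical.

```python
from collections import defaultdict


def children_by_parent(spans):
    """Map each parent span id (None for roots) to its child span ids."""
    tree = defaultdict(list)
    for span in spans:
        tree[span.get("parent_span_id")].append(span["span_id"])
    return dict(tree)


def walk(tree, parent=None, depth=0):
    """Yield (depth, span_id) pairs in depth-first order."""
    for span_id in tree.get(parent, []):
        yield depth, span_id
        yield from walk(tree, span_id, depth + 1)
```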

### GuardrailSpan
Policy or security check.
```python
{
  "trace_id": "trace-001",
  "span_id": "span-3",
  "span_type": "guardrail",
  "guardrail_name": "jailbreak_detector",
  "check_category": "jailbreak",
  "severity": "critical",
  "action": "block",
  "reason": "Detected potential jailbreak attempt"
}
```
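
A blocking guardrail is often a natural first place to look when investigating a failed run. The sketch below shows one simple heuristic over a list of span dicts, assuming the list is already in execution order. It illustrates the idea only and is not a description of how Traceurs' root-cause analysis works; `first_blocking_guardrail` is a hypothetical name.

```python
def first_blocking_guardrail(spans):
    """Return the first guardrail span whose action is "block", or None."""
    for span in spans:
        if span.get("span_type") == "guardrail" and span.get("action") == "block":
            return span
    return None
```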

### LifecycleSpan
Agent lifecycle events.
```python
{
  "trace_id": "trace-001",
  "span_id": "span-4",
  "span_type": "lifecycle",
  "event": "agent_started",
  "agent_id": "research-bot",
  "loop_iteration": 1
}
```

### OverrideSpan
Human intervention.
```python
{
  "trace_id": "trace-001",
  "span_id": "span-5",
  "span_type": "override",
  "override_type": "correction",
  "override_reason": "User corrected tool output",
  "override_by": "user@company.com",
  "changed_value": {...}
}
```

### CostSpan
Cost tracking.
```python
{
  "trace_id": "trace-001",
  "span_id": "span-6",
  "span_type": "cost",
  "cost_category": "tokens",
  "cost_usd": 0.045,
  "units": 2000,
  "budget_remaining": 9.955
}
```
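
Cost spans can be rolled up to see where a run spent its budget. A minimal sketch that sums `cost_usd` per `cost_category`; it counts only spans with `span_type` set to `"cost"`, on the assumption that token spans may report the same spend a second time. `cost_by_category` is a hypothetical name.

```python
from collections import defaultdict


def cost_by_category(spans):
    """Sum cost_usd per cost_category across cost spans only."""
    totals = defaultdict(float)
    for span in spans:
        if span.get("span_type") == "cost":
            totals[span["cost_category"]] += span["cost_usd"]
    return dict(totals)
```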

## Project Structure

```
v1-incident-os/
├── backend/
│   ├── main.py              # FastAPI app
│   ├── models.py            # Pydantic schemas
│   ├── database.py          # SQLite persistence
│   ├── root_cause.py        # LLM root cause analysis
│   ├── export_handler.py    # PDF/JSON export
│   └── __init__.py
├── frontend/
│   └── replay-ui/           # Vite + React replay console
├── tests/
│   └── test_v1_launch_gate.py
├── docs/
│   ├── v1-launch-checklist.md
│   ├── design-partner-onboarding.md
│   └── api.md
├── pyproject.toml
├── requirements.txt
├── .env.example
├── Makefile
├── README.md
└── Dockerfile
```

## Configuration

Copy `.env.example` to `.env` and configure:

```bash
# Database location
TRACEURS_DB_PATH=traceurs_incidents.db

# API server
API_HOST=0.0.0.0
API_PORT=8000

# Root cause analysis (optional; a deterministic fallback is used when no key is set)
ANTHROPIC_API_KEY=sk-ant-...

# Environment
ENVIRONMENT=development
LOG_LEVEL=INFO
```

## Development

### Code Style

```bash
make format   # Black formatter
make lint     # Ruff + mypy
```

### Testing

```bash
make test          # Run all tests
make test-watch    # Watch mode
make test-coverage # Coverage report
```

### Frontend Build

```bash
cd frontend/replay-ui
npm install
npm run build
```

## CI/CD and Releases

- CI validates every push and pull request by running backend tests and building the frontend.
- CD publishes the Python package to PyPI when you publish a GitHub release.
- GitHub release notes are categorized via `.github/release.yml`.

## Who This Is For

- AI platform and infrastructure teams running customer-facing agents
- Engineers who need to replay failures instead of guessing from logs
- Teams that want a focused incident workflow before adopting a broader control plane

## Positioning

Traceurs v1 is not trying to be a generic orchestration layer, a prompt manager, or a full governance suite.

Traceurs v1 is the incident operating system for production AI agents.

## Local End-to-End Flow

1. Start the backend with `make dev`.
2. Start the frontend with `npm run dev` from `frontend/replay-ui`.
3. Ingest a run using `POST /v1/runs`.
4. Open the replay UI and inspect/export the incident.

All tests live in `tests/test_v1_launch_gate.py`. Run them with `make test` before committing.

## Deployment

### Local Development
```bash
make dev
```

### Docker

```bash
# Build
make docker-build

# Run
make docker-run
```

### AWS Deployment (Coming Soon)
- ECS Fargate with RDS PostgreSQL
- See `deployment/aws/` for Terraform config

### Self-Hosted

Requirements: Python 3.10+, SQLite

```bash
# Install
pip install -r requirements.txt

# Run
make start

# Optional: use PostgreSQL instead of SQLite
# Update database.py to use psycopg2
```

## Roadmap

### v1.0 (Current)
- ✅ Incident ingestion API
- ✅ Timeline assembly & replay
- ✅ Root cause analysis (Claude-powered)
- ✅ JSON/PDF export
- ✅ Health check & stats
- 🔄 React replay UI (in progress)
- 🔄 Slack alerting (in progress)
- 🔄 PagerDuty integration (in progress)

### v1.1
- Multi-tenant support
- Better PDF exports (ReportLab)
- Replay playback (timeline scrubbing)
- Advanced filtering (by agent, model, status, cost)

### v2.0 (Phase 2 - Month 6+)
- Governance & compliance
- Cost optimization recommendations
- Agent performance benchmarking
- Marketplace integrations

## Support

**Documentation:** See the `docs/` folder

**Issues:** GitHub Issues

**Discord:** Join community (link coming)

## License

MIT License - See LICENSE file

Copyright (c) 2026 Botmartz IT Solutions

## Contributing

We welcome contributions! See CONTRIBUTING.md

---

**Traceurs** — The incident operating system for production AI agents.

Built for platform teams running customer-facing AI agents who need to investigate failures fast.
