Metadata-Version: 2.4
Name: explainio-airflow-agent
Version: 0.2.0
Summary: Explain.io FinOps Airflow Cost Agent
Project-URL: Homepage, https://explain.io
Project-URL: Repository, https://github.com/eskarimov/Explain.io
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Framework :: Apache Airflow
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: requests

# 💸 Explain.io — Airflow Cost Agent

[![PyPI](https://img.shields.io/pypi/v/explainio-airflow-agent.svg)](https://pypi.org/project/explainio-airflow-agent/)
[![Beta](https://img.shields.io/badge/Status-Beta-blue.svg)]()
[![Python](https://img.shields.io/badge/Python-3.9+-yellow.svg)]()
[![Airflow](https://img.shields.io/badge/Airflow-2.x%20%7C%203.x-red.svg)]()

**Explain.io** is a FinOps tool for Data Engineers. This lightweight Airflow plugin automatically attributes Google Cloud BigQuery compute costs to the exact DAG and Task that triggered them.

No more guessing which pipeline caused the bill spike. No code changes required.

## ✨ Features

* **Zero Code Changes** — Uses Airflow's [Listener](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/listeners.html) interface to automatically intercept all BigQuery task completions. You don't need to modify any DAG or add any callbacks.
* **Zero Blast Radius** — All network requests run asynchronously in a separate thread with strict timeouts. If the Explain.io API is unreachable, your DAG **will still succeed**: exceptions are caught and logged, never re-raised into the task.
* **Instant Dashboard** — View your pipeline costs, projected monthly spend, and heaviest DAGs at [explain.io](https://explain.io).

## 🔍 How It Works

```
┌───────────────┐     ┌──────────────────────┐     ┌──────────────────┐
│  Airflow      │     │  Explain.io Agent    │     │  Explain.io API  │
│  (Scheduler)  │────▶│  (Listener Plugin)   │────▶│  (Backend)       │
│               │     │                      │     │                  │
│  Task succeeds│     │  1. Intercepts event │     │  Fetches BQ cost │
│               │     │  2. Extracts job ID  │     │  from GCP API    │
│               │     │  3. POSTs to API     │     │  and stores it   │
└───────────────┘     └──────────────────────┘     └──────────────────┘
```

The agent registers as an Airflow **Listener plugin**. On every task success, it checks if the task was a BigQuery operator, extracts the job ID via XCom, and sends it to the Explain.io API — all in a background thread.
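The fire-and-forget behaviour described above can be illustrated with a minimal sketch. It is not the agent's actual source: the function names, the payload shape, and the `Authorization` header are assumptions made for illustration, and it uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies.

```python
import json
import logging
import threading
import urllib.request

log = logging.getLogger("explainio_sketch")


def post_json(url, payload, api_key, timeout=5.0):
    """Blocking POST with a hard timeout (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Header name and scheme are assumptions, not the documented API.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def send_in_background(url, payload, api_key, post=post_json):
    """Run the POST in a daemon thread and swallow every failure."""

    def worker():
        try:
            post(url, payload, api_key)
        except Exception as exc:
            # A reporting error must never reach the Airflow task.
            log.warning("[ExplainIO] failed to send event: %s", exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

In a real listener, a call like `send_in_background(...)` would sit inside Airflow's `on_task_instance_success` hook. Because the worker catches every exception and the caller never joins the thread, a slow or unreachable endpoint costs the task nothing beyond starting a thread.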

---

## 🚀 Installation

### Quick Start (Any Environment)

```bash
pip install explainio-airflow-agent
```

Then set two environment variables in your Airflow environment:

| Variable | Required | Description |
|----------|----------|-------------|
| `EXPLAIN_IO_API_KEY` | ✅ | Your Explain.io API key (get one from the dashboard) |
| `EXPLAIN_IO_API_URL` | ❌ | API endpoint. Defaults to `https://api.explain.io/api/v1/ingest` |

**Verify it's working:**

```bash
airflow plugins
```

You should see `cost_agent_plugin` in the output.

---

## 📦 Installation by Platform

### Self-Hosted: Docker (Recommended)

> **Don't use `_PIP_ADDITIONAL_REQUIREMENTS`** — it reinstalls packages on every container start and is only meant for quick testing.

Extend the official Airflow image with a `Dockerfile`:

```dockerfile
FROM apache/airflow:3.1.6

USER airflow
RUN pip install --no-cache-dir explainio-airflow-agent
```

Then in your `docker-compose.yaml`, replace `image:` with `build:`:

```yaml
x-airflow-common:
  &airflow-common
  build: ./airflow          # instead of: image: apache/airflow:3.1.6
  environment:
    EXPLAIN_IO_API_KEY: ${EXPLAIN_IO_API_KEY}
    EXPLAIN_IO_API_URL: ${EXPLAIN_IO_API_URL}
    # ... other env vars
```

Build and start:

```bash
docker compose build
docker compose up -d
```

> **Important:** All Airflow components (scheduler, webserver or API server, workers, triggerer) must use the same image. The shared `x-airflow-common` anchor ensures this automatically.

---

### Self-Hosted: Bare Metal / virtualenv

Install directly into the Python environment where Airflow runs:

```bash
pip install explainio-airflow-agent
```

Set the environment variables:

```bash
export EXPLAIN_IO_API_KEY="your-api-key"
export EXPLAIN_IO_API_URL="https://api.explain.io/api/v1/ingest"  # optional
```

If you run a multi-node setup (e.g., separate scheduler and workers), the package must be installed on **every node**.

---

### AWS MWAA (Managed Workflows for Apache Airflow)

1. Create or update your `requirements.txt` with the Airflow constraints file:

   ```txt
   --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.1/constraints-3.11.txt"
   explainio-airflow-agent==0.2.0
   ```

   > **Adjust the constraints URL** to match your MWAA Airflow version and Python version. See [MWAA docs](https://docs.aws.amazon.com/mwaa/latest/userguide/working-dags-dependencies.html).

2. Upload `requirements.txt` to your MWAA environment's S3 bucket.

3. In the AWS Console → MWAA → your environment → **Edit** → point to the new `requirements.txt` path → **Save**.

4. Set `EXPLAIN_IO_API_KEY` and (optionally) `EXPLAIN_IO_API_URL` under **Environment variables** → **Airflow configuration options**, using the prefix `AIRFLOW__`:

   - Key: `EXPLAIN_IO_API_KEY`, Value: `your-api-key`

5. Wait for the environment update to complete (can take 20–40 minutes).

**Troubleshooting:** Check CloudWatch Logs → Log group: `airflow-{env-name}-requirements_install` for pip errors.
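The constraints URL in step 1 follows a fixed pattern based on the Airflow version and the Python minor version. A small helper (hypothetical, not part of the agent) can build it, which helps avoid typos when adjusting it for your environment:

```python
def constraints_url(airflow_version: str, python_version: str) -> str:
    """Build the Airflow constraints URL for an Airflow and Python version."""
    # Constraint files are keyed by Python major.minor only, e.g. "3.11".
    major_minor = ".".join(python_version.split(".")[:2])
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{major_minor}.txt"
    )


print(constraints_url("2.10.1", "3.11.9"))
```

For Airflow 2.10.1 on Python 3.11 this prints the same URL used in the `requirements.txt` example above.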

---

### Google Cloud Composer

#### Option A — Console UI

1. Go to **Environments** → click your environment → **PyPI Packages** tab.
2. Click **Edit** → **Add Package**.
3. Enter:
   - Package name: `explainio-airflow-agent`
   - Version: `>=0.1.2`
4. **Save**. Composer will restart workers and the scheduler.

#### Option B — gcloud CLI

```bash
gcloud composer environments update YOUR_ENV_NAME \
    --location YOUR_LOCATION \
    --update-pypi-package "explainio-airflow-agent>=0.1.2"
```

#### Environment variables

Set via the Console UI (Environment Variables tab) or CLI:

```bash
gcloud composer environments update YOUR_ENV_NAME \
    --location YOUR_LOCATION \
    --update-env-variables EXPLAIN_IO_API_KEY=your-api-key
```

---

### Astronomer (Astro CLI)

1. Add the package to `requirements.txt` in your Astro project root:

   ```txt
   explainio-airflow-agent==0.2.0
   ```

2. Set environment variables in your `.env` file:

   ```
   EXPLAIN_IO_API_KEY=your-api-key
   ```

3. Restart:

   ```bash
   astro dev restart
   ```

4. Verify:

   ```bash
   astro dev bash --scheduler "airflow plugins"
   ```

For Astro Cloud deployments, add the environment variables via the Astro UI under **Deployments → Environment Variables**.

---

## ⚙️ Configuration Reference

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `EXPLAIN_IO_API_KEY` | ✅ | — | Your Explain.io API key. If not set, the plugin logs an info message and becomes a no-op. |
| `EXPLAIN_IO_API_URL` | ❌ | `https://api.explain.io/api/v1/ingest` | Override the API endpoint (e.g., for self-hosted backends). |
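The rules in this table can be sketched as a small loader. This is an illustration under assumptions rather than the agent's real code: `AgentConfig` and `load_config` are hypothetical names, and treating an empty string the same as an unset variable is a design choice of this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_API_URL = "https://api.explain.io/api/v1/ingest"


@dataclass(frozen=True)
class AgentConfig:
    api_key: str
    api_url: str


def load_config(env: Mapping[str, str] = os.environ) -> Optional[AgentConfig]:
    """Return the agent configuration, or None when the agent should be a no-op."""
    api_key = env.get("EXPLAIN_IO_API_KEY", "").strip()
    if not api_key:
        # No key: the caller logs an info message and does nothing else.
        return None
    # Fall back to the default endpoint when the override is unset or empty.
    api_url = env.get("EXPLAIN_IO_API_URL", "").strip() or DEFAULT_API_URL
    return AgentConfig(api_key=api_key, api_url=api_url)
```

Falling back on an empty value matters for setups such as the `docker-compose.yaml` above, where `${EXPLAIN_IO_API_URL}` expands to an empty string if the variable is not defined on the host.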

---

## 🔧 Troubleshooting

### Plugin not showing in `airflow plugins`

- Make sure the package is installed in the **same Python environment** as Airflow.
- Run `pip show explainio-airflow-agent` to confirm it's installed.
- Check Airflow scheduler logs for import errors.
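To check the first two points from inside the interpreter that Airflow actually uses, a short script like the following may help. It relies only on the standard library; `installed_version` is a hypothetical helper name.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("explainio-airflow-agent")
print(f"interpreter: {sys.executable}")
print(f"explainio-airflow-agent: {version or 'NOT INSTALLED'}")
```

Run it with the same Python executable that launches the scheduler. If the interpreter path differs from the one `pip` installed into, the package is in the wrong environment.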

### Events not appearing in the dashboard

- Confirm `EXPLAIN_IO_API_KEY` is set. Without it, the agent logs an info message and becomes a no-op.
- Check that `EXPLAIN_IO_API_URL` points to a reachable endpoint.
- Look for `[ExplainIO]` log lines in the Airflow task logs.

### AWS MWAA: Package fails to install

- Ensure the `--constraint` line in `requirements.txt` matches your MWAA Airflow + Python version.
- Check the `requirements_install` log stream in CloudWatch.

---

## dbt Support

Explain.io automatically captures BigQuery costs from dbt-core tasks running on Airflow.

### Basic Mode (Zero-Config)

If your Airflow DAGs run `dbt run` via `BashOperator`, `KubernetesPodOperator`, or [Cosmos](https://github.com/astronomer/astronomer-cosmos), Explain.io will automatically detect BigQuery jobs that ran during each task's execution window.

**Requirements:**
- Install the Explain.io Airflow agent (you already did this)
- Connect your GCP project in the Explain.io dashboard
- Ensure the service account has `roles/bigquery.resourceViewer` (for `bigquery.jobs.listAll`)

Costs will appear in your dashboard grouped by DAG and task.
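The idea of matching jobs to a task's execution window can be sketched as follows. The backend's real matching logic is not documented here, so this is only an illustration: the `BigQueryJob` type and the rule "job creation time falls inside the window" are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class BigQueryJob:
    job_id: str
    created_at: datetime


def jobs_in_task_window(
    jobs: Iterable[BigQueryJob], task_start: datetime, task_end: datetime
) -> List[BigQueryJob]:
    """Keep the jobs whose creation time lies within the task's run window."""
    return [job for job in jobs if task_start <= job.created_at <= task_end]


utc = timezone.utc
jobs = [
    BigQueryJob("job_before", datetime(2024, 1, 1, 9, 59, tzinfo=utc)),
    BigQueryJob("job_during", datetime(2024, 1, 1, 10, 5, tzinfo=utc)),
    BigQueryJob("job_after", datetime(2024, 1, 1, 10, 31, tzinfo=utc)),
]
matched = jobs_in_task_window(
    jobs,
    task_start=datetime(2024, 1, 1, 10, 0, tzinfo=utc),
    task_end=datetime(2024, 1, 1, 10, 30, tzinfo=utc),
)
print([job.job_id for job in matched])  # ['job_during']
```

A match based on time alone assigns a job to every task whose window contains it, so tasks that overlap in time within the same project cannot be told apart by this rule.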

### Enhanced Mode (Model-Level Attribution)

For per-dbt-model cost breakdowns, add job labeling to your `dbt_project.yml`:

```yaml
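# dbt_project.yml
query-comment:
  # Assumes a `query_comment` macro is defined in your project's macros/ directory.
  comment: "{{ query_comment(node) }}"
  job-label: true   # BigQuery only: attaches the comment's items to each job as labels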
```

This tells dbt to label every BigQuery job with the model name (`node_id`). Explain.io reads these labels to show you exactly which dbt models are costing the most.
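Once jobs carry a model label, a per-model breakdown is a simple aggregation. The sketch below is illustrative only: the job dictionaries mimic rows with a `labels` mapping and a `total_bytes_billed` value, the label key `node_id` is an assumption about how the label is named, and the price per TiB is a parameter you would set from your own BigQuery pricing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

TIB = 2 ** 40  # bytes in one tebibyte


def cost_by_model(
    jobs: Iterable[Mapping],
    price_per_tib_usd: float,
    label_key: str = "node_id",  # assumed label name, adjust to your setup
) -> Dict[str, float]:
    """Sum billed bytes per dbt model label and convert the totals to USD."""
    bytes_by_model: Dict[str, int] = defaultdict(int)
    for job in jobs:
        model = job.get("labels", {}).get(label_key, "(unlabeled)")
        bytes_by_model[model] += job.get("total_bytes_billed") or 0
    return {
        model: billed / TIB * price_per_tib_usd
        for model, billed in bytes_by_model.items()
    }


sample_jobs = [
    {"labels": {"node_id": "model.shop.orders"}, "total_bytes_billed": 2 * TIB},
    {"labels": {"node_id": "model.shop.orders"}, "total_bytes_billed": 1 * TIB},
    {"labels": {"node_id": "model.shop.users"}, "total_bytes_billed": TIB // 2},
    {"labels": {}, "total_bytes_billed": None},
]
# 6.25 USD per TiB is an example rate, not a statement of current pricing.
print(cost_by_model(sample_jobs, price_per_tib_usd=6.25))
```

Jobs without the label end up in an `(unlabeled)` bucket, which makes it easy to spot queries that did not originate from a labeled dbt model.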

### GCP Permissions

Your service account needs the **BigQuery Resource Viewer** role (`roles/bigquery.resourceViewer`) to query `INFORMATION_SCHEMA.JOBS_BY_PROJECT`. This is read-only access to job metadata — it cannot read your table data or run queries against your datasets.

Grant it via:

```bash
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:YOUR_SA@YOUR_PROJECT.iam.gserviceaccount.com" \
  --role="roles/bigquery.resourceViewer"
```

---

## 📄 License

MIT
