Metadata-Version: 2.4
Name: databricks-dq-framework
Version: 1.0.24
Summary: Metadata-driven Data Quality Assessment Framework for Databricks / Delta Lake
Author: Nitin Mathew George
License-Expression: MIT
Keywords: databricks,delta-lake,data-quality,pyspark,dq-framework
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: python-dateutil>=2.8
Requires-Dist: pandas>=2.0
Provides-Extra: standalone
Requires-Dist: pyspark>=3.3; extra == "standalone"
Requires-Dist: delta-spark>=2.0; extra == "standalone"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"

# Data Quality Assessment Framework — Databricks / Delta Lake

**Open-source Python library** reimplementing the SQL Server DQ Assessment
Framework natively on Databricks with Delta tables, PySpark DataFrames, and
Python-based validation functions.

---

## What this Library Provides

A **metadata-driven, PySpark port of a SQL Server Data Quality Assessment Framework**
for Databricks/Delta Lake. It validates data in curated Delta tables using configurable
rules — no code changes needed to add new fields or update logic.

| Feature | SQL Server Implementation | Databricks Implementation |
|---|---|---|
| Storage | SQL Server tables | Delta Lake tables (Unity Catalog) |
| Pattern evaluation | T-SQL TVFs (generated dynamically) | Python closures (generated dynamically) |
| Row-level evaluation | `CROSS APPLY TVF` | PySpark UDF + `DeltaTable.merge()` |
| Custom rules | SQL expressions in `configCustomQuery` | Plain regex patterns or named validators |
| Email validation | Complex `LIKE` / `CHARINDEX` SQL | Native Python regex (enhanced) |
| Output | `SELECT` result sets | Spark DataFrames |
| Same field configuration | Script_02 SQL `INSERT` | `ConfigManager` Python API or raw SQL |
| Same pattern library | 118 patterns | 118 patterns (identical) |
| Same precedence logic | L0A `RANK()` CTE | Python ranking equivalent |
| Same validation levels | L01→L02→L03→L04→L99 | L01→L02→L03→L04→L99 (identical) |
| Same DQ columns | `DQEligible BIT`, `DQViolations`, `DQFields` | `DQEligible BOOLEAN`, `DQViolations STRING`, `DQFields STRING` |
| Same dual logging | `auditDQChecks` + `statDQChecks` | Same Delta tables |
| Same ExecutionID | `NEWID()` GUID | `uuid.uuid4()` |
| Same `1=1` scope filtering | `WHERE 1=1 AND (...)` | Python conditional filter |
| Same sticky DQ logic | `CASE WHEN DQEligible=0 THEN 0 ELSE Result END` | Spark `when()` equivalent |

---

## Architecture at a Glance

```
masterField (config) ──► configFieldAllowedPattern ──► mapDQChecks
        │                          │
  ConfigManager API      resolve_pattern_rules.py (L0A precedence)
        │                          │
        └──────────────────────────┘
                        │
          generate_rule_functions.py
                        │
              FunctionRegistry (UDFs)
                        │
          data_assessment_rules.py (DQRunner)
                        │
   Curated Table: DQEligible | DQViolations | DQFields
                        │
        ┌───────────────┴───────────────┐
   auditDQChecks               statDQChecks
        └───────────────┬───────────────┘
               Reporting Views
          (v_auditDQChecks, v_statDQChecks)
```

**Data flow:** Config tables → precedence resolution → Python closure generation →
UDF-based row assessment → DQ column updates + dual audit logging → reporting.

---

## Quick Start

```python
from dq_framework import DQFramework

# 1. Initialise
dq = DQFramework(spark, catalog="main", schema="dq")
dq.setup()

# 2. Configure a field (ConfigManager API — no SQL required)
cfg = dq.config
cfg.register_field(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", data_category_type_id=4)
cfg.set_field_values(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", min_data_length=6, max_data_length=255)
cfg.block_category(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "DataEmptiness")
cfg.allow_pattern(2, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "Has At Sign")
cfg.add_custom_query_regex(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS",
                           r"^[^@\s]+@[^@\s]+\.[^@\s]+$", must_match=True)
cfg.add_mapping(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS",
                target_schema_name="Curated", target_table_name="Entity_Email_Denorm",
                target_field_name="EMAIL_ADDRESS", target_catalog_name="main")

# 3. Validate config FK integrity before generating functions
cfg.verify_config()

# 4. Generate field-level checker functions from config
dq.generate_rule_functions()

# 4b. Optional: validate SQL-type custom queries before assessment
dq.validate_custom_queries_sql()

# 5. Run the DQ assessment
exec_id = dq.run_assessment(schema_name="Curated")

# 6. Query results
dq.violations(exec_id).display()
dq.quality_scores(exec_id).display()
dq.fields_below_threshold(threshold=80).display()
dq.summary_by_table().display()
```

See [sample_usage/](sample_usage/) for hands-on demo notebooks, sample CSV data, and the Excel config template.

---

## Built-in Usage Guide

```python
dq.guide()          # Full workflow, FullFieldName patterns, all ConfigManager methods
dq.sample_usage(spark)  # Print path to demo notebooks, sample data, and Excel template
```

---

## Prerequisite: DQ Columns on Curated Tables

Each curated Delta table being assessed must have three extra columns.
Use the built-in helper — it reads `mapDQChecks` and adds them to every mapped table at once:

```python
dq.prepare_curated_tables()                  # all tables referenced in mapDQChecks
dq.prepare_curated_table("Schema", "Table")  # or a single table
```

Both methods are **idempotent** (safe to re-run) and require **ALTER privilege** on each target table.
The assessment runner will raise a clear error if columns are missing, pointing to these methods.

The three columns added are:

```sql
DQEligible   BOOLEAN  -- 1=all checks passed, 0=at least one failed, NULL=not assessed
DQViolations STRING   -- [field: ViolationType], ... (accumulated violations)
DQFields     STRING   -- [field1], [field2] (all fields assessed on this row)
```

| Column | Value | Meaning |
|---|---|---|
| `DQEligible` | `None` / `NULL` | Row not yet assessed |
| `DQEligible` | `True` | All configured checks passed |
| `DQEligible` | `False` | At least one check failed |
| `DQViolations` | e.g. `[email_address: Data Type]` | Violated rules (sticky — accumulates across fields) |
| `DQFields` | e.g. `[email_address], [phone]` | All fields assessed on this row |

---

## Package Structure

```
dq_framework/
├── __init__.py                        # Exports: DQFramework, ConfigManager
├── framework.py                       # DQFramework facade — top-level orchestration
├── ddl_framework_tables.py            # ≡ Script_00_DDL_Framework_Tables — Delta table DDL (9 tables)
├── seed_master_data.py                # ≡ Script_01_Master_Reference_Data — 27 categories + 118 patterns
├── config.py                          # ConfigManager — Python API for all 5 config tables
├── engine/
│   ├── pattern_checks.py              # All 118 pattern check implementations
│   ├── resolve_pattern_rules.py       # Pattern precedence (L0A CTE equivalent)
│   ├── generate_rule_functions.py     # ≡ p_DQ_GenerateRuleFunctions — FunctionRegistry + closures
│   └── data_assessment_rules.py       # ≡ p_DQ_DataAssessmentRules — DQRunner
├── reporting/
│   └── views.py                       # v_auditDQChecks + v_statDQChecks equivalents
└── agents/
    ├── __init__.py                    # Exports SYSTEM_PROMPT, DQ_TOOLS, suggest_config
    ├── context.py                     # LLM system prompt + short context
    ├── tools.py                       # OpenAI-compatible tool definitions + execute_tool_call()
    └── suggest.py                     # Schema-based config auto-suggester (no LLM required)

dq_framework/sample_usage/             # Bundled inside the PyPI package — extracted to /Workspace/Users/{you}/...
├── 00-infrastructure.py               # One-time: create catalog/schema/volume
├── 01-install.py                      # Per-session: %pip install + dq.setup() + copy CSVs to volume + load as Delta
├── 02-config-pyspark.py               # Seed all config tables via Python ConfigManager methods
├── 02-config-sql.py                   # Same config, seeded via SQL MERGE statements
├── 03-run.py                          # generate_rule_functions → run_assessment → reporting
├── mock_curated_contacts.csv          # Sample data — contacts
├── mock_curated_locations.csv         # Sample data — locations
├── mock_curated_vendors.csv           # Sample data — vendors
└── DBX DQ Framework - Data Model v2.xlsx  # Excel config authoring template (see Excel Template section)
                                       # Run dq.sample_usage(spark) to extract all files to your user workspace

notebooks/                             # Developer / contributor notebooks
├── 00_quickstart.py                   # Step-by-step walkthrough: configure → assess → results
├── 01_agent_assistant.py              # AI agent integration: auto-suggest, tool-calling, system prompt
├── 02_integration_test.py             # Integration test suite (run on cluster before publishing)
└── 03_seed_config.py                  # Paste Excel-generated INSERT SQL here

tests/
├── conftest.py                        # Path setup + test split explanation
└── unit/
    ├── test_pattern_checks.py         # All 118 pattern functions (no cluster needed)
    ├── test_resolve_pattern_rules.py  # Pattern precedence logic
    ├── test_generate_rule_functions.py # Closure building + FunctionRegistry
    ├── test_seed_master_data.py       # Seed data integrity (counts, IDs, FK)
    └── test_suggest.py                # Schema-based config suggester
```

> **Getting sample files:** Run `dq.sample_usage(spark)` — it extracts the bundled sample notebooks,
> CSV data, and Excel template to `/Workspace/Users/{you}/databricks-dq-framework/sample_usage/`
> automatically. No Repos clone needed. Works on serverless and all compute types.

---

## 9 Framework Tables

All tables live in the `dq` schema (configurable). Three categories:

| Category | Tables |
|---|---|
| **Framework-Seeded** (auto-populated by `setup()`) | `masterDataCategory`, `masterPattern` |
| **User-Managed** (populated via `ConfigManager` or SQL) | `masterField`, `configFieldValues`, `configFieldAllowedPattern`, `configCustomQuery`, `mapDQChecks` |
| **Results** (auto-populated by `run_assessment()`) | `auditDQChecks`, `statDQChecks` |

| Table | Purpose |
|---|---|
| `masterDataCategory` | 27 field type classifications |
| `masterPattern` | 118 built-in validation patterns. Custom patterns use `_ID >= 1000` (IDs 1–999 are reserved for built-ins) |
| `masterField` | Source fields registered for assessment |
| `configFieldValues` | Data length (L01) and value range (L04) boundaries |
| `configFieldAllowedPattern` | Pattern allow/block rules per field |
| `configCustomQuery` | Custom validation expressions — plain regex, Spark SQL, or named validators (L02) |
| `mapDQChecks` | Maps source field → curated table column |
| `auditDQChecks` | Row-level violation audit log (failures only, append-only) |
| `statDQChecks` | Aggregated pass/fail statistics per execution (append-only) |

> **`generate_rule_functions()` output**: Prints a `•` line for each registered checker with its SQL expression count. Any DQ function referenced in `mapDQChecks` that has no rules configured (in `configFieldValues`, `configFieldAllowedPattern`, or `configCustomQuery`) is flagged with ⚠ and skipped by the assessment runner — avoiding misleading 100% quality scores for unchecked fields.

---

## Validation Levels (in order, early-exit on first failure)

Each generated field checker implements the same 5-level hierarchy as the SQL TVF:

| Level | Check Type | Priority |
|---|---|---|
| **L01** | Data Length range (min/max character count) | Checked first |
| **L02** | Custom expression rules (regex, Spark SQL, or named validator) | Checked second |
| **L03** | 118-pattern checks (in PatternPriority order) | Checked third |
| **L04** | Data Value range (min/max value comparison) | Checked fourth |
| **L99** | Default PASS (if all prior checks pass) | Final |

**Early exit**: The checker returns immediately on first failure.
Re-run after fixing the first violation to surface the next.

---

## 118 Patterns across 10 Categories

| Category | Count | Examples |
|---|---|---|
| `DataType1` | 3 | Is Fully Numeric, Is Fully Decimal, Is Fully Text |
| `DataType2` | 1 | Is AlphaNumeric |
| `DataType3` | 4 | Is Date, Is Time, Is Timestamp, Is Boolean |
| `SpecialCharacter` | 32 | Has Comma, Has At Sign, Has Hyphen, Has Pipe Symbol, … |
| `SpaceFound` | 1 | Has Space |
| `DataEmptiness` | 34 | Is Empty or NULL, Is Virtually Empty with Spaces, … |
| `InvalidKeyword` | 40 | Has Keyword-n/a, Has Keyword-null, Has Keyword-missing, … |
| `FullyDuplicatedCharacter` | 1 | Has Fully Duplicated Character (e.g. "aaaa") |
| `UnicodeCharacters` | 1 | Has Unicode Characters (outside printable ASCII 0x20–0x7E) |
| `CasingCheck` | 2 | Has Lowercase Character, Has Uppercase Character |

**Total: 118 built-in patterns.** Pattern IDs 1–999 are reserved for framework use.
To add custom patterns (e.g. domain-specific invalid keywords), use IDs `>= 1000`.

---

## Configuring Pattern Rules (`configFieldAllowedPattern`)

This table is the heart of the framework — it controls which of the 118 built-in patterns
apply to each field, and whether each pattern is required, blocked, or suppressed.

### The three selector columns

Each row in `configFieldAllowedPattern` points into `masterPattern` via one of three columns.
Populate **one column per row** and leave the other two `NULL`.

| Column | Selects | Patterns activated |
|---|---|---|
| `PatternCategory` | An entire category family | Many — e.g. all 32 `SpecialCharacter` patterns at once |
| `PatternSubCategory` | A named group within a category | Several — e.g. all 8 `Separator` chars: `. , - _ / \ ; :` |
| `PatternName` | One exact pattern by name | Exactly 1 — e.g. only `Has At Sign` |

**Precedence when two rules conflict on the same pattern:** `PatternName` > `PatternSubCategory` > `PatternCategory`.
The most specific rule always wins. Within the same specificity level, `Allowed` wins over `Not Allowed`.

```python
cfg.block_category(1, "Source.T.FIELD", "SpecialCharacter")   # block all 32 special chars
cfg.allow_pattern(2, "Source.T.FIELD", "Has Hyphen")           # except hyphen  (PatternName wins)
cfg.allow_pattern(3, "Source.T.FIELD", "Has Full Stop")        # except .
cfg.allow_pattern(4, "Source.T.FIELD", "Has At Sign")          # except @
```

### When to use each level

**`PatternCategory`** — use when you want a blanket policy for a whole family:
```
SpecialCharacter   Not Allowed  → blocks all 32 special chars in one row
InvalidKeyword     Not Allowed  → blocks all 40 junk keywords (null, n/a, test, missing …)
DataType3          Not Allowed  → blocks values that look like dates, times, or booleans
UnicodeCharacters  Not Allowed  → blocks any character outside printable ASCII (0x20–0x7E)
FullyDuplicatedChar Not Allowed → blocks "aaaaaaa", "--------" etc.
```

**`PatternSubCategory`** — use when you want a subset without naming every pattern:
```
Separator    Allowed      → allows . , - _ / \ ; : all at once (instead of 8 separate rows)
Bracket      Not Allowed  → blocks ( ) [ ] { }
Emptiness    Not Allowed  → blocks NULL, all-spaces, and all 32 "virtually empty with X" patterns
Symbol       Not Allowed  → blocks ! @ # $ % ^ & * ~ + < > ? =
```

Available SubCategories in `masterPattern`:

| Category | SubCategory | Patterns included |
|---|---|---|
| `SpecialCharacter` | `Symbol` | `! @ # $ % ^ & * ~ + < > ? =` (15) |
| `SpecialCharacter` | `Bracket` | `( ) [ ] { }` (6) |
| `SpecialCharacter` | `Separator` | `. , - _ / \ ; : \|` (9) |
| `SpecialCharacter` | `Quote` | `' " \`` (3) |
| `DataEmptiness` | `Null` | `Is Empty or NULL` only |
| `DataEmptiness` | `Spaces` | `Is Virtually Empty with Spaces` only |
| `DataEmptiness` | `Character` | All 32 "Is Virtually Empty with X" patterns |
| `DataEmptiness` | `Emptiness` *(config alias)* | All three Null + Spaces + Character groups |
| `InvalidKeyword` | `Generic` | All 40 junk keywords |

**`PatternName`** — use for precision, one specific pattern:
```
Has At Sign       Allowed      → @ is OK (email local-part)
Has Hyphen        Allowed      → - is OK (compound names, postal codes)
Has Underscore    Allowed      → _ is OK (email, login names)
Has Full Stop     Allowed      → . is OK (email domain separator)
Has Space         Allowed      → space is OK (city, vendor name, address)
Has Space         Not Allowed  → no spaces (email, country code, postal code)
Is Empty or NULL  Allowed      → field is optional — NULL is accepted
Is Empty or NULL  Not Allowed  → field is mandatory — NULL fails
Has Lowercase Character Not Allowed → enforces ALL-CAPS (country code, vendor code)
Is Fully Numeric  Not Allowed  → value must not be a pure number (name fields)
Is Fully Text     Allowed (*)  → value MUST contain no digits (see DataType rules below)
```

### Allowed vs Not Allowed — two different behaviours

The `IsPatternAllowed` flag works differently depending on the pattern category:

#### One-directional patterns (most categories)

These patterns only ask "is the unwanted thing present?".

| `IsPatternAllowed` | Behaviour |
|---|---|
| `0` / Not Allowed | Check fires → **fail** if the pattern is detected in the value |
| `1` / Allowed | Check is **skipped** — the pattern is acknowledged as acceptable, not asserted |

Applies to: `SpecialCharacter`, `SpaceFound`, `CasingCheck`, `DataEmptiness`, `InvalidKeyword`,
`FullyDuplicatedCharacter`, `UnicodeCharacters`.

> `Allowed` here is a **suppressor**, not a validator. It tells the framework
> "this pattern will appear in this field — don't flag it."

#### Bidirectional patterns (`DataType1`, `DataType2`, `DataType3`)

These check type conformance in both directions:

| `IsPatternAllowed` | Behaviour |
|---|---|
| `0` / Not Allowed | **Fail if value IS this type** — e.g. no pure numbers in a name field |
| `1` / Allowed | **Fail if value IS NOT this type** — enforces the type as mandatory |

Examples:

```
Is Fully Numeric  Not Allowed → vendor_name must not be a pure number → flags "99999", "123"
Is Fully Numeric  Allowed     → raw_phone_number must ONLY be digits → flags "+91 9698..." (has +/space)
Is Fully Text     Allowed     → state/city must contain no digits → flags "3H", "1st Avenue"
Is Date           Not Allowed → email_address cannot look like a date → flags "2024-01-15"
Is AlphaNumeric   Not Allowed → state field rejects codes that contain both letters AND digits
```

#### CasingCheck — one-directional only

`Allowed` has no effect on casing checks. Only `Not Allowed` enforces anything.

| Rule | Effect |
|---|---|
| `Has Lowercase Character` Not Allowed | Enforces ALL-CAPS — any `a-z` found = violation |
| `Has Uppercase Character` Not Allowed | Enforces all-lowercase — any `A-Z` found = violation |
| `Has Lowercase Character` Allowed | No effect (check silently suppressed) |

The check uses binary/case-sensitive matching (`COLLATE Latin1_General_BIN` equivalent).
Accented characters (`é`, `ü`, `ñ`) only trigger the Unicode check, not the casing check —
they are outside the `a-z` / `A-Z` ASCII ranges.

### The override rule in practice

The design pattern across every field: **block broadly at category level, then punch
exceptions at pattern level.**

```
email_address — 14 rules:

Category   SpecialCharacter   Not Allowed  → blocks all 32 special chars
Category   DataType3          Not Allowed  → blocks date/time/timestamp/boolean values
Category   InvalidKeyword     Not Allowed  → blocks "null", "n/a", "test" etc.
Category   FullyDupChar       Not Allowed  → blocks "aaaaaaa"
Category   UnicodeCharacters  Not Allowed  → blocks non-ASCII (é, ü, 田中, шрифт)
SubCat     Emptiness          Not Allowed  → blocks NULL, spaces, all virtually-empty patterns
PatternName Has At Sign        Allowed     → overrides SpecialCharacter block → @ is OK  ✓
PatternName Has Hyphen         Allowed     → overrides → - is OK  ✓
PatternName Has Underscore     Allowed     → overrides → _ is OK  ✓
PatternName Has Full Stop      Allowed     → overrides → . is OK  ✓
PatternName Has Space          Not Allowed → spaces are still blocked (no override)  ✓
PatternName Is Empty or NULL   Allowed     → email is optional — NULL accepted  ✓
```

### Typical starting configs by field type

| Field type | Category blocks | Pattern overrides (Allowed) |
|---|---|---|
| Email address | `SpecialCharacter`, `DataType3`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness` | `Has At Sign`, `Has Full Stop`, `Has Hyphen`, `Has Underscore` |
| Phone (formatted) | `DataType1`, `DataType2`, `DataType3`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters` | `Has Hyphen`, `Has Plus Sign`, `Has Space`, `Has Open Parenthesis`, `Has Close Parenthesis` |
| Phone (digits-only) | `SpecialCharacter`, `DataType3`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness` | `Is Fully Numeric` (Allowed = digits only, mandatory) |
| Postal / ZIP code | `SpecialCharacter`, `DataType3`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness` | `Has Hyphen`, `Has Space` |
| Person name | `DataType1`, `DataType2`, `DataType3`, `SpecialCharacter`, `InvalidKeyword`, `FullyDupChar`, `Emptiness` | `Has Hyphen`, `Has Full Stop`, `Has Single Quote`, `Has Space` |
| Vendor / company name | `DataType3`, `SpecialCharacter`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness` | `Has Space`, `Has Hyphen`, `Has Full Stop` |
| Country code (2-char) | `DataType1`, `DataType2`, `DataType3`, `SpecialCharacter`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness` | `Is Fully Text` (Allowed), add `Has Lowercase Character` Not Allowed for ALL-CAPS |
| Street address | `DataType3`, `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness` | `SpecialChar Separator` (Allowed), `Has Space`, block `Has Exclamation Mark`, `Has Question Mark` |
| Trade / lookup code | `InvalidKeyword`, `FullyDupChar`, `UnicodeCharacters`, `Emptiness`, `100% Numeric` | `SpecialChar-L1`, `SpecialChar-L2`, `Has Semicolon`, `Has Space` |

---

## Custom Validation Rules (L02)

The `configCustomQuery` table stores a `CustomQueryExpression` for each field.
Three options are available.

### Option 1: Plain regex (recommended for non-technical users)

```python
# Must match: value must satisfy the regex to PASS
cfg.add_custom_query_regex(1, "Source.T.EMAIL_ADDRESS",
                           r"^[^@\s]+@[^@\s]+\.[^@\s]+$", must_match=True,
                           description="Basic email format")

# Must NOT match: if the regex matches, the row FAILS
cfg.add_custom_query_regex(2, "Source.T.EMAIL_ADDRESS",
                           r"noemaildress", must_match=False,
                           description="Reject placeholder values")
```

The engine auto-detects regex patterns by the presence of metacharacters
(`^$+*?[({|\`) and applies `re.search()` automatically.

### Option 2: Spark SQL expression

```python
# @InputValue is replaced with the actual column reference at assessment time
cfg.add_custom_query_sql(1, "Source.T.AMOUNT",
                         "CAST(@InputValue AS DOUBLE) > 0",
                         is_condition_allowed=True,
                         description="Amount must be positive")
```

SQL expressions are applied at the DataFrame level via `F.expr()` — not inside a
Python UDF — so they support any Spark SQL built-in function.

### Option 3: Named validator (for complex logic)

Register a Python function once, then reference it by name in any field config:

```python
# Register (once, at cluster startup or in a setup cell)
dq.register_validator("au_mobile", lambda v: bool(__import__('re').match(r'^04[0-9]{8}$', v or '')))

# Reference by name in config — query_id is optional, auto-assigned if omitted
cfg.add_custom_query("Source.T.MOBILE_NUMBER", "au_mobile",
                     is_condition_allowed=True, custom_query_type="PYTHON")
```

Built-in email validators are pre-registered:

| Validator Name | Description |
|---|---|
| `email_basic_format` | Regex: basic `x@y.z` format |
| `email_at_validation` | @ position, count, dot after @ |
| `email_domain_format` | 1–3 dots, no trailing dot, 2+ char labels |
| `email_local_part_dots` | Max 2 dots in local part |
| `email_no_placeholder` | Must not contain 'noemaildress' |
| `email_no_trailing_special` | Must not end with `.`, `-`, or `_` |

---

## Scope Filtering (`1=1` Pattern Preserved)

```python
# All schemas, all tables, all fields
dq.run_assessment()

# Specific schema
dq.run_assessment(schema_name="Curated")

# Specific table
dq.run_assessment(schema_name="Curated", table_name="Individual_Denorm")

# Single field
dq.run_assessment(schema_name="Curated",
                  table_name="Individual_Denorm",
                  field_name="FIRST_NAME")

# Reset previously-failed rows before re-assessment
dq.run_assessment(schema_name="Curated", reset_eligible_flag=True)
```

---

## Reporting

Four helper methods return Spark DataFrames ready for `.display()` or further transformation:

```python
exec_id = dq.run_assessment(schema_name="Curated")

dq.violations(exec_id).display()              # Row-level violations (field, violation type, value)
dq.quality_scores(exec_id).display()          # Pass/fail counts + % per field
dq.fields_below_threshold(threshold=80).display()  # Fields with quality < 80%
dq.summary_by_table(exec_id).display()        # Aggregated quality score per table
```

Two persistent reporting views are also created in the `dq` schema by `setup()`:

| View | Purpose |
|---|---|
| `v_auditDQChecks` | Row-level violations joined with field metadata |
| `v_statDQChecks` | Aggregated pass/fail statistics with `PercentageQualified` |

---

## AI Agent Support

The `dq_framework.agents` module provides three ways to use AI to configure the framework.
**No agent framework (Semantic Kernel, LangChain, AutoGen) is required.**

### Approach A — Auto-suggest from schema (no LLM needed)

```python
from dq_framework.agents import suggest_config

schema = [
    ("EMAIL_ADDRESS", "string"),
    ("FIRST_NAME",    "string"),
    ("POSTAL_CODE",   "string"),
    ("CUSTOMER_ID",   "int"),
]

# Returns ready-to-run ConfigManager code
code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer",
                      curated_schema="Curated", curated_table="Customer_Denorm",
                      catalog="main")
print(code)
```

Or read schema directly from a live Delta table:

```python
schema = spark.table("main.Source.SRC_Customer").dtypes
code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer", ...)
```

### Approach B — Chat assistant with tool calling

The 14 framework operations are exposed as OpenAI-compatible JSON Schema tool definitions.
They work unchanged with OpenAI, Azure OpenAI, Anthropic Claude, Semantic Kernel, AutoGen,
and LangChain.

```python
from dq_framework.agents import SYSTEM_PROMPT, DQ_TOOLS
from dq_framework.agents.tools import execute_tool_call
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="2024-05-01-preview")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": "Configure DQ rules for Source.SRC_Party.EMAIL_ADDRESS"},
    ],
    tools=DQ_TOOLS,
)

for call in response.choices[0].message.tool_calls or []:
    result = execute_tool_call(call.function.name, call.function.arguments, dq, cfg)
    print(f"[Tool] {call.function.name} → {result}")
```

See [notebooks/01_agent_assistant.py](notebooks/01_agent_assistant.py) for Azure OpenAI,
Anthropic Claude, and Semantic Kernel examples.

> **Note:** `notebooks/` contains developer/contributor notebooks. The consumer-facing
> demo is in [sample_usage/](sample_usage/) — run `dq.sample_usage(spark)` for the path.

### Approach C — Paste system prompt into ChatGPT / Copilot

```python
from dq_framework.agents import SYSTEM_PROMPT
print(SYSTEM_PROMPT)  # Copy and paste into any chat interface
```

---

## Testing

### Unit tests — no cluster needed

Run locally with Python and `pytest`. No Databricks, no Spark, no Delta Lake required.

```bash
pip install dq-framework pytest
pytest tests/unit/ -v
```

| Test file | What it covers |
|---|---|
| `test_pattern_checks.py` | All major pattern check functions (DataEmptiness, SpecialChar, InvalidKeyword, …) |
| `test_resolve_pattern_rules.py` | Pattern precedence — specific overrides broad, inactive excluded |
| `test_generate_rule_functions.py` | L01/L02/L04 closure building, `FunctionRegistry`, stale function removal |
| `test_seed_master_data.py` | Seed data integrity: 27 categories, 118 patterns, no duplicate IDs |
| `test_suggest.py` | Schema-based config suggester: column classification, code generation |

### Integration tests — requires a Databricks cluster

Open [notebooks/02_integration_test.py](notebooks/02_integration_test.py) on a Databricks cluster.
It creates throwaway test schemas, runs the full pipeline end-to-end, and drops everything on completion.
For a hands-on consumer demo instead, use the notebooks in [sample_usage/](sample_usage/).

| Test | What it verifies |
|---|---|
| Test 1 | Framework setup — 9 tables created, 27 categories, 118 patterns seeded |
| Test 2 | `setup()` is idempotent — no duplicate data on second call |
| Test 3 | Synthetic curated table created with known good/bad email values |
| Test 4 | `ConfigManager` — all 5 user-managed tables populated |
| Test 5 | `generate_rule_functions()` — checker built and directly invocable |
| Test 6 | `run_assessment()` — DQ columns written correctly, specific rows verified |
| Test 7 | Reporting views — violations count, quality scores, threshold filter, summary |
| Test 8 | `reset_eligible_flag` — failed rows cleared, re-assessment gives same result |
| Test 9 | Custom keyword extensibility via `add_invalid_keyword()` |
| Test 10 | `cfg.verify_config()` passes with no FK issues |

---

## Platform Requirements

| Requirement | Detail |
|---|---|
| Databricks Runtime | 12.0+ (includes PySpark 3.3+) |
| Delta Lake | 2.0+ |
| Python | 3.9+ |
| Unity Catalog | Recommended (pass `catalog=""` for legacy Hive metastore) |
| python-dateutil | Included in Databricks Runtime; required for full date/time validation |

---

## Adding a New Source Field

1. Register the field: `cfg.register_field(id, "Schema.Table.Column", data_category_type_id=N)`
2. Set length bounds: `cfg.set_field_values(id, ffn, min_data_length=M, max_data_length=N)`
3. Add pattern rules: `cfg.block_category(...)`, `cfg.allow_pattern(...)`, `cfg.block_pattern(...)`
4. Add custom expressions: `cfg.add_custom_query_regex(id, ffn, r"your_regex", must_match=True)`
5. (Optional) Add custom invalid keywords: `dq.add_invalid_keyword(id, "your_keyword")`
6. Map to curated column: `cfg.add_mapping(id, ffn, target_schema_name=..., target_table_name=..., ...)`
7. Validate FK integrity: `cfg.verify_config()`
8. Re-run: `dq.generate_rule_functions()`
9. (Optional) Pre-flight SQL validation: `dq.validate_custom_queries_sql()`

---

## Excel Template — Config Authoring Workbook

The workbook (`DBX DQ Framework - Data Model v2.xlsx`, located in `sample_usage/`) is the
recommended way to author and maintain DQ configurations for any non-trivial field set.

### Why use the template?

Configuring DQ rules manually in Python or SQL for 10+ fields — each with dozens of pattern
allow/block rules, length bounds, and custom expressions — is repetitive, error-prone, and
hard to audit. The Excel template solves this by acting as both a **configuration record**
and a **code generator**:

- One row per rule; columns map directly to framework concepts (field name, pattern category,
  allow/block, expression, description, active flag …)
- Built-in formulas generate ready-to-run Python method calls **and** SQL MERGE rows from the
  same data — no manual escaping or formatting required
- The filled-in workbook becomes a human-readable audit trail of what rules are configured,
  why, and when — something that is hard to get from raw code alone
- Non-developers (data stewards, business analysts) can review, edit, and sign off on rules in
  a familiar spreadsheet before anyone runs a single notebook cell

### Sheets and what to fill in

| Sheet | Maps to table | What to fill in |
|---|---|---|
| `masterField` | `masterField` | Field ID, `FullFieldName` (pattern A or B), `DataCategoryTypeID` |
| `configFieldValues` | `configFieldValues` | Per-field min/max data length (L01) and min/max value range (L04) |
| `configFieldAllowedPattern` | `configFieldAllowedPattern` | Pattern allow/block rules — one row per rule (L03) |
| `configCustomQuery` | `configCustomQuery` | Custom regex, Spark SQL, or named validator expressions (L02) |
| `mapDQChecks` | `mapDQChecks` | Field → curated table column mappings |

### Deciding pattern rules for a field (`configFieldAllowedPattern`)

This is the most complex sheet to fill in and the one where the template adds the most value.
See [Configuring Pattern Rules](#configuring-pattern-rules-configfieldallowedpattern) above for
the full guide — three-column selector rules, Allowed vs Not Allowed behaviour for each
category, CasingCheck specifics, the override mechanism, and ready-made starting configs
by field type.

Quick summary for the Excel sheet:
- One row per rule, one of `PatternCategory` / `PatternSubCategory` / `PatternName` populated per row
- Start with broad category blocks, then add PatternName exceptions (Allowed) to punch holes
- `Allowed` on DataType patterns is a **mandatory type assertion** (value MUST be that type)
- `Allowed` on all other categories is a **suppressor** (skip the check for this pattern)
- A typical email field needs ~10–14 rows; a postal code ~6–8; a name field ~8–12

A typical email field ends up with ~10–14 rows in this sheet. A phone number or postal code
is similar. The template makes it easy to copy a field's block as a starting point for a
similar field and adjust.

### Two output columns — M (PySpark) and N (SQL)

Each data sheet has two formula output columns that generate ready-to-paste code:

| Column | Generates | Paste destination |
|---|---|---|
| **M** | `dq.config.<method>(...)` Python call | Directly into a Python notebook cell |
| **N** | `SELECT ... UNION ALL` row | Inside the body of a SQL MERGE template (see below) |

**Column M is the recommended path.** Copy the M cells for your rows, paste into
`02-config-pyspark.py` (or any Python cell), and run. The formula handles all
Databricks-compatible escaping internally — backslash doubling and single-quote escaping
are applied in the correct order so `add_custom_query()` stores the expression correctly.

**Column N (SQL MERGE path)** produces rows for a bulk MERGE statement. The generated rows
must be embedded inside a `r"""..."""` Python raw string — **never** `f"""..."""`:

```python
# Correct: raw string for the SQL body, f-string only for catalog/schema names
tgt   = f"`{MY_CATALOG}`.`{DQ_SCHEMA}`.`configCustomQuery`"
query = (
    r"MERGE INTO " + tgt + r" AS t "
    r"USING (SELECT * FROM (VALUES "
    r"  (1, 'Schema.Table.Field', 'REGEX', '\\d+', ...) "   # ← paste N-column rows here
    r") AS s(...) WHERE _ID IS NOT NULL) ON t._ID = s._ID "
    r"WHEN MATCHED AND (...) THEN UPDATE SET ... "
    r"WHEN NOT MATCHED THEN INSERT ..."
)
spark.sql(query)
```

Why `r"""..."""`? Python f-strings process backslash sequences before Spark SQL ever
sees the string — `\'` becomes `'` silently, breaking `@InputValue` expressions with
`PARSE_SYNTAX_ERROR: Syntax error at or near '@'`. A raw string passes the SQL verbatim.

> If you paste N-column rows **directly into a SQL notebook cell** (not a Python string),
> replace every `''` inside expression values with `\'` — Databricks SQL legacy mode uses
> backslash escaping, not ANSI quote-doubling.

### Escaping rules (handled automatically by the formula)

| In your expression | Formula produces | Stored in table | Spark SQL sees |
|---|---|---|---|
| `\d+` (regex backslash) | `\\d+` | `\\d+` | `\d+` ✓ |
| `O'Brien` (quote) | `O\'Brien` (M) / `O''Brien` (N) | `O'Brien` | `O'Brien` ✓ |

You do **not** need to pre-escape anything before typing into the Excel cells. Fill in the
raw expression (e.g. `^\d{4}$`); the formula handles the rest.

---

## Known Limitations

| # | Area | Description |
|---|---|---|
| 1 | **Bug — `add_custom_query()`** | `CustomQueryDescription` column is `NOT NULL` in the DDL but the method passes `null` when no `description` argument is provided. Always supply a `description` when calling `add_custom_query()`, `add_custom_query_regex()`, or `add_custom_query_sql()` to avoid an INSERT constraint violation. |
| 2 | **Silent pattern skip** | If a `PatternName` in `configFieldAllowedPattern` does not match any row in `masterPattern`, the rule is silently ignored (equivalent to SQL's `ELSE NULL`). A typo in a pattern name will go undetected unless `verify_config()` is run. |
| 3 | **No transaction guarantees** | The assessment writes DQ columns, audit rows, and stat rows in three separate Spark actions. A cluster failure between steps can leave partial results. Re-run the assessment with `reset_eligible_flag=True` to recover. |
| 4 | **Date validation outside Databricks** | If `python-dateutil` is not installed, date/time pattern checks fall back to three hard-coded regex formats. This is rarely an issue on Databricks Runtime but may affect local unit test runs that exercise the date validators. |
| 5 | **No parameterised SQL** | Config is written via f-string SQL. Backslash and single-quote escaping uses Databricks legacy Spark SQL mode (`\` as escape character). Field names and expressions provided by trusted developers are safe; do not pass untrusted user input directly into ConfigManager methods. |

---

*Data Quality Non-Functional Assessment Framework — Databricks Edition*
*Generalised for open-source publication — no client-specific data included*
