Metadata-Version: 2.4
Name: crua
Version: 1.0.6
Summary: Whitelist-based crawler and browser detection via user agent strings.
Author-email: TN3W <tn3w@protonmail.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/tn3w/crua
Project-URL: Repository, https://github.com/tn3w/crua.git
Project-URL: Issues, https://github.com/tn3w/crua/issues
Keywords: user-agent,crawler,bot-detection,browser,whitelist,parser
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Security
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: black>=25.1.0; extra == "dev"
Requires-Dist: pytest>=8.4.1; extra == "dev"
Requires-Dist: pytest-cov>=6.2.1; extra == "dev"
Dynamic: license-file

<p align="center"><img src="https://raw.githubusercontent.com/tn3w/crua/screenshot/crua.webp" alt="cr-ua - fast crawler detection for Python."></p>

<p align="center">
  <a href="https://pypi.org/project/crua/">
    <img src="https://img.shields.io/pypi/v/crua?style=for-the-badge" alt="PyPI Version">
  </a>
  <a href="https://github.com/tn3w/crua/actions/workflows/publish.yml">
    <img src="https://img.shields.io/github/actions/workflow/status/tn3w/crua/publish.yml?label=Publish&style=for-the-badge" alt="GitHub Workflow Status">
  </a>
  <a href="https://github.com/tn3w/crua/blob/master/LICENSE">
    <img src="https://img.shields.io/pypi/l/crua?style=for-the-badge" alt="License">
  </a>
</p>

<p align="center">
  <a href="https://github.com/tn3w/crua/tree/master/tests/fixtures">
    <img src="https://img.shields.io/badge/fixture_checks-98,472-0f766e?style=for-the-badge" alt="Fixture Checks">
  </a>
  <a href="https://github.com/tn3w/crua/tree/master/tests/fixtures">
    <img src="https://img.shields.io/badge/fixture_examples-33,964-1d4ed8?style=for-the-badge" alt="Fixture Examples">
  </a>
  <a href="https://github.com/tn3w/crua/blob/master/tests/fixtures/crawler_names.json">
    <img src="https://img.shields.io/badge/crawler_families-448-7c3aed?style=for-the-badge" alt="Crawler Families">
  </a>
</p>

<h3 align="center">Fast, whitelist-based crawler detection and browser parsing for Python.</h3>

<p align="center">
  CRUA keeps the public API intentionally small: use <code>is_crawler()</code> for a fast yes/no decision, or <code>parse()</code> to extract crawler, browser, engine, OS, and device metadata from the same user agent string.
  <br><br>
  Under the hood it uses Python regexes directly. If <code>re2</code> is installed, CRUA will prefer it automatically; otherwise it falls back to Python's built-in <code>re</code> module.
</p>

## Overview

- `is_crawler()` returns a boolean for crawler detection.
- `parse()` returns structured crawler and browser metadata.
- The public API is exposed from `crua`.
- The test suite uses fixture datasets from `tests/fixtures`.

## Test Data

The repository includes the following fixture coverage:

- `15,864` browser user agents expected to remain non-crawlers
- `1,248` crawler user agents expected to be detected as crawlers
- `16,015` browser parse rows checked for browser, version, engine, OS, and device fields
- `448` named crawler families covering `837` crawler instances

In total, the test suite uses `33,964` fixture examples (the sum of the three user agent counts above plus the `837` crawler instances) and `98,472` fixture-backed checks.

## Requirements

- Python ≥ 3.9

## Installation

```bash
pip install crua
```

Optional faster regex backend:

```bash
pip install google-re2
```
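The automatic preference for `re2` is the kind of behaviour usually built on an optional import. The following is a minimal sketch of that pattern, not CRUA's actual source. It assumes the `google-re2` package installs a module named `re2` with an `re`-like `compile`/`search` interface; `regex_backend`, `BACKEND_NAME`, `BOT_HINT`, and `has_bot_hint` are illustrative names only.

```python
# Optional-import fallback: prefer re2 when it is installed, else stdlib re.
# Illustrative only; this is not how CRUA is necessarily implemented.
try:
    import re2 as regex_backend  # assumed to come from `pip install google-re2`
    BACKEND_NAME = "re2"
except ImportError:
    import re as regex_backend
    BACKEND_NAME = "re"

# The inline (?i) flag keeps the pattern portable across both backends.
BOT_HINT = regex_backend.compile(r"(?i)bot|crawl|spider")


def has_bot_hint(user_agent: str) -> bool:
    """Return True if the UA contains one of the hint keywords."""
    return BOT_HINT.search(user_agent) is not None
```

Because the choice happens once at import time, calling code does not need to know which backend is active.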

For development:

```bash
pip install -e ".[dev]"
pytest
```

## Public API

```python
from crua import BrowserInfo, CrawlerInfo, UserAgent, is_crawler, parse
```

| API                                                                      | Description                                                   | Returns     |
| ------------------------------------------------------------------------ | ------------------------------------------------------------- | ----------- |
| `is_crawler(user_agent: str)`                                            | Detect whether a user agent should be classified as a crawler | `bool`      |
| `parse(user_agent: str, *, crawlers: bool = True, browser: bool = True)` | Parse a user agent into structured crawler and browser data   | `UserAgent` |

Public types:

- `UserAgent`: top-level parse result
- `CrawlerInfo`: crawler fields
- `BrowserInfo`: browser, engine, OS, and device fields

Import from `crua`. Names prefixed with `_` are internal.

## Usage

```python
from crua import is_crawler, parse

result = parse("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

result.is_crawler               # False
result.browser.browser          # "Chrome"
result.browser.browser_version  # "120.0.0.0"
result.browser.os               # "Windows"
result.browser.os_version       # "10/11"
result.browser.device           # "Desktop"
result.browser.engine           # "AppleWebKit"
result.browser.rendering        # "KHTML, like Gecko"

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"
is_crawler(ua)  # True

result = parse(ua, crawlers=True, browser=False)   # skip browser parsing
result = parse(ua, crawlers=False, browser=True)   # skip crawler detection
```
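A common use of the yes/no check is separating crawler traffic from everything else, for example when processing access logs. The sketch below is a hypothetical helper, not part of CRUA: `split_traffic` takes any callable with the same `str -> bool` shape as `is_crawler()`, so it can be tried without the package installed.

```python
from typing import Callable, Iterable, List, Tuple


def split_traffic(
    user_agents: Iterable[str],
    detect: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Partition UA strings into (crawlers, others) using a detector callable."""
    crawlers: List[str] = []
    others: List[str] = []
    for ua in user_agents:
        # Append to whichever bucket the detector selects.
        (crawlers if detect(ua) else others).append(ua)
    return crawlers, others
```

With CRUA installed, the detector would simply be `is_crawler`, as in `split_traffic(lines, is_crawler)`.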

## Data Classes

### `UserAgent`

| Field        | Type                  | Description                                 |
| ------------ | --------------------- | ------------------------------------------- |
| `raw`        | `str`                 | Original UA string                          |
| `is_crawler` | `bool`                | Whether UA is a crawler/bot                 |
| `crawler`    | `CrawlerInfo \| None` | Crawler details (only if `is_crawler=True`) |
| `browser`    | `BrowserInfo \| None` | Browser/OS/device info                      |

### `CrawlerInfo`

| Field     | Type          | Description                       |
| --------- | ------------- | --------------------------------- |
| `name`    | `str \| None` | Crawler name (e.g. `"Googlebot"`) |
| `version` | `str \| None` | Crawler version                   |
| `url`     | `str \| None` | Info URL embedded in the UA       |

### `BrowserInfo`

| Field             | Type          | Description                                                        |
| ----------------- | ------------- | ------------------------------------------------------------------ |
| `product_token`   | `str \| None` | First product token (e.g. `"Mozilla/5.0"`)                         |
| `comment`         | `str \| None` | First parenthesised comment block                                  |
| `engine`          | `str \| None` | Layout engine: `AppleWebKit`, `Gecko`, `Trident`                   |
| `engine_version`  | `str \| None` | Engine version string                                              |
| `browser`         | `str \| None` | Browser name: Chrome, Firefox, Safari, Edge, …                     |
| `browser_version` | `str \| None` | Browser version string                                             |
| `os`              | `str \| None` | OS name: Windows, macOS, iOS, iPadOS, Android, Linux, …            |
| `os_version`      | `str \| None` | OS version string                                                  |
| `device`          | `str \| None` | `Desktop`, `Mobile`, `Tablet`, `SmartTV`, `Console`, or `Embedded` |
| `rendering`       | `str \| None` | Rendering hint (e.g. `"KHTML, like Gecko"`)                        |
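
Since `crawler`, `browser`, and most of their fields can be `None`, calling code has to guard each access. The sketch below uses stand-in dataclasses that mirror the field tables above so it runs without CRUA installed. The `None` defaults and the `describe` helper are assumptions made for illustration, and the real classes may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlerInfo:  # stand-in mirroring the documented fields
    name: Optional[str] = None
    version: Optional[str] = None
    url: Optional[str] = None


@dataclass
class BrowserInfo:  # stand-in mirroring the documented fields
    product_token: Optional[str] = None
    comment: Optional[str] = None
    engine: Optional[str] = None
    engine_version: Optional[str] = None
    browser: Optional[str] = None
    browser_version: Optional[str] = None
    os: Optional[str] = None
    os_version: Optional[str] = None
    device: Optional[str] = None
    rendering: Optional[str] = None


@dataclass
class UserAgent:  # stand-in mirroring the documented fields
    raw: str
    is_crawler: bool
    crawler: Optional[CrawlerInfo] = None
    browser: Optional[BrowserInfo] = None


def describe(ua: UserAgent) -> str:
    """Build a short label, guarding every optional field."""
    if ua.is_crawler and ua.crawler is not None:
        return f"crawler: {ua.crawler.name or 'unknown'}"
    if ua.browser is not None and ua.browser.browser:
        version = ua.browser.browser_version or "?"
        return f"browser: {ua.browser.browser} {version}"
    return "unknown"
```

`describe` only reads attributes, so the same function should also accept a real `parse()` result, provided its fields match the tables.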

## Detection Model

CRUA classifies a user agent as a crawler based on a whitelist-oriented browser check plus additional crawler heuristics.

Signals used by the detector include:

- bot keywords such as `bot`, `crawl`, `spider`, `scrape`, `fetch`, or `scan`
- embedded contact URLs or email addresses inside the UA
- browser-crawler markers such as `Chrome-Lighthouse`, `AppInsights`, `360Spider`, or `moatbot`
- suspicious compat, comment, or suffix patterns that do not match known browser tokens
- missing or inconsistent browser and platform signatures
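
The first three signals can be approximated with a few regular expressions. The following is a deliberately simplified sketch, not CRUA's actual rule set: the patterns, the `toy_crawler_signals` name, and the signal labels are all assumptions made for illustration, and the whitelist and consistency checks are left out entirely.

```python
import re
from typing import List

# Keywords and markers are taken from the list above; the regexes are a guess.
KEYWORDS = re.compile(r"bot|crawl|spider|scrape|fetch|scan", re.IGNORECASE)
CONTACT = re.compile(r"https?://|[\w.+-]+@[\w-]+\.[\w.-]+", re.IGNORECASE)
MARKERS = ("chrome-lighthouse", "appinsights", "360spider", "moatbot")


def toy_crawler_signals(user_agent: str) -> List[str]:
    """Return the names of the toy signals that fire for this UA."""
    signals: List[str] = []
    if KEYWORDS.search(user_agent):
        signals.append("keyword")
    if CONTACT.search(user_agent):
        signals.append("contact")
    lowered = user_agent.lower()
    if any(marker in lowered for marker in MARKERS):
        signals.append("marker")
    return signals
```

A plain substring match like this can misfire on a legitimate browser whose user agent happens to contain one of the keywords, which is one reason to combine such signals with a browser whitelist instead of relying on them alone.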

## Supported Browsers

Chrome, Firefox, Safari, Edge (desktop/Android/iOS), Opera (OPR), Samsung Browser,
UC Browser, Yandex Browser, Google App (GSA), Chrome iOS (CriOS), Firefox iOS (FxiOS),
Huawei Browser, Amazon Silk, Brave.

## Supported OS / Platforms

Windows (XP through 10/11), Windows Mobile, macOS, iOS, iPadOS, Android,
Linux (Ubuntu, Fedora, Debian, CentOS, Arch, Mint, SUSE, Red Hat, Gentoo, Kali).

Device values include `Desktop`, `Mobile`, `Tablet`, `SmartTV`, `Console`, and `Embedded`.

## Development

```bash
pip install -e ".[dev]"
pytest
```

## Formatting

```bash
pip install black isort
isort . && black .
npx prtfm
```
