Metadata-Version: 2.4
Name: UniCrawler
Version: 0.1.0
Summary: A flexible web crawling framework with browser automation, intelligent parsing, and modular architecture
Author-email: UniCrawler Team <inficonn@proton.me>
License: MIT License
        
        Copyright (c) 2025 UniCrawler Contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Keywords: web-crawling,web-scraping,browser-automation,data-extraction,python,async
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4
Requires-Dist: lxml
Requires-Dist: pydantic
Requires-Dist: aiohttp
Requires-Dist: aiofiles
Requires-Dist: playwright
Requires-Dist: pychrome
Requires-Dist: aiomysql
Requires-Dist: loguru
Requires-Dist: numpy
Requires-Dist: SQLAlchemy
Requires-Dist: pymysql
Requires-Dist: psycopg2-binary
Requires-Dist: pandas
Requires-Dist: openpyxl
Requires-Dist: tenacity
Requires-Dist: async-timeout
Requires-Dist: nest_asyncio
Requires-Dist: orjson
Requires-Dist: ujson
Requires-Dist: python-dotenv
Requires-Dist: dataclasses-json
Requires-Dist: humanfriendly
Requires-Dist: colorama
Requires-Dist: openai
Requires-Dist: Pillow
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pylint>=2.16.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Provides-Extra: browser-use
Requires-Dist: browser-use>=1.0.0; extra == "browser-use"
Provides-Extra: llm
Requires-Dist: openai>=1.0.0; extra == "llm"
Provides-Extra: all
Requires-Dist: UniCrawler[dev]; extra == "all"
Requires-Dist: UniCrawler[browser-use]; extra == "all"
Requires-Dist: UniCrawler[llm]; extra == "all"
Dynamic: license-file

# UniCrawler: The AI-Native & Natural Language Driven Web Crawler 🕷️✨

> **Stop writing selectors. Start describing data.**
> UniCrawler redefines web scraping by combining browser automation with Large Language Model (LLM) intelligence.

## Why UniCrawler? 🚀

Traditional crawlers break when a `div` moves. UniCrawler understands the page like a human does.

- **🗣️ Natural Language Driven**: **No more CSS/XPath.** Just tell UniCrawler what you want (e.g., "Get all product prices and titles"), and it translates your intent into executable actions.
- **🧠 AI-Powered Parsing**: Forget regex. Our LLM-based parsers extract structured data from messy HTML, automatically handling missing fields and normalization.
- **👀 Visual Intelligence**: Uses computer vision and DOM analysis to interact with dynamic pages (Amazon, TikTok, Alibaba, etc.) just like a real user.

## UniCrawler vs. Traditional Frameworks ⚔️

| Feature | Traditional (Scrapy, Selenium...) | UniCrawler |
| :--- | :--- | :--- |
| **Configuration** | Complex CSS/XPath selectors | Natural Language Descriptions |
| **Maintenance** | Breaks on minor layout changes | **Self-Healing**: Adapts to UI updates |
| **Parsing** | Rigid rules & Regex | **Semantic Extraction**: Understands context |
| **Anti-Scraping** | Manual header/proxy management | Human-like behavior simulation |
| **Learning Curve** | Steep (requires HTML/JS knowledge) | **Low**: Focus on business logic |

## Core Features

### 1. Intelligent Crawler Module
*   **Hybrid Engine**: Switches between the Chrome DevTools Protocol (CDP), for dynamic rendering and getting past anti-scraping mechanisms, and lightweight HTTP requests, for speed.
*   **Semantic Actions**: Operators work by understanding the element's purpose (e.g., "Next Page", "Add to Cart"), making your scripts robust to layout changes.
*   **Strategic Collection**: Built-in strategies for snapshots, pagination, and incremental updates.

### 2. Smart Data Parsing
*   **Flexible Pathways**: Supports both structured data filtering and smart extraction from unstructured pages.
*   **LLM-Assisted**: Uses Large Language Models to clean data, fill missing fields, and normalize formats automatically.
*   **Multi-Format**: Accepts JSON, HTML, or DOM fragments.

### 3. One-Click Storage & Expansion
*   **Auto-Schema**: Automatically detects data structures and creates database tables on the fly.
*   **Unified Interface**: No need to manage ORM or connection pools—just call `writer.write_to_db`.
*   **Evolvable Architecture**: Decoupled Crawler, Parser, and Writer modules allow for easy extension without breaking core logic.
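
To make the Auto-Schema idea concrete, here is a minimal sketch of how a table definition could be inferred from parsed records. It is not UniCrawler's actual implementation: the function names `infer_columns` and `create_table_sql`, the type mapping, and the widening rule (conflicting types fall back to `TEXT`) are all assumptions chosen for illustration.

```python
from typing import Any

# Hypothetical mapping from Python value types to PostgreSQL column types.
_PG_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}


def infer_columns(records: list[dict[str, Any]]) -> dict[str, str]:
    """Guess one column type per key seen across all records."""
    columns: dict[str, Any] = {}
    for record in records:
        for key, value in record.items():
            if value is None:
                # Remember the column, but let a later non-null value decide its type.
                columns.setdefault(key, None)
                continue
            # Lists, dicts, and other unmapped types are stored as JSONB.
            pg_type = _PG_TYPES.get(type(value), "JSONB")
            seen = columns.get(key)
            if seen is None:
                columns[key] = pg_type
            elif seen != pg_type:
                # Conflicting types across records: widen to TEXT.
                columns[key] = "TEXT"
    # Columns that only ever held None default to TEXT.
    return {name: pg_type or "TEXT" for name, pg_type in columns.items()}


def _quote(identifier: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def create_table_sql(table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement from the inferred columns."""
    cols = ", ".join(f"{_quote(name)} {pg_type}" for name, pg_type in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {_quote(table)} ({cols})"
```

Calling `create_table_sql("products", infer_columns(parsed_items))` would yield a statement that a writer could execute before inserting rows. A real implementation would also need to handle schema changes between runs.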

## Quick Start

### Python Version

* Required: Python 3.9 or later

### Installation

Install the latest release from the official PyPI index:

```bash
pip install unicrawler
```

### Start the Browser (CDP Debugging Port)

You have two options: use the built-in CLI (recommended) or the provided scripts.

#### Option A: Built-in CLI (Cross-Platform)

Install the package, then run:

```powershell
# Windows PowerShell
unicrawler-start-chrome --port 9222 --profile .\chrome_cdp_profile --headless
```

```bash
# Linux / macOS
unicrawler-start-chrome --port 9222 --profile ./chrome_cdp_profile --headless
```

CLI flags:
- `--port`: remote debugging port (default `9222`)
- `--profile`: user data dir (default `./chrome_cdp_profile`)
- `--headless` / `--no-headless`: force headless or non-headless mode; by default, Windows runs non-headless and Linux/macOS run headless
- `--incognito`: start in incognito mode
- `--chrome-path`: explicit chrome executable if auto-detection fails

Check the port after startup:

```text
http://127.0.0.1:9222/json/version
```
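
The same check can be scripted. The sketch below uses only the Python standard library to query Chrome's `/json/version` endpoint and returns `None` if the port is not answering. The helper names `cdp_version_url` and `cdp_version` are illustrative and are not part of the UniCrawler API.

```python
import http.client
import json
import urllib.request
from typing import Optional


def cdp_version_url(host: str = "127.0.0.1", port: int = 9222) -> str:
    """Build the URL of Chrome's DevTools version endpoint."""
    return f"http://{host}:{port}/json/version"


def cdp_version(host: str = "127.0.0.1", port: int = 9222,
                timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed /json/version payload, or None if the port is not ready."""
    # An empty ProxyHandler keeps system proxy settings from intercepting localhost calls.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(cdp_version_url(host, port), timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return None
    return payload if isinstance(payload, dict) else None


info = cdp_version()
if info is None:
    print("CDP port is not reachable yet")
else:
    # "Browser" and "webSocketDebuggerUrl" are fields Chrome reports on this endpoint.
    print(info.get("Browser"), info.get("webSocketDebuggerUrl"))
```

Polling this function in a short loop after launching the browser is one way to wait until the debugging port is ready before starting a crawl.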

Programmatic API:

```python
from unicrawler import start_chrome_cdp

# auto headless by platform (Windows: False; Linux/macOS: True)
proc = start_chrome_cdp(port=9222, profile_dir="./chrome_cdp_profile", incognito=False)
```

#### Option B: Provided Scripts

```powershell
powershell -ExecutionPolicy Bypass -File scripts\start_chrome_cdp.ps1 -Port 9222 -ProfileDir .\chrome_cdp_profile
```

```bash
bash scripts/start_chrome_cdp.sh --port 9222 --profile-dir ./chrome_cdp_profile --headless
```

### Usage (Amazon Example)

This section walks through a complete crawl, parse, and store run against Amazon search results.

**Step 1**: Start the Browser (Mandatory)
Follow the previous section to start the browser and ensure the debugging port is ready.

**Step 2**: Run the Amazon Example (Crawl + Parse + Store)
Example script: `use_test.py`

```bash
# Run the example (Windows / Linux / macOS compatible)
python use_test.py
```

The example performs the following steps:

1. Collects Amazon search results for the keyword “table” (2 pages, up to 10 results).
2. Parses and filters structured fields like image links and titles.
3. Displays the top results in the console.
4. Saves the complete results to `result.json`.
5. Writes the data to the database (make sure the database connection is configured correctly first).

### Python Example Code

```python
from unicrawler import crawler, parser, writer
from unicrawler.config import PostgreSQLConfig
import json

# Amazon search configuration
url = "https://www.amazon.com/s?k=table"
what = "image links, titles"  # <--- Natural Language Description: No CSS selectors needed!

# Crawl data
crawl_result = crawler.crawl(url, what=what, page_limit=2, item_limit=10)
raw_items = crawl_result.pages if crawl_result and crawl_result.pages else []

# Parse data
parsed_items = parser.parse(raw_items, what=what, mode="auto")

# Print results
print(f"\nCollected {len(parsed_items)} items\n")
for i, item in enumerate(parsed_items[:5], 1):
    print(f"[{i}] {item}")

# Save as JSON
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(parsed_items, f, ensure_ascii=False, indent=2)
print("\nData saved to result.json")

# Database configuration
db_config = PostgreSQLConfig(
    host="localhost",
    port=5432,
    db="testdbforunicrawler",
    table="products",
    user="postgres",
    password="yourpassword",
    schema="public"
)

# Write to database
rows_written = writer.write_to_db(parsed_items, db_config=db_config)
print(f"Successfully wrote {rows_written} rows to the database")
```

### Database Configuration (Required to Enable Database Storage)

Fill in your database connection details in the `PostgreSQLConfig` block of the `use_test.py` script:

* `host`, `port`, `db`, `schema`
* `user`, `password`
* `table` (naming it after the site is suggested, to avoid mixing data from different sources)

Once configured, the script collects and parses the data, stores it in the database, and prints the number of rows written.
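
To keep credentials out of the script, the connection details can be read from environment variables. The sketch below is one possible approach, not a UniCrawler feature: the `UNICRAWLER_DB_*` variable names and the `db_settings_from_env` helper are hypothetical, and it assumes `PostgreSQLConfig` accepts the same keyword arguments shown in the example above.

```python
import os
from typing import Any, Mapping


def db_settings_from_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Collect PostgreSQL settings from (hypothetical) UNICRAWLER_DB_* variables.

    The database name and password are mandatory and raise KeyError when missing;
    the other values fall back to the defaults used in the example script.
    """
    return {
        "host": env.get("UNICRAWLER_DB_HOST", "localhost"),
        "port": int(env.get("UNICRAWLER_DB_PORT", "5432")),
        "db": env["UNICRAWLER_DB_NAME"],
        "table": env.get("UNICRAWLER_DB_TABLE", "products"),
        "user": env.get("UNICRAWLER_DB_USER", "postgres"),
        "password": env["UNICRAWLER_DB_PASSWORD"],
        "schema": env.get("UNICRAWLER_DB_SCHEMA", "public"),
    }


# Usage, assuming the keyword names from the example script:
#   from unicrawler.config import PostgreSQLConfig
#   db_config = PostgreSQLConfig(**db_settings_from_env())
```

Failing fast on a missing password or database name surfaces configuration mistakes before any crawling starts.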
