Metadata-Version: 2.4
Name: browser-act-cli-lite
Version: 0.1.0
Summary: A stateless, parallel-safe, anti-detection CLI tool for extracting rendered web page content.
Author-email: BrowserAct <service@browseract.com>
License: MIT
Project-URL: Homepage, https://www.browseract.com
Keywords: browser,scraper,cli,anti-detection,stealth,markdown
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: camoufox
Requires-Dist: click
Requires-Dist: langchain-text-splitters
Requires-Dist: markdownify==1.2.2
Dynamic: requires-python

# browser-act-lite

Stateless, parallel-safe, anti-detection CLI tool for extracting rendered web page content.

Built on the [Camoufox](https://github.com/daijro/camoufox) stealth browser. Each invocation launches a fresh browser instance with a unique fingerprint, extracts the fully rendered DOM (including iframes), and outputs clean HTML or Markdown.

## Features

- **Anti-detection** — Camoufox fingerprint rotation, headless stealth mode
- **Iframe extraction** — Recursively captures iframe contents and merges them into the output
- **DOM cleanup** — Strips hidden elements, inline styles, scripts, and SVG noise
- **Markdown conversion** — DOM → Markdown with absolute URL rewriting and heading-based chunking
- **Proxy support** — HTTP/SOCKS proxy with optional authentication
- **Parallel-safe** — Stateless design, safe to run multiple instances concurrently
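
To illustrate what heading-based chunking means in practice, here is a minimal standard-library sketch. It is not the package's implementation (the dependency list suggests `langchain-text-splitters` is used for this); the function name `split_by_headings` is hypothetical, and the sketch does not special-case `#` lines inside fenced code blocks.

```python
import re

# ATX headings such as "# Title" or "### Subsection" at the start of a line.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks that each begin at a heading.

    Hypothetical helper for illustration only. Text before the first
    heading, if any, becomes its own chunk.
    """
    starts = [m.start() for m in HEADING_RE.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


doc = "intro\n\n# A\ntext a\n\n## B\ntext b\n"
chunks = split_by_headings(doc)
```

Each chunk keeps its heading line, so a downstream consumer can tell which section a piece of text came from.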

## Requirements

- Python >= 3.10
- macOS / Linux / Windows

## Installation

```bash
pip install -e .
```

On first run, the stealth browser engine is downloaded automatically.

## Usage

### Extract as HTML

```bash
browser-act-lite stealth-extract https://example.com -f html
```

### Extract as Markdown

```bash
browser-act-lite stealth-extract https://example.com -f markdown
```

### Save to file

```bash
browser-act-lite stealth-extract https://example.com -f markdown -o
```

The `-o` flag takes no argument; output is written to `outputs/<hostname>_<timestamp>.md` instead of stdout.

### With proxy

```bash
browser-act-lite stealth-extract https://example.com -f html -p http://user:pass@host:port
```
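
The proxy URL packs scheme, credentials, host, and port into one string. The sketch below shows how such a URL can be split apart with the standard library. The function name `parse_proxy` is hypothetical, and the resulting dictionary shape (`server`, `username`, `password`) mirrors Playwright's proxy settings; whether the tool uses that shape internally is an assumption.

```python
from urllib.parse import unquote, urlsplit


def parse_proxy(proxy_url: str) -> dict:
    """Split a proxy URL into server and optional credentials.

    Hypothetical helper for illustration; not taken from the package.
    """
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a valid proxy URL: {proxy_url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    settings = {"server": server}
    # Credentials may be percent-encoded in the URL (e.g. "@" as "%40").
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


with_auth = parse_proxy("http://user:p%40ss@proxy.example:8080")
without_auth = parse_proxy("socks5://proxy.example:1080")
```

Percent-encoding matters when a password contains characters such as `@` or `:`, which would otherwise be read as URL delimiters.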

### Options

```
Usage: browser-act-lite stealth-extract [OPTIONS] URL

Options:
  -f, --format [html|markdown]  Output format (required)
  -p, --proxy TEXT              Proxy URL, e.g. http://user:pass@host:port
  -t, --timeout INTEGER         Page load timeout in seconds [default: 30]
  -o, --output                  Save to outputs/ directory instead of stdout
  --help                        Show this message and exit
```
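
Because each invocation is stateless, several extractions can be driven concurrently from a script. The following is a rough sketch using `subprocess` and a thread pool; it assumes `browser-act-lite` is on `PATH`, and the helper names (`build_command`, `extract`, `extract_many`) are hypothetical. Only the flags listed above are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url: str, fmt: str = "markdown", timeout: int = 30) -> list[str]:
    """Assemble the CLI invocation using the documented flags."""
    return ["browser-act-lite", "stealth-extract", url, "-f", fmt, "-t", str(timeout)]


def extract(url: str) -> str:
    """Run one extraction and return what the CLI printed to stdout."""
    result = subprocess.run(
        build_command(url), capture_output=True, text=True, check=True
    )
    return result.stdout


def extract_many(urls: list[str], workers: int = 4) -> dict[str, str]:
    """Run several extractions at once, one CLI process per URL."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(extract, urls)))


command = build_command("https://example.com")
```

Threads are enough here because the work happens in the child processes; each one launches its own browser instance, so `workers` should be sized to the memory available.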

## Project Structure

```
src/browser_act_lite/
├── cli.py              # Click CLI entry point
├── extractor.py        # Core extraction: launch browser → navigate → extract
├── engine.py           # Stealth browser engine config & monkey-patches
└── pipeline/
    ├── __init__.py     # html_to_markdown / markdown_split
    ├── dom_filter.py   # DOM evaluation & iframe extraction (Playwright)
    ├── converter.py    # Markdownify customisation
    ├── url.py          # URL absolutification
    └── js/
        └── dom_html.js # In-page JS for DOM serialisation
```
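
As an illustration of the URL absolutification step, the sketch below rewrites relative link and image targets in Markdown against a base URL. It is an assumption about the general approach, not the contents of `url.py`; the name `absolutize_links` and the regular expression are hypothetical, and only inline `[text](target)` and `![alt](target)` forms are handled.

```python
import re
from urllib.parse import urljoin

# Inline Markdown links and images: [text](target) or ![alt](target).
# Group 1 is everything up to and including "(", group 2 is the target.
LINK_RE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)")


def absolutize_links(markdown: str, base_url: str) -> str:
    """Resolve relative link targets against base_url.

    Hypothetical helper for illustration. Targets that are already
    absolute pass through urljoin unchanged.
    """

    def repl(match: re.Match) -> str:
        return match.group(1) + urljoin(base_url, match.group(2))

    return LINK_RE.sub(repl, markdown)


source = "[a](/x) ![i](img/p.png) [b](https://o.org/y)"
rewritten = absolutize_links(source, "https://example.com/docs/page")
```

Resolving against the page URL rather than the site root keeps document-relative paths such as `img/p.png` pointing at the right directory.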

## License

MIT
