Metadata-Version: 2.4
Name: mcp-data-extractor
Version: 0.1.6.1
Summary: A commercial-grade MCP Server built on FastMCP, offering robust capabilities to read, extract, and localize (into Markdown) content from web pages and PDFs with both text and images. It is purpose-built for long-term deployment in enterprise environments.
Author-email: Aurelius <aureliusshu@gmail.com>
License-Expression: MIT
Keywords: mcp,web-crawler,scrapy,pdf,markdown
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastmcp>=2.11.0
Requires-Dist: scrapy>=2.11.0
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=5.0.0
Requires-Dist: requests>=2.32.0
Requires-Dist: selenium>=4.20.0
Requires-Dist: playwright>=1.45.0
Requires-Dist: undetected-chromedriver>=3.5.0
Requires-Dist: fake-useragent>=1.5.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pydantic>=2.8.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: twisted>=24.7.0
Requires-Dist: ruff>=0.12.12
Requires-Dist: markitdown>=0.0.1a2
Requires-Dist: pypdf>=5.0.0
Requires-Dist: pymupdf>=1.26.4
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: pytest-html>=4.1.0; extra == "dev"
Requires-Dist: pytest-json-report>=1.5.0; extra == "dev"
Requires-Dist: pytest-mock>=3.14.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.6.0; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"
Requires-Dist: pre-commit>=3.8.0; extra == "dev"
Requires-Dist: coverage[toml]>=7.6.0; extra == "dev"
Dynamic: license-file

Data Extractor is a commercial-grade MCP Server built on FastMCP, offering robust capabilities to read, extract, and localize (into Markdown) content from web pages and PDFs with both text and images. It is purpose-built for long-term deployment in enterprise environments.

## 🛠️ MCP Server Core Tools (14)

### Web Page

| Tool                                   | Description                    | Key Parameters |
| -------------------------------------- | ------------------------------ | -------------- |
| **scrape_webpage**                     | Scrape a single page           | `url`, `method` (auto-selected), `extract_config` (selector configuration), `wait_for_element` (CSS selector) |
| **scrape_multiple_webpages**           | Scrape multiple pages in batch | `urls` (list), `method` (applied to all URLs), `extract_config` (global configuration) |
| **scrape_with_stealth**                | Anti-detection scraping        | `url`, `method` (selenium/playwright), `scroll_page` (scroll to load content), `wait_for_element` |
| **fill_and_submit_form**               | Form automation                | `url`, `form_data` (selector: value), `submit` (whether to submit), `submit_button_selector` |
| **extract_links**                      | Specialized link extraction    | `url`, `filter_domains` (domains to include), `exclude_domains` (domains to exclude), `internal_only` (internal links only) |
| **extract_structured_data**            | Structured data extraction     | `url`, `data_type` (all/contact/social/content/products/addresses) |
| **get_page_info**                      | Page information retrieval     | `url` (target URL); returns the title, status code, and metadata |
| **check_robots_txt**                   | Crawl rule check               | `url` (domain URL); checks the robots.txt rules |
| **convert_webpage_to_markdown**        | Page to Markdown conversion    | `url`, `method`, `extract_main_content` (extract main content), `embed_images` (embed images), `formatting_options` |
| **batch_convert_webpages_to_markdown** | Batch Markdown conversion      | `urls` (list), `method`, `extract_main_content`, `embed_images`, `embed_options` |
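
MCP clients invoke these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a message for `scrape_webpage` and does not contact a server. The helper name `build_tool_call`, the example URL, the `"auto"` method value, and the `"main"` selector are illustrative assumptions, not values confirmed by this project's documentation.

```python
import json
from typing import Any


def build_tool_call(tool_name: str, arguments: dict[str, Any], request_id: int = 1) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request, ensure_ascii=False)


# Hypothetical arguments: the URL, the "auto" method value, and the
# CSS selector are placeholders chosen for illustration.
message = build_tool_call(
    "scrape_webpage",
    {
        "url": "https://example.com",
        "method": "auto",
        "wait_for_element": "main",
    },
)
print(message)
```

In practice an MCP client library usually builds and sends this message for you; the raw form is shown only to make the tool name and parameter mapping explicit.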

### PDF Document

| Tool                               | Description                   | Key Parameters |
| ---------------------------------- | ----------------------------- | -------------- |
| **convert_pdf_to_markdown**        | PDF to Markdown conversion    | `pdf_source` (URL or path), `method` (auto/pymupdf/pypdf), `page_range`, `output_format` |
| **batch_convert_pdfs_to_markdown** | Batch PDF conversion          | `pdf_sources` (list), `method`, `page_range`, `output_format`, `include_metadata` |
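
Because `method` accepts a fixed set of values, a caller may want to validate it before sending a request. The following is a minimal client-side sketch, not part of the package: the helper `pdf_arguments`, its `"auto"` default, the file name, and the `output_format="markdown"` value are all assumptions made for illustration.

```python
from typing import Any

# Method names as listed in the table above.
PDF_METHODS = {"auto", "pymupdf", "pypdf"}


def pdf_arguments(pdf_source: str, method: str = "auto", **extra: Any) -> dict[str, Any]:
    """Build arguments for `convert_pdf_to_markdown`, rejecting unknown methods."""
    if method not in PDF_METHODS:
        raise ValueError(f"method must be one of {sorted(PDF_METHODS)}, got {method!r}")
    return {"pdf_source": pdf_source, "method": method, **extra}


# The file name is a placeholder; output_format="markdown" is an assumed value.
args = pdf_arguments("report.pdf", method="pymupdf", output_format="markdown")
print(args)
```

Failing early on the client side avoids a round trip to the server for a request that could not succeed.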

### Service Management

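| Tool                   | Description                    | Key Parameters |
| ---------------------- | ------------------------------ | -------------- |
| **get_server_metrics** | Performance metrics monitoring | None; returns request statistics, performance metrics, and cache status |
| **clear_cache**        | Cache management               | None; clears all cached data |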

## 🎯 Quick Navigation

- [User Guide](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/6-User-Guide.md)
- [Architecture](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/1-Framework.md)
- [Development Guide](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/2-Development.md)
- [Testing Guide](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/3-Testing.md)
- [Configuration](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/4-Configuration.md)
- [Common Commands](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/5-Commands.md)
- [Changelog](https://github.com/ThreeFish-AI/data-extractor/blob/master/CHANGELOG.md)

## 🤝 Contribution

[Issues](https://github.com/ThreeFish-AI/data-extractor/issues) and [Pull Requests](https://github.com/ThreeFish-AI/data-extractor/pulls) that improve this project are welcome.

## 📄 License

MIT License. See the [LICENSE](LICENSE) file for details.

---

**Note**: Use this tool responsibly. Comply with each website's terms of use and robots.txt rules, and respect its intellectual property.
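
As one way to check robots.txt rules before scraping, independent of this package's `check_robots_txt` tool, Python's standard-library `urllib.robotparser` can evaluate a rule set against a URL. The sample rules, the `my-crawler` user agent, and the URLs below are made up so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Inline sample rules so the sketch runs without network access; a real
# check would call set_url(...) and read() to fetch the site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-crawler" is a hypothetical user-agent name.
allowed = parser.can_fetch("my-crawler", "https://example.com/docs/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed, blocked)
```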
