Metadata-Version: 2.4
Name: sim-pdf-chunker
Version: 0.0.1
Summary: A simple PDF chunking utility for splitting PDF text into manageable pieces
Home-page: https://github.com/uwpark/sim-pdf-chunker
Author: uwpark
License: MIT
Project-URL: Homepage, https://github.com/uwpark/sim-pdf-chunker
Project-URL: Issues, https://github.com/uwpark/sim-pdf-chunker/issues
Keywords: pdf,chunking,text-splitting,nlp,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypdf>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# sim-pdf-chunker

A simple Python package that extracts text from PDF files and splits it into chunks of a specified size. It is useful for RAG (Retrieval-Augmented Generation) pipelines and for preprocessing text before embedding.

> The install name is `sim-pdf-chunker`; the import name is `pdf_chunker`.

## Installation

```bash
pip install sim-pdf-chunker
```

To install from a local checkout (editable mode):

```bash
pip install -e .
```

## Usage

### Python API

```python
from pdf_chunker import PDFChunker

chunker = PDFChunker(chunk_size=1000, chunk_overlap=200)

# Create chunks from a file path
chunks = chunker.chunk_file("sample.pdf")
for c in chunks:
    print(c.page, c.index, c.text[:80])

# Create chunks directly from text
chunks = chunker.chunk_text("long text...")
```

### CLI

```bash
sim-pdf-chunker sample.pdf --chunk-size 1000 --overlap 200 --output chunks.jsonl
```
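
The output file can then be consumed line by line. The following is a minimal sketch, assuming each line of `chunks.jsonl` is a JSON object carrying the same `page`, `index`, and `text` fields as the chunk objects in the Python API; that schema is an assumption and is not confirmed by this README. The helper name `iter_chunk_records` is hypothetical.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator


def iter_chunk_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSONL line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory sample standing in for chunks.jsonl.
# The field names below are assumed, not taken from the real CLI output.
sample = io.StringIO(
    '{"page": 1, "index": 0, "text": "first chunk"}\n'
    '{"page": 1, "index": 1, "text": "second chunk"}\n'
)

records = list(iter_chunk_records(sample))
for record in records:
    print(record["page"], record["index"], record["text"][:80])
```

To read a real file, pass an open handle instead of the sample, for example `open("chunks.jsonl", encoding="utf-8")`.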

## Options

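- `chunk_size` (int): maximum number of characters per chunk (default 1000)
- `chunk_overlap` (int): number of characters shared between adjacent chunks (default 200)
- `separator` (str): separator at which chunks are preferentially split (default `"\n"`)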
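
To illustrate how these three options might interact, here is a minimal sliding-window sketch. It is not the package's actual implementation: the function name `sketch_chunk_text` and the exact cutting rule (cut after the last separator inside the window, as long as the window still advances) are assumptions made for illustration only.

```python
from typing import List


def sketch_chunk_text(text: str, chunk_size: int = 1000,
                      chunk_overlap: int = 200,
                      separator: str = "\n") -> List[str]:
    """Illustrative chunker; hypothetical, not the package's real code."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text) and separator:
            # Prefer to cut just after the last separator in the window,
            # but only if the chunk stays longer than the overlap so the
            # next window is guaranteed to move forward.
            cut = text.rfind(separator, start, end)
            if cut != -1 and cut + len(separator) - start > chunk_overlap:
                end = cut + len(separator)
        chunks.append(text[start:end])
        if end == len(text):
            break
        # The next chunk starts chunk_overlap characters before this one ends.
        start = end - chunk_overlap
    return chunks


print(sketch_chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

In this sketch every chunk is at most `chunk_size` characters long, and each chunk begins with the last `chunk_overlap` characters of the previous one.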

## License

MIT
