Metadata-Version: 2.4
Name: kr-building-name-normalizer
Version: 0.2.0
Summary: Normalize Korean apartment, building, and complex names into stable English display names.
Project-URL: Homepage, https://github.com/yeongseon/kr-building-name-normalizer
Project-URL: Repository, https://github.com/yeongseon/kr-building-name-normalizer
Project-URL: Issues, https://github.com/yeongseon/kr-building-name-normalizer/issues
Author: Yeongseon Choe
License-Expression: MIT
License-File: LICENSE
Keywords: apartment,building-names,korean,revised-romanization,romanization,transliteration
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: juso
Requires-Dist: httpx>=0.27; extra == 'juso'
Description-Content-Type: text/markdown

# kr-building-name-normalizer

[![PyPI](https://img.shields.io/pypi/v/kr-building-name-normalizer.svg)](https://pypi.org/project/kr-building-name-normalizer/)
[![Python Version](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue)](https://pypi.org/project/kr-building-name-normalizer/)
[![CI](https://github.com/yeongseon/kr-building-name-normalizer/actions/workflows/ci-test.yml/badge.svg)](https://github.com/yeongseon/kr-building-name-normalizer/actions/workflows/ci-test.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

Normalize Korean apartment, building, and complex names into stable English display names.

## Why this exists

Korean apartment and building names combine brand names, geographic names, and descriptive terms, and they are often written without spaces. Generic romanization tools produce garbled output because they transliterate every syllable phonetically and know nothing about the domain:

| Input | korean-romanizer | This library |
|-------|-----------------|--------------|
| `래미안 강남 센트럴파크` | `raemian gangnam senteureolpakeu` | `Raemian Gangnam Central Park` |
| `래미안강남센트럴파크` | `raemiangangnamsenteureolpakeu` | `Raemian Gangnam Central Park` |
| `힐스테이트` | `hilseuteiteu` | `Hillstate` |
| `그랜드밸리` | `geuraendeubaelri` | `Grand Valley` |
| `e편한세상` | `epyeonhansesang` | `e-Pyeonhansesang` |
| `SK뷰` | `skbyu` | `SK View` |

**Key differences from generic romanizers:**

- **Brand recognition**: 80+ apartment brands (래미안→Raemian, 힐스테이트→Hillstate, 자이→Xi)
- **Loanword restoration**: 90+ loanwords restored to English (밸리→Valley, 타운→Town, 그랜드→Grand)
- **No-space compound handling**: Dictionary-based longest-match tokenizer splits `래미안강남센트럴파크` correctly
- **Preferred English forms**: Uses established English names (센트럴파크→Central Park) instead of Revised Romanization (RR) transliteration
- **Geographic names**: Seoul districts and neighborhoods with standard English spellings
- **Zero required dependencies**: Pure Python, no external API calls needed (the optional `juso` extra pulls in `httpx`)

## Installation

```bash
pip install kr-building-name-normalizer
```

## Quick Start

```python
from kr_building_normalizer import romanize

# Basic usage
romanize("래미안 강남 센트럴파크")
# → "Raemian Gangnam Central Park"

# No-space compounds work too
romanize("래미안강남센트럴파크")
# → "Raemian Gangnam Central Park"

# Mixed Korean/ASCII
romanize("SK뷰")
# → "SK View"

# Unknown names fall back to Revised Romanization
romanize("한빛마을")
# → "hanbitma-eul"
```
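
The Revised Romanization fallback in the last example can be pictured with a small, self-contained sketch. This is not the library's implementation: it romanizes each Hangul syllable in isolation using the standard Unicode decomposition, and it ignores sound changes across syllable boundaries as well as hyphenation, so its output can differ from what `romanize` returns. The name `naive_rr` is hypothetical.

```python
# Naive per-syllable Revised Romanization (illustrative sketch only).
# Jamo tables follow the Unicode ordering of initials, vowels, and finals.
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]

HANGUL_BASE = 0xAC00   # first precomposed syllable (가)
HANGUL_LAST = 0xD7A3   # last precomposed syllable (힣)


def naive_rr(text: str) -> str:
    """Romanize precomposed Hangul syllables; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            offset = code - HANGUL_BASE
            initial = offset // (21 * 28)      # 21 vowels x 28 finals per initial
            vowel = (offset % (21 * 28)) // 28
            final = offset % 28
            out.append(INITIALS[initial] + VOWELS[vowel] + FINALS[final])
        else:
            out.append(ch)
    return "".join(out)


print(naive_rr("강남"))  # gangnam
print(naive_rr("서초"))  # seocho
```

A complete romanizer also needs the assimilation and liaison rules of Revised Romanization, which this sketch deliberately leaves out.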

## How it works

```
Input → Normalize → Tokenize (longest-match) → Lookup/RR → Join → Output

Pipeline for "래미안강남센트럴파크":
  1. Normalize whitespace
  2. Tokenize: ["래미안", "강남", "센트럴파크"] (longest-match from dictionaries)
  3. Lookup each token:
     - "래미안" → brands.json → "Raemian"
     - "강남" → geo_names.json → "Gangnam"
     - "센트럴파크" → preferred_terms.json → "Central Park"
  4. Join: "Raemian Gangnam Central Park"
```
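
The tokenize step above can be sketched as a greedy longest-match scan. This is a minimal illustration under stated assumptions, not the library's actual tokenizer: the function name `tokenize`, the `vocab` set, and the rule that unmatched characters are grouped into a single unknown token are all hypothetical.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization against a set of known terms."""
    max_len = max(map(len, vocab), default=1)
    tokens: list[str] = []
    unknown: list[str] = []  # characters that match no dictionary entry

    def flush_unknown() -> None:
        if unknown:
            tokens.append("".join(unknown))
            unknown.clear()

    i = 0
    while i < len(text):
        if text[i].isspace():
            flush_unknown()
            i += 1
            continue
        # Try the longest candidate first, then shrink one character at a time.
        match = None
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                match = candidate
                break
        if match is None:
            unknown.append(text[i])
            i += 1
        else:
            flush_unknown()
            tokens.append(match)
            i += len(match)
    flush_unknown()
    return tokens


vocab = {"래미안", "강남", "센트럴", "파크", "센트럴파크"}
print(tokenize("래미안강남센트럴파크", vocab))
# ['래미안', '강남', '센트럴파크']
```

Trying the longest candidate first is what makes `센트럴파크` win over the shorter `센트럴` and `파크` entries in this example.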

Dictionary lookup order: brands → geo_names → preferred_terms → building_types → loanwords → RR fallback.
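
One way to picture this lookup order is a list of dictionaries searched in sequence, where the first hit wins. The sketch below is an assumption about the shape of the logic, not the library's code: the names `LAYERS`, `lookup`, and `passthrough` are hypothetical, the dictionaries hold only the example entries shown in this README, and the fallback is a placeholder that returns the token unchanged where a real pipeline would apply Revised Romanization.

```python
from collections.abc import Callable

# Example entries taken from this README; the bundled files contain many more.
BRANDS = {"래미안": "Raemian", "힐스테이트": "Hillstate", "자이": "Xi"}
GEO_NAMES = {"강남": "Gangnam", "서초": "Seocho", "마포": "Mapo"}
PREFERRED_TERMS = {"센트럴파크": "Central Park", "타워": "Tower"}
BUILDING_TYPES = {"아파트": "Apartment", "빌라": "Villa"}
LOANWORDS = {"밸리": "Valley", "타운": "Town", "그랜드": "Grand"}

# Earlier layers take precedence over later ones.
LAYERS = [BRANDS, GEO_NAMES, PREFERRED_TERMS, BUILDING_TYPES, LOANWORDS]


def passthrough(token: str) -> str:
    """Placeholder fallback; a real pipeline would romanize the token here."""
    return token


def lookup(
    token: str,
    layers: list[dict[str, str]] = LAYERS,
    fallback: Callable[[str], str] = passthrough,
) -> str:
    """Return the first mapping found for the token, else the fallback result."""
    for table in layers:
        if token in table:
            return table[token]
    return fallback(token)


tokens = ["래미안", "강남", "센트럴파크"]
print(" ".join(lookup(t) for t in tokens))
# Raemian Gangnam Central Park
```

Keeping the layers in an ordered list makes the precedence explicit: a term that appears in more than one dictionary resolves to the earliest layer that contains it.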

## Bundled Data

| Dictionary | Entries | Examples |
|-----------|---------|---------|
| `brands.json` | 80+ | 래미안→Raemian, 힐스테이트→Hillstate, 자이→Xi |
| `geo_names.json` | 50+ | 강남→Gangnam, 서초→Seocho, 마포→Mapo |
| `preferred_terms.json` | 30+ | 센트럴파크→Central Park, 타워→Tower |
| `building_types.json` | 10+ | 아파트→Apartment, 빌라→Villa |
| `loanwords.json` | 90+ | 밸리→Valley, 타운→Town, 그랜드→Grand |

Loanword mapping sources are documented in [SOURCES.md](SOURCES.md).

## Comparison with alternatives

| Feature | korean-romanizer | hangul-romanize | **kr-building-name-normalizer** |
|---------|-----------------|-----------------|-------------------------------|
| Revised Romanization | ✅ | ✅ | ✅ |
| Brand name dictionary | ❌ | ❌ | ✅ 80+ brands |
| No-space tokenization | ❌ | ❌ | ✅ longest-match |
| Preferred English forms | ❌ | ❌ | ✅ |
| Geographic names | ❌ | ❌ | ✅ Seoul districts |
| Zero dependencies | ✅ | ✅ | ✅ |
| Building domain focus | ❌ | ❌ | ✅ |

## Development

```bash
# Clone and install
git clone https://github.com/yeongseon/kr-building-name-normalizer.git
cd kr-building-name-normalizer
uv sync --extra dev

# Run quality checks
make check-all
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full development guide.

## License

MIT. See [LICENSE](LICENSE) for the full text.
