Metadata-Version: 2.2
Name: atom-hifi
Version: 0.3.0
Summary: Atom-HiFi: atomistic high-fidelity representative-set selection framework
Author-email: Yihua Song <mothinesong@gmail.com>
License: MIT License
        
        Copyright (c) 2024 Yihua Song
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi
Project-URL: Source, https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi
Project-URL: Issue Tracker, https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi/-/issues
Keywords: machine learning,interatomic potentials,training set,SOAP,atomic environments,active learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Physics
Classifier: Topic :: Scientific/Engineering :: Chemistry
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: ase
Requires-Dist: matplotlib
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: dscribe
Requires-Dist: pymoo>=0.6

# Atom-HiFi

**Atom**istic **Hi**gh-**Fi**delity representative-set selection framework.

Applications include:
- MLIP training-set curation and active-learning loops
- Chemical motif identification and distribution analysis
- Diversity-aware structure sampling from large databases

---

## What is Atom-HiFi?

Atom-HiFi selects the smallest subset **S** of structures such that the
atomic-environment distribution of **S** covers the full library with
user-specified fidelity **F**.  It works with any per-atom descriptor
(ED-SOAP built-in; MACE ACE supported) and is agnostic to the downstream
task — training-set curation, motif analysis, or database sampling.

---

## Key concepts

### Fidelity / Redundancy (F/R)

Each atom is assigned to a **microstate**: a Voronoi cell of a k-means
partition of the whitened descriptor space.  **Fidelity** F compares the Shannon
entropy of the selected set's microstate populations with that of the full
library; **Redundancy** R compares the mean number of atoms per occupied
microstate in the selected set with that in the full library.

```
F = H(S) / H(L)
        H = -Σ p_i ln p_i   (Shannon entropy over microstate populations)

R = (N_S / k_occ^S) / (N_L / k_occ^L)
        N_S, N_L       = total atoms in selected set / full library
        k_occ^S, k_occ^L = occupied microstates in S / L
```
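
As a minimal sketch of the two formulas above (not the package's own implementation), F and R can be computed from flat lists of per-atom microstate labels. The function names and the toy labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (natural log) of the microstate populations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def fidelity_redundancy(lib_labels, sel_labels):
    """Return (F, R) for a selected set S drawn from a library L.

    Both arguments are flat sequences of per-atom microstate labels.
    """
    # F = H(S) / H(L); undefined if the library occupies a single microstate.
    F = shannon_entropy(sel_labels) / shannon_entropy(lib_labels)
    # R = (N_S / k_occ^S) / (N_L / k_occ^L)
    atoms_per_state_S = len(sel_labels) / len(set(sel_labels))
    atoms_per_state_L = len(lib_labels) / len(set(lib_labels))
    R = atoms_per_state_S / atoms_per_state_L
    return F, R


# Toy library (hypothetical): 12 atoms spread evenly over 3 microstates.
lib_labels = [0] * 4 + [1] * 4 + [2] * 4
# Toy selection (hypothetical): 4 atoms that still touch every microstate.
sel_labels = [0, 0, 1, 2]

F, R = fidelity_redundancy(lib_labels, sel_labels)
print(f"F = {F:.3f}, R = {R:.3f}, F/R = {F / R:.3f}")
```

In this toy case the selection keeps every microstate occupied while holding far fewer atoms per microstate than the library, so R falls well below 1 while F stays close to 1.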

The scan sweeps a bandwidth parameter **c** (scaling factor on ε_noise) and
finds the operating point **c*** that maximises F/R subject to F ≥ F_TOL
(default 0.90).
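
The selection rule for **c*** can be sketched as a constrained maximum over the scan points. This is an illustration under assumptions: the `(c, F, R)` tuple layout, the function name `pick_operating_point`, and the numbers are hypothetical and do not reflect the format of `fine_scan.out`.

```python
def pick_operating_point(scan, f_tol=0.90):
    """Return the scan row with the largest F/R among rows with F >= f_tol.

    `scan` is a list of (c, F, R) tuples; returns None if no row qualifies.
    """
    feasible = [row for row in scan if row[1] >= f_tol]
    if not feasible:
        return None
    return max(feasible, key=lambda row: row[1] / row[2])


# Hypothetical scan results, one (c, F, R) tuple per scan point.
scan = [
    (0.5, 0.99, 0.80),
    (1.0, 0.95, 0.40),
    (1.5, 0.91, 0.25),
    (2.0, 0.85, 0.15),  # highest F/R, but violates F >= 0.90
]

c_star, F_star, R_star = pick_operating_point(scan)
print(f"c* = {c_star} (F = {F_star}, R = {R_star})")
```

The last row has the largest F/R ratio but is rejected by the fidelity constraint, which is why the constraint is applied before the maximum is taken.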

### ED-SOAP descriptor

Two concatenated SOAP power-spectrum vectors per atom — one short-range
(bonding geometry) and one long-range (coordination shell) — normalised by a
system-specific `lengthscale`.  No GPU required.  The full parameter set is
exposed in `fr_workflow_tutorial.py` under the `EDS_*` variables.

---

## Installation

**Step 1 — install `decaf`** (required; not on PyPI):

```bash
pip install git+https://gitlab.mpcdf.mpg.de/klai/decaf.git
```

**Step 2 — install Atom-HiFi**:

```bash
pip install atom-hifi
```

> Python ≥ 3.9 required.

---

## Quick start

Copy the tutorial script to your working directory and set the four variables at
the top:

```bash
cp fr_workflow_tutorial.py ./my_run.py
```

Edit `my_run.py`:

```python
LIB_PATH       = 'train_structs.xyz'   # ASE-readable structure library
FOCUS_ELEMENTS = ['Ni', 'O']           # elements to cluster on
DESCRIPTOR     = 'eds'                 # 'eds', 'ace', or 'custom'
OUTPUT_DIR     = 'fr_results'
```

Run:

```bash
mkdir -p fr_results   # tee needs the directory to exist before the script starts
python -u my_run.py 2>&1 | tee fr_results/run.out
```

---

## Output files

| File | Description |
|---|---|
| `representatives.xyz` | Selected representative structures |
| `fine_scan.out` | F, R, FR, \|S\|, atoms for every fine-scan point |
| `FR_final.png` | Coarse + fine F/R scan diagnostic plot |
| `learning_curve.png` | AL loop convergence (only with `RUN_LOOP=True`) |
| `eps_noise_raw.npz` | Cached per-element ε_noise values |
| `desc_lib.pkl` | Cached per-structure descriptors |
| `surroundings_{el}.xyz` | Per-group coordination spheres (`EXTRACT_SURROUNDINGS=True`) |

---

## Configuration reference

All settings live at the top of `fr_workflow_tutorial.py`.

| Group | Variables |
|---|---|
| **Paths** | `LIB_PATH`, `PATIENT_PATH`, `FOCUS_ELEMENTS`, `OUTPUT_DIR` |
| **Descriptor** | `DESCRIPTOR`, `EDS_LENGTHSCALE`, `EDS_S_CUT`, `EDS_S_NMAX`, `EDS_S_LMAX`, `EDS_L_CUT`, `EDS_L_NMAX`, `EDS_L_LMAX`, `EDS_PERIODIC`, `EDS_R_CUT` |
| **Scan** | `F_TOL`, `N_COARSE`, `N_FINE`, `N_JOBS`, `C_FACTOR_RANGE` |
| **Refit** | `REFIT_DELTA`, `REFIT_GRID_POINT` |
| **Optional stages** | `RUN_LOOP`, `RUN_GRID_SCAN`, `RUN_NSGA2`, `EXTRACT_SURROUNDINGS` |

Full inline documentation for every variable is in the tutorial script.

---

## Advanced usage

<details>
<summary>Active-learning loop (<code>RUN_LOOP=True</code>)</summary>

Iteratively expands the training pool by sampling batches from the full library.
Inner iterations use a coarse scan only; one final fine scan runs at the end.
Set `INITIAL_SAMPLE` and `LOOP_SKIP_FINE_SCAN` to control the initial pool size
and inner-scan resolution.

</details>

<details>
<summary>Per-element ND grid scan (<code>RUN_GRID_SCAN=True</code>)</summary>

Sweeps independent c-factors per focus element on a Cartesian grid, reusing
cached per-element DECAF fits from the 1-D scan.  With n grid points per element
and N_el focus elements, the cost is O(n^N_el) cover evaluations instead of
O(n^N_el × N_el) DECAF fits, which is tractable for N_el ≤ 3–4.
Results in `scan_grid.csv` and `scan_grid_report.png`.
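
To make the cost argument concrete, the sketch below enumerates a Cartesian grid of per-element c-factors with `itertools.product`. The helper name `c_factor_grid` and the grid values are hypothetical; the comparison of fit counts assumes that one cached fit per element and c value is enough to serve every grid point.

```python
import itertools


def c_factor_grid(elements, c_values):
    """Yield one {element: c} dict per point of the Cartesian grid."""
    for combo in itertools.product(c_values, repeat=len(elements)):
        yield dict(zip(elements, combo))


elements = ["Ni", "O"]        # focus elements (N_el = 2)
c_values = [0.5, 1.0, 1.5]    # hypothetical grid, n = 3 points per element

grid = list(c_factor_grid(elements, c_values))

# Cover evaluations grow as n^N_el; cached fits (assumed) grow as n * N_el.
n_cover_evaluations = len(grid)
n_cached_fits = len(c_values) * len(elements)
print(n_cover_evaluations, "grid points,", n_cached_fits, "cached fits")
```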

</details>

<details>
<summary>NSGA-II Pareto optimisation (<code>RUN_NSGA2=True</code>)</summary>

Stochastic multi-objective optimisation of per-element c-factors via NSGA-II
(requires `pymoo`).  Use when the grid is too large (N_el ≥ 4) or you want a
continuous Pareto front.  Results in `pareto_front.csv` and three diagnostic
PNGs.
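
For orientation, the idea of a Pareto front can be illustrated with a plain non-dominated filter. This is not NSGA-II and does not use `pymoo`; it only shows which candidate points would survive. The choice of objectives (maximise F, minimise R) is an assumption, and the candidate values are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated subset of (F, R) points.

    Assumes F is to be maximised and R is to be minimised.
    """
    def dominates(a, b):
        # a is at least as good as b in both objectives and differs from b,
        # so it is strictly better in at least one of them.
        return a[0] >= b[0] and a[1] <= b[1] and a != b

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (F, R) outcomes for five c-factor combinations.
candidates = [
    (0.99, 0.80),
    (0.95, 0.40),
    (0.93, 0.45),  # dominated by (0.95, 0.40)
    (0.91, 0.25),
    (0.85, 0.30),  # dominated by (0.91, 0.25)
]

front = pareto_front(candidates)
print(front)
```

The filter is quadratic in the number of points, which is fine for inspecting a result file but is not a substitute for the evolutionary search itself.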

</details>

<details>
<summary>Representative environment extraction (<code>EXTRACT_SURROUNDINGS=True</code>)</summary>

Exports the local coordination sphere around the centroid-closest atom of each
DECAF group.  Two modes: `'sphere'` (non-periodic ASE Atoms cluster) and
`'full_structure'` (original cell with center/neighbour/rest tags).  Output:
`surroundings_{el}.xyz` per focus element.
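
The "centroid-closest atom" step can be sketched in pure Python as follows. The function name, the Euclidean metric, and the toy 2-D descriptors are assumptions made for illustration; the package may use a different metric or data layout.

```python
import math
from collections import defaultdict


def centroid_closest(descriptors, groups):
    """Map each group label to the index of its atom nearest the group centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    closest = {}
    for label, idxs in members.items():
        dim = len(descriptors[idxs[0]])
        # Component-wise mean of the descriptors in this group.
        centroid = [
            sum(descriptors[i][d] for i in idxs) / len(idxs) for d in range(dim)
        ]
        # Euclidean distance to the centroid (assumed metric).
        closest[label] = min(
            idxs, key=lambda i: math.dist(descriptors[i], centroid)
        )
    return closest


# Hypothetical 2-D descriptors for six atoms in two groups.
descriptors = [
    (0.0, 0.0), (1.0, 0.0), (4.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (10.0, 15.0),
]
groups = ["a", "a", "a", "b", "b", "b"]

representatives = centroid_closest(descriptors, groups)
print(representatives)
```

The returned indices identify the atoms whose surroundings would then be cut out in either `'sphere'` or `'full_structure'` mode.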

</details>

---

## Citation

If you use Atom-HiFi in your research, please cite:

> [paper in preparation — citation will be added upon publication]
