Metadata-Version: 2.4
Name: vcti-fileloader-hdf5
Version: 1.0.0
Summary: HDF5 file loader using h5py — tree extraction, node metadata, and dataset loading for the vcti-fileloader framework
Author: Visual Collaboration Technologies Inc.
Requires-Python: <3.15,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: h5py>=3.0
Requires-Dist: vcti-fileloader>=1.0.0
Requires-Dist: vcti-array-tree>=1.0.0
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Provides-Extra: lint
Requires-Dist: ruff; extra == "lint"
Provides-Extra: typecheck
Requires-Dist: mypy; extra == "typecheck"
Dynamic: license-file

# FileLoader HDF5

HDF5 file loader using h5py — tree extraction, node metadata, and dataset loading for the vcti-fileloader framework.

## When to Use This Loader

Use `vcti-fileloader-hdf5` when you need to inspect the **structure** of an
HDF5 file — groups, datasets, attributes — without reading every dataset
array into memory upfront. The separated loading design lets you:

- Browse the tree hierarchy first, then fetch only the datasets you need.
- Retrieve node metadata (names, types, byte sizes) for display or filtering
  before committing to a full data load.
- Load attributes selectively by node ID instead of scanning the whole file.

If you only need raw array access without tree/metadata introspection, use
h5py directly.

## Installation

```bash
pip install "vcti-fileloader-hdf5>=1.0.0"
```

---

## Quick Start

```python
from pathlib import Path
from vcti.fileloader_hdf5 import H5pyLoader, get_loader_descriptor
from vcti.fileloader import LoaderRegistry

# Context manager (recommended)
loader = H5pyLoader()
with loader.open(Path("data.h5")) as handle:
    tree = loader.load_tree(handle)
    info = loader.load_node_info(handle)
    node = loader.load_dataset(handle, node_id=2)

# Manual load/unload
loader = H5pyLoader()
handle = loader.load(Path("data.h5"))
try:
    tree = loader.load_tree(handle)
finally:
    loader.unload(handle)

# Registry-based usage
registry = LoaderRegistry()
registry.register(get_loader_descriptor())
desc = registry.get("hdf5-h5py-loader")
with desc.loader.open(Path("data.h5")) as handle:
    tree = desc.loader.load_tree(handle)
```

---

## Example Output

### `load_tree()` — structured array

Each row represents a node in the HDF5 hierarchy. Pointers use node IDs
(0 = no link).

```
id  parent_id  first_child_id  prev_sibling_id  next_sibling_id
 1          0               2                0                0   ← / (root)
 2          1               4                0                3   ← results/
 3          1               0                2                0   ← ids
 4          2               0                0                0   ← results/stress
```
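
The pointer columns above are enough to walk the hierarchy without touching the file again. The following is a minimal sketch that rebuilds the four example rows by hand and lists the children of a node by following `first_child_id` and then `next_sibling_id`. The field names are assumed to match the column headers shown above, the integer dtype is an assumption, and `children_of` is a hypothetical helper, not part of the loader.

```python
import numpy as np

# Assumption: field names mirror the column headers in the example output.
tree_dtype = np.dtype([
    ("id", np.int64),
    ("parent_id", np.int64),
    ("first_child_id", np.int64),
    ("prev_sibling_id", np.int64),
    ("next_sibling_id", np.int64),
])

# The four example rows, rebuilt by hand instead of calling load_tree().
tree = np.array(
    [(1, 0, 2, 0, 0), (2, 1, 4, 0, 3), (3, 1, 0, 2, 0), (4, 2, 0, 0, 0)],
    dtype=tree_dtype,
)


def children_of(tree, node_id):
    """Return the child IDs of node_id in sibling order (hypothetical helper)."""
    rows = {int(row["id"]): row for row in tree}
    children = []
    child = int(rows[node_id]["first_child_id"])
    while child != 0:  # 0 means "no link"
        children.append(child)
        child = int(rows[child]["next_sibling_id"])
    return children


root_children = children_of(tree, 1)     # children of /
results_children = children_of(tree, 2)  # children of results/
leaf_children = children_of(tree, 3)     # ids is a dataset, so no children
```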

### `load_node_info()` — structured array

```
id  name             type     size
 1  /                group       0
 2  results          group       0
 3  ids              dataset    24   ← 3 × int64 = 24 bytes
 4  results/stress   dataset    24   ← 3 × float64 = 24 bytes
```
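
Because the node info is a structured array, filtering before a data load can be done with ordinary NumPy boolean masks. The sketch below rebuilds the example rows by hand and selects only the datasets. The field names are assumed from the column headers, and the string widths in the dtype are illustrative guesses rather than the loader's actual dtype.

```python
import numpy as np

# Assumption: field names follow the column headers; widths are illustrative.
info_dtype = np.dtype([
    ("id", np.int64),
    ("name", "U64"),
    ("type", "U16"),
    ("size", np.int64),
])

# The four example rows, rebuilt by hand instead of calling load_node_info().
info = np.array(
    [
        (1, "/", "group", 0),
        (2, "results", "group", 0),
        (3, "ids", "dataset", 24),
        (4, "results/stress", "dataset", 24),
    ],
    dtype=info_dtype,
)

# Keep only dataset rows, then collect their IDs and total byte size.
datasets = info[info["type"] == "dataset"]
dataset_ids = datasets["id"].tolist()
total_bytes = int(datasets["size"].sum())
```

The resulting IDs could then be passed one at a time to `load_dataset()`.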

### `load_dataset()` — DataNode

```python
node = loader.load_dataset(handle, node_id=4)
node.data          # np.array([1.0, 2.0, 3.0])
node.attributes    # {'units': 'MPa', 'type': 'dataset', 'shape': (3,), 'dtype': 'float64'}
```

---

## API

### H5pyLoader

| Method | Description |
|--------|-------------|
| `load(path, **options)` | Open HDF5 file, return h5py.File handle |
| `open(path, **options)` | Context manager — loads and auto-unloads |
| `unload(data)` | Close HDF5 file and clear cached mappings |
| `can_load(path)` | Check extension (.h5, .hdf5, .he5) |
| `load_tree(data)` | Tree structure as structured array |
| `load_node_info(data)` | Node metadata (id, name, type, size) |
| `load_attributes(data, node_ids)` | Attributes dict per node |
| `load_dataset(data, node_id)` | DataNode with array + attributes |

### Helpers

| Name | Description |
|---|---|
| `get_loader_descriptor()` | Create LoaderDescriptor for registry |
| `H5pyValidator` | Check h5py availability |
| `H5pySetup` | No-op setup (h5py needs no config) |

---

## Error Handling

The loader raises specific exceptions for different failure modes:

```python
from pathlib import Path

from vcti.fileloader import LoadError, UnloadError, UnsupportedFormatError
from vcti.fileloader_hdf5 import H5pyLoader

loader = H5pyLoader()
try:
    with loader.open(Path("data.h5")) as handle:
        node = loader.load_dataset(handle, node_id=99)
except FileNotFoundError:
    # File does not exist at the given path
    ...
except UnsupportedFormatError:
    # File exists but is not a valid HDF5 file
    ...
except LoadError:
    # Other failure during file open (e.g., permissions)
    ...
except KeyError:
    # Node ID not found in load_dataset
    ...
except ValueError:
    # File handle is closed
    ...
```

---

## Performance

### Node map caching

On the first call to any load method, the loader walks the HDF5 hierarchy
once via `h5py.File.visit()` to build a bidirectional **path-to-ID /
ID-to-path** mapping. This mapping is cached per file handle (via
`WeakKeyDictionary`) and reused by all subsequent calls — `load_tree`,
`load_node_info`, `load_attributes`, `load_dataset` — so you never pay
for a second traversal.
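
The caching pattern described here can be sketched in a few lines. This is not the loader's actual implementation: `FakeFile` is a hypothetical stand-in that mimics only the `visit(callback)` method of `h5py.File`, `get_node_map` is an illustrative name, and numbering nodes in visit order (with the root as ID 1) is an assumption that may differ from how the real loader assigns IDs.

```python
from weakref import WeakKeyDictionary


class FakeFile:
    """Hypothetical stand-in for h5py.File; only visit() is mimicked."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.visit_calls = 0  # lets us observe how often we traverse

    def visit(self, callback):
        self.visit_calls += 1
        for path in self._paths:
            callback(path)


# Entries vanish automatically once the handle object is garbage collected.
_node_maps = WeakKeyDictionary()


def get_node_map(handle):
    """Build the path<->ID maps on first use, then reuse the cached pair."""
    cached = _node_maps.get(handle)
    if cached is not None:
        return cached

    path_to_id = {"/": 1}  # assumption: the root gets ID 1

    def record(path):
        path_to_id[path] = len(path_to_id) + 1
        # Implicitly returns None, which tells h5py's visit() to continue.

    handle.visit(record)
    id_to_path = {node_id: path for path, node_id in path_to_id.items()}
    _node_maps[handle] = (path_to_id, id_to_path)
    return _node_maps[handle]


handle = FakeFile(["ids", "results", "results/stress"])
first = get_node_map(handle)
second = get_node_map(handle)  # served from the cache, no second traversal
```

Keying the cache weakly on the handle means no explicit cleanup is needed if a handle is dropped without being unloaded.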

### Memory overhead

The node map stores two Python dicts (path → ID and ID → path), each holding one path string and one integer ID per node.
Rough overhead: **~200-300 bytes per node**. For a file with 100,000 nodes,
expect ~20-30 MB for the mapping alone. The structured arrays returned by
`load_tree` and `load_node_info` add ~20 bytes and ~300 bytes per node
respectively.
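
As a worked version of the arithmetic above, the per-node figures can be scaled to any node count. The helper name `estimate_overhead_mb` is hypothetical, and the sketch uses decimal megabytes (1 MB = 1,000,000 bytes); the per-node byte counts are the rough figures quoted in this section, not measurements.

```python
def estimate_overhead_mb(node_count, bytes_per_node):
    """Rough memory estimate in decimal megabytes (hypothetical helper)."""
    return node_count * bytes_per_node / 1_000_000


nodes = 100_000
map_low_mb = estimate_overhead_mb(nodes, 200)   # node map, lower bound
map_high_mb = estimate_overhead_mb(nodes, 300)  # node map, upper bound
tree_mb = estimate_overhead_mb(nodes, 20)       # load_tree() array
info_mb = estimate_overhead_mb(nodes, 300)      # load_node_info() array
```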

### Traversal time

`h5py.File.visit()` is backed by HDF5's C-level `H5Ovisit`, so
traversal is fast: typically **< 1 second for 100K nodes** on a local SSD.
The bottleneck for large files is usually dataset I/O, not tree walking.

### Filtered vs. full attribute loading

- `load_attributes(handle)` — reads attributes for **every** node. Use
  this when you need a complete picture (e.g., building a search index).
- `load_attributes(handle, node_ids=np.array([2, 5]))` — reads only the
  specified nodes. Prefer this when you know which nodes you need, as it
  avoids touching unrelated HDF5 objects.

### Full array loading

`load_dataset()` reads the entire dataset into memory via `obj[:]`. For
very large datasets (multi-GB), consider using h5py slicing directly on
the file handle instead.
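
A minimal sketch of that alternative, using plain h5py on an in-memory file so nothing is written to disk. The file name, dataset path, and chunk size are illustrative; with the loader, the `h5py.File` object would be the handle returned by `load()` or `open()`.

```python
import h5py
import numpy as np

# In-memory HDF5 file (core driver, no backing store) for illustration only.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("results/stress", data=np.arange(10, dtype=np.float64))
    dset = f["results/stress"]

    # Slicing reads only the requested elements, not the whole dataset.
    head = dset[:3]

    # Process the dataset in fixed-size chunks to bound peak memory.
    chunk_size = 4
    total = 0.0
    for start in range(0, dset.shape[0], chunk_size):
        total += float(dset[start:start + chunk_size].sum())
```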

---

## Thread Safety

h5py file handles are **not thread-safe**. Do not share a single
`h5py.File` handle across threads. Instead, open a separate handle per
thread or serialize access with a lock.
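
If a shared handle is unavoidable, the lock-based option could look like the sketch below. `LockedHandle` is a hypothetical wrapper, not part of this package, and a plain dict stands in for an open `h5py.File` so the example runs without a file on disk.

```python
import threading


class LockedHandle:
    """Hypothetical wrapper that serializes every call made through it."""

    def __init__(self, handle):
        self._handle = handle
        self._lock = threading.Lock()

    def run(self, func):
        # Only one thread at a time may touch the underlying handle.
        with self._lock:
            return func(self._handle)


# A dict stands in for an open h5py.File handle in this sketch.
shared = LockedHandle({"ids": [1, 2, 3]})
results = []


def worker():
    results.append(shared.run(lambda h: sum(h["ids"])))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Opening a separate handle per thread avoids the lock entirely and is usually the simpler choice for read-only workloads.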

---

## Dependencies

- [h5py](https://www.h5py.org/) (>=3.0)
- [numpy](https://numpy.org/) (>=1.24)
- [vcti-fileloader](https://pypi.org/project/vcti-fileloader/) (>=1.0.0)
- [vcti-array-tree](https://pypi.org/project/vcti-array-tree/) (>=1.0.0) — DataNode
