Metadata-Version: 2.4
Name: harbor-vm
Version: 0.1.5
Summary: A Harbor distribution focused on VM-backed agent evaluation for storage, distributed filesystem, and network systems.
Author: Alex Shaw
Author-email: Alex Shaw <alexgshaw64@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Requires-Dist: pydantic>=2.11.7
Requires-Dist: shortuuid>=1.0.13
Requires-Dist: typer>=0.16.0
Requires-Dist: requests>=2.32.4
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=14.1.0
Requires-Dist: toml>=0.10.2
Requires-Dist: tenacity>=9.1.2
Requires-Dist: python-dotenv>=1.1.1
Requires-Dist: litellm>=1.80.8
Requires-Dist: jinja2>=3.1.6
Requires-Dist: dirhash>=0.5.0
Requires-Dist: dockerfile-parse>=2.0.1
Requires-Dist: e2b>=2.4.2
Requires-Dist: datasets>=4.4.1
Requires-Dist: runloop-api-client>=1.2.0
Requires-Dist: daytona>=0.121.0
Requires-Dist: kubernetes>=32.0.0
Requires-Dist: claude-agent-sdk>=0.1.17
Requires-Dist: packaging>=25.0
Requires-Dist: fastapi>=0.128.0
Requires-Dist: uvicorn>=0.38.0
Requires-Dist: modal>=1.4.0
Requires-Dist: ruff>=0.13.0
Requires-Dist: pathspec>=1.0.3
Requires-Dist: supabase>=2.28.2
Requires-Dist: libvirt-python>=10.0.0
Requires-Dist: paramiko>=3.4.0
Requires-Dist: tinker>=0.14.0 ; extra == 'tinker'
Requires-Dist: tinker-cookbook>=0.1.0 ; extra == 'tinker'
Requires-Python: >=3.12
Provides-Extra: tinker
Description-Content-Type: text/markdown

# Harbor VM

[![](https://dcbadge.limes.pink/api/server/https://discord.gg/6xWPKhGDbA)](https://discord.gg/6xWPKhGDbA)
[![Docs](https://img.shields.io/badge/Docs-000000?style=for-the-badge&logo=mdbook&color=105864)](https://harborframework.com/docs)
[![Cookbook](https://img.shields.io/badge/Cookbook-000000?style=for-the-badge&logo=mdbook&color=105864)](https://github.com/harbor-framework/harbor-cookbook)

Harbor VM is agent benchmarking infrastructure for systems domains that require real virtual machines. It targets:

- **Distributed systems** — consensus protocols, leader election, quorum recovery
- **Network systems** — routing, bridging, iptables, multi-node topologies
- **Distributed filesystems** — NFS, GlusterFS, distributed block storage
- **Storage systems** — LVM, RAID, block devices, kernel modules
- **Research** — kernel tracing, fault injection, SRE recovery tasks

Harbor VM uses the same CLI and task format as Harbor, but runs workloads inside libvirt/KVM virtual machines instead of containers.

## Installation

> **Linux only.** Harbor VM requires a host with `/dev/kvm`. macOS and Windows are not supported.

Install host prerequisites (Debian/Ubuntu):

```bash
sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients \
                 genisoimage libguestfs-tools pkg-config
```
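
Before installing, it can help to confirm that the host has what Harbor VM needs. The following is a minimal pre-flight sketch, not part of Harbor VM itself: `missing_prereqs` is a hypothetical helper, and the command names in the usage line (`virsh`, `genisoimage`, `virt-customize`) are assumptions about which binaries the packages above provide.

```bash
# Hypothetical helper: print one line per missing prerequisite.
# First argument is the KVM device path; the rest are command names to look for.
# Prints nothing when everything is present.
missing_prereqs() {
    local kvm_dev="$1"
    shift
    [ -e "$kvm_dev" ] || echo "missing device: $kvm_dev"
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "missing command: $cmd"
    done
}

# Usage (command names are assumptions, see above):
# missing_prereqs /dev/kvm virsh genisoimage virt-customize
```

On a correctly prepared host the usage line should print nothing.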

Install Harbor VM:

```bash
uv tool install harbor-vm
# or
pip install harbor-vm
```

After installation, two CLI entry points are available:

```bash
harbor-vm --help
hbvm --help
```

## Running a VM task

Run any task from a local directory using the `vm` or `vm-cluster` environment type:

```bash
# Single-node VM task
harbor-vm run --path examples/tasks/vm-filesystem \
              --agent oracle \
              --env vm

# Multi-node cluster task
harbor-vm run --path examples/tasks/vm-nfs-share \
              --agent oracle \
              --env vm-cluster
```

Run against a dataset of VM tasks:

```bash
export ANTHROPIC_API_KEY=<your-key>
harbor-vm run --dataset my-vm-tasks@1.0 \
              --agent claude-code \
              --model anthropic/claude-opus-4-6 \
              --n-concurrent 4
```

Useful flags for debugging:

```bash
--force-build   # rebuild the VM image from scratch (skip cache)
--no-delete     # keep the VM alive after the trial ends for manual inspection
```

## Writing a VM task

A VM task has the same structure as a regular Harbor task, but uses `vm-setup.sh` and optionally `bootstrap.sh` instead of a Dockerfile.

```
my-task/
├── task.toml          # environment type, resources, timeout
├── instruction.md     # natural language description for the agent
├── environment/
│   ├── vm-setup.sh    # installs packages (runs once, cached)
│   └── bootstrap.sh   # optional: sets up initial state after boot
├── solution/
│   └── solve.sh       # reference solution
└── tests/
    └── test.sh        # verifier — writes reward to /logs/verifier/reward.txt
```
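
The layout above can be created by hand, or with a small script. This is a sketch only: `scaffold_task` is a hypothetical helper, not a Harbor VM command, and marking the scripts executable is an assumption rather than a documented requirement. It leaves `task.toml` and `instruction.md` empty and skips the optional `bootstrap.sh`.

```bash
# Hypothetical helper: create the task skeleton shown above under the given directory.
scaffold_task() {
    local dir="$1"
    mkdir -p "$dir/environment" "$dir/solution" "$dir/tests"
    touch "$dir/task.toml" "$dir/instruction.md"
    local script
    for script in environment/vm-setup.sh solution/solve.sh tests/test.sh; do
        # Never overwrite a script that already exists.
        if [ ! -e "$dir/$script" ]; then
            printf '#!/bin/bash\nset -euo pipefail\n' > "$dir/$script"
            chmod +x "$dir/$script"   # assumption: scripts should be executable
        fi
    done
}

# Usage:
# scaffold_task my-task
```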

### task.toml

**Single-node VM:**

```toml
[task]
name = "myorg/my-task"

[environment]
type = "vm"
base_image = "ubuntu-24.04"  # see supported images below
cpus = 2
memory_mb = 4096
storage_mb = 20480           # root disk size in MiB

[agent]
timeout_sec = 300
```

**Multi-node cluster:**

```toml
[task]
name = "myorg/my-cluster-task"

[environment]
type = "vm-cluster"
network = "192.168.100.0/24"  # virtual bridge subnet

[[environment.nodes]]
name = "primary"    # agent connects here; can SSH to other nodes
cpus = 2
memory_mb = 4096
storage_mb = 20480

[[environment.nodes]]
name = "worker"
cpus = 2
memory_mb = 4096
storage_mb = 20480

# Optional: attach extra block devices to a node. In TOML, this sub-table
# belongs to the most recently declared node (here, "worker").
[[environment.nodes.disks]]
name = "data"
size_mb = 10240   # appears as /dev/vdb inside the guest

[agent]
timeout_sec = 600
```

### vm-setup.sh

Runs inside every VM node after first boot (via cloud-init). Use it to install packages. Equivalent to a Dockerfile `RUN` step.

For `vm-cluster` tasks this step is **cached** — the post-setup disk state is saved and reused on subsequent runs, so agents don't wait for package installation on every trial.

```bash
#!/bin/bash
set -euo pipefail

apt-get update -qq
apt-get install -y --no-install-recommends nfs-kernel-server nfs-common
```

### bootstrap.sh (optional)

Runs on the **primary node only** after all nodes are booted and SSH-connected. Use it when the task needs a pre-built running state before the agent starts — for example: bootstrapping a cluster, seeding data, or injecting a fault for the agent to repair.

`bootstrap.sh` is **not cached**. It runs on every trial.

```bash
#!/bin/bash
set -euo pipefail

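# Example: create the initial cluster state the agent needs to work with.
# `my-service` is a placeholder. cluster-info holds name=ip lines, so keep only the IPs.
/usr/local/bin/my-service init --cluster-ips "$(cut -d= -f2 /etc/harbor/cluster-info | paste -sd, -)"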
```

**Rule of thumb:**
- `vm-setup.sh` = what packages are installed (cached, run once)
- `bootstrap.sh` = what state the system is in when the agent starts (per-trial)

### Cluster info

On `vm-cluster` tasks, every node gets `/etc/harbor/cluster-info` with node IPs:

```
primary=192.168.100.10
worker=192.168.100.11
```

The primary node also has `/root/.ssh/id_ed25519` pre-configured to SSH into all other nodes without a password.
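
Scripts often need a single node's address rather than the whole file. A minimal sketch of a lookup, assuming the `name=ip` format shown above: `node_ip` is a hypothetical helper, and the optional second argument exists only so the function can be tried against a local copy of the file.

```bash
# Hypothetical helper: print the IP for a node name from a cluster-info file.
# Exits non-zero if the name is not listed.
node_ip() {
    local name="$1" file="${2:-/etc/harbor/cluster-info}"
    awk -F= -v n="$name" '$1 == n { print $2; found = 1 } END { exit !found }' "$file"
}

# Usage on the primary node (the root login is an assumption):
# ssh -i /root/.ssh/id_ed25519 "root@$(node_ip worker)" hostname
```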

### Writing tests

The verifier (`tests/test.sh`) runs inside the VM and writes a reward between 0.0 and 1.0:

```bash
#!/bin/bash
set -euo pipefail
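REWARD=0

# `some_check_passes` is a placeholder for a real check command.
# The arithmetic below requires `bc` to be installed in the guest.
if some_check_passes; then
    REWARD=$(echo "$REWARD + 0.5" | bc)
fi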

echo "$REWARD" > /logs/verifier/reward.txt
```

For cluster tasks, tests run on the primary node and can SSH to workers:

```bash
VALUE=$(ssh -o StrictHostKeyChecking=no worker "cat /some/file")
```
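
Verifiers with several partial-credit checks can get repetitive. One way to structure them is sketched below, under stated assumptions: `compute_reward` is a hypothetical helper (not something Harbor VM provides), each argument is a `weight:command` pair, and the sum is capped at 1.0 to stay within the reward range described above. It uses `awk` instead of `bc` for the arithmetic.

```bash
# Hypothetical helper: sum the weights of the checks that pass.
# Each argument has the form "weight:command".
compute_reward() {
    local total="0.00" spec weight cmd
    for spec in "$@"; do
        weight="${spec%%:*}"   # text before the first colon
        cmd="${spec#*:}"       # text after the first colon
        if bash -c "$cmd" >/dev/null 2>&1; then
            # awk does the floating-point addition and caps the sum at 1.0.
            total=$(awk -v a="$total" -v b="$weight" \
                'BEGIN { s = a + b; if (s > 1) s = 1; printf "%.2f", s }')
        fi
    done
    echo "$total"
}

# Usage inside tests/test.sh (the checked paths are illustrative only):
# compute_reward "0.5:mountpoint -q /mnt/data" "0.5:test -f /mnt/data/hello.txt" \
#     > /logs/verifier/reward.txt
```

Because a failing check simply contributes nothing, this pattern also works under `set -euo pipefail` without aborting the verifier early.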

## Supported base images

Base images are downloaded automatically from each distribution's official cloud image repository on first use.

| Name | OS |
|---|---|
| `ubuntu-24.04` | Ubuntu 24.04 LTS (default) |
| `ubuntu-22.04` | Ubuntu 22.04 LTS |
| `ubuntu-20.04` | Ubuntu 20.04 LTS |
| `debian-12` | Debian 12 (Bookworm) |
| `rocky-9` | Rocky Linux 9 |

You can also use an absolute path to a custom qcow2 image:

```toml
base_image = "/path/to/custom.qcow2"
```

## Internet isolation

Set `allow_internet = false` to boot VMs without NAT. Package installation still works because `vm-setup.sh` runs via `virt-customize` (pre-boot on the host) before the isolated network is created.

Requires `libguestfs-tools` on the host.

```toml
[environment]
type = "vm"
allow_internet = false
```

## Cleanup

VM work directories and libvirt resources accumulate over time. Clean them with:

```bash
harbor-vm vm clean             # interactive — shows what will be removed
harbor-vm vm clean -f          # skip confirmation
harbor-vm vm clean --dry       # preview only, no changes
harbor-vm vm clean --images    # also remove cached base images
```

## Debugging a failed task

Keep the VM alive after the trial ends:

```bash
harbor-vm run --path my-task --agent oracle --env vm --no-delete
```

Then SSH in manually:

```bash
ssh -i ~/.harbor/vm-work/harbor-<session-id>/id_ed25519 root@192.168.201.10
```

Check VM status with virsh:

```bash
virsh list --all          # list all libvirt domains, including stopped ones
virsh console harbor-xxx  # serial console (Ctrl+] to exit)
virsh net-list --all      # list all libvirt networks, including inactive ones
```

Force-clean a stuck VM or network:

```bash
virsh destroy harbor-xxx          # force-stop the VM
virsh net-destroy harbor-net-xxx  # tear down the virtual network
```
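
When several VMs or networks are stuck at once, the commands above can be applied in a loop. This is a sketch that assumes every Harbor VM resource carries the `harbor-` prefix seen in the examples above; `harbor_names` is a hypothetical filter, and the loops are left commented out so that nothing is stopped by accident.

```bash
# Hypothetical helper: keep only names that start with "harbor-"
# (reads one name per line on stdin).
harbor_names() {
    grep '^harbor-' || true   # "|| true" keeps an empty result from counting as a failure
}

# Stop every running domain and active network with the assumed prefix:
# virsh list --name | harbor_names | while read -r dom; do virsh destroy "$dom"; done
# virsh net-list --name | harbor_names | while read -r net; do virsh net-destroy "$net"; done
```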

## Example tasks

| Task | Type | What it tests |
|---|---|---|
| [`vm-single-node`](examples/tasks/vm-single-node/) | vm | Load a kernel module |
| [`vm-filesystem`](examples/tasks/vm-filesystem/) | vm | Loopback device + ext4 mount |
| [`vm-lvm-storage`](examples/tasks/vm-lvm-storage/) | vm | LVM volumes and snapshots |
| [`vm-network`](examples/tasks/vm-network/) | vm-cluster | iptables NAT routing between nodes |
| [`vm-nfs-share`](examples/tasks/vm-nfs-share/) | vm-cluster | NFS server/client file sharing |
| [`vm-etcd-cluster`](examples/tasks/vm-etcd-cluster/) | vm-cluster | etcd cluster with quorum survival |
| [`vm-cluster`](examples/tasks/vm-cluster/) | vm-cluster | Ceph distributed storage |

See the [VM task guide](https://harborframework.com/docs/tasks/vm-tasks) and [troubleshooting guide](https://harborframework.com/docs/tasks/vm-troubleshooting) for full documentation.

## Running standard Harbor benchmarks

Harbor VM can also run standard Docker-based Harbor workloads:

```bash
export ANTHROPIC_API_KEY=<your-key>
harbor-vm run --dataset terminal-bench@2.0 \
              --agent claude-code \
              --model anthropic/claude-opus-4-6 \
              --n-concurrent 4
```

```bash
harbor-vm datasets list     # browse all supported benchmarks
harbor-vm run --help        # full flag reference
```

## Packaging

```bash
uv build
uv tool install dist/harbor_vm-*.whl
```

The import package is `harbor`; the published distribution name is `harbor-vm`.

## Citation

```bibtex
@software{Harbor_Framework,
author = {{Harbor Framework Team}},
month = jan,
title = {{Harbor: A framework for evaluating and optimizing agents and models in container environments}},
url = {https://github.com/harbor-framework/harbor},
year = {2026}
}
```
