Metadata-Version: 2.4
Name: loghunt
Version: 0.1.0.dev0
Summary: ML-assisted network and log analysis toolkit for security practitioners and threat hunters.
Author-email: David Augros <code@augros.org>
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.26
Requires-Dist: scikit-learn>=1.3
Requires-Dist: hdbscan>=0.8
Requires-Dist: drain3>=0.9
Requires-Dist: tqdm>=4.0
Requires-Dist: tldextract>=3.0
Provides-Extra: fast
Requires-Dist: fast-hdbscan>=0.2; extra == "fast"
Provides-Extra: splunk
Requires-Dist: splunk-sdk; extra == "splunk"
Provides-Extra: cloudtrail
Requires-Dist: boto3; extra == "cloudtrail"
Requires-Dist: botocore[crt]; extra == "cloudtrail"
Provides-Extra: all
Requires-Dist: loghunt[fast]; extra == "all"
Requires-Dist: loghunt[splunk]; extra == "all"
Requires-Dist: loghunt[cloudtrail]; extra == "all"
Dynamic: license-file

# LogHunter

LogHunter is a local-first command-line threat-hunting workbench for self-hosters. You
point it at the logs you already have — Zeek, Pi-hole/dnsmasq, syslog, CloudTrail — and it
tells you what's in them and runs transparent detectors over them: beaconing, suspicious
DNS, port scans, rare syslog events, abnormally long connections, and unusual CloudTrail
activity. Every run names the technique behind each detector, so you always know whether a
finding came from a published algorithm or an honest heuristic.

**Not a SIEM. Not an agent. Not magic.** Nothing to deploy, no database, no daemon, no
account. Install it, point it at a directory of logs, read the output. It runs on the
admin's own box, over logs at rest.

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](#license)
![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)

> **Status: early / pre-1.0 (`0.1.0.dev0`).** The six detectors below work and are
> covered by tests, but interfaces may still move before 1.0. Feedback welcome.

<!-- TODO(screenshots): a real terminal capture of `loghunter ~/zeek` and a `digest` card go here. -->

A run opens with a summary banner — what was loaded, and which technique each detector
used — then groups findings by detector (illustrative output; addresses are
[RFC 5737](https://datatracker.ietf.org/doc/html/rfc5737) documentation space):

```
LogHunter  ·  Threat Hunt
══════════════════════════════════════════════════════════════════════════════
Data found:  2026-05-31 00:00  →  2026-06-01 00:00  (24h)
Records:     1,284,402 conn.log  ·  318,221 dns.log  ·  44,019 *.log
Detectors:   beacon (FFT)  ·  dns (fast-HDBSCAN)  ·  syslog (drain3)  ·  scan [pattern]  ·  duration [heuristics]
══════════════════════════════════════════════════════════════════════════════

beacon — 2 findings · 1 H  1 M
────────────────────────────────────────────────────────────────────────────────
[H]  192.0.2.37 → 198.51.100.20:443/tcp     score 0.91   period 60.0s    1,440 conns
[M]  192.0.2.37 → 198.51.100.61:8443/tcp    score 0.74   period 300.0s     288 conns

dns — 1 finding · 1 M
────────────────────────────────────────────────────────────────────────────────
[M]  dga-lookups.example   entropy 3.91   14 subdomains   cluster -1 (noise)
```

The two-tier styling of the `Detectors:` line is deliberate: published techniques glow in
parentheses — `(FFT)`, `(HDBSCAN)`, `(drain3)` — while honest house methods are plain in
brackets — `[pattern]`, `[heuristics]`, `[statistical]`. The restraint is the point. A
heuristic is never dressed up as an algorithm, which is what makes the glow trustworthy.

## Quick start

```bash
pip install loghunt

# one-time, detection-driven setup — finds your logs and writes a config
loghunter init

# hunt across everything enabled in your config
loghunter

# or point at a directory / file directly
loghunter ~/zeek-logs
loghunter syslog /var/log

# orient before you hunt — a fast, factual profile of a single file
loghunter digest /var/log/messages
```

No config file is required to get started — `loghunter <path>` works against a directory or
a single file. `loghunter init` just makes it repeatable.

## Why use LogHunter?

- **It runs where your logs are.** No services, no database, no daemon, no agent to push.
  `pip install`, point it at a directory, get output. The only setup step that exists at all
  is `loghunter init`, and that only writes a config file.
- **Real methods, made visible.** Beaconing is found with an FFT over connection timing;
  DNS with HDBSCAN clustering over per-query behavior; rare syslog events with drain3
  log-templating plus rarity scoring; CloudTrail with a transparent per-principal z-score
  composite. Every run tells you which technique ran. You can read *why* something was
  surfaced — no black box.
- **Big-tent ingestion.** One tool reads Zeek (NDJSON *and* TSV, flat *or* date-partitioned
  directories), Pi-hole/dnsmasq, flat RFC 3164 syslog (Debian *and* RHEL/Fedora layouts),
  and CloudTrail. Rotation and `.gz`/`.bz2`/`.xz` compression are handled transparently.
- **Orient before you hunt.** `loghunter digest FILE` reads a log and reports facts about
  it — time span, top talkers, the shape of the mix — with zero verdicts. It's sonar, not a
  baggage scanner: it tells you what's there so you know where to point the detectors.
- **Filter before you analyze.** A flat-file allowlist suppresses known-good infrastructure
  *before* any detector sees the data, so your noise floor is yours to set and detectors
  never have to know the allowlist exists.
- **Honest output.** Findings carry a severity, the evidence behind the score, and (with
  `-v`/`-vv`) the analyst pivots to chase next. Machine formats (`json`, `csv`, `html`) are
  lossless; the terminal view is the one that summarizes.
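
To make "an FFT over connection timing" concrete, here is a minimal sketch of the general idea: bin connection start times into a fixed-width count series, take its spectrum, and read the period off the strongest frequency. This is not LogHunter's implementation. The function names, the one-second bins, the 1024-bin window, and the `strength` measure are all assumptions for illustration, and a small pure-Python radix-2 FFT stands in for a library FFT so the block has no dependencies.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_period(timestamps, bin_seconds=1.0, n_bins=1024):
    """Return (period_seconds, strength) for a list of connection start times.

    Hypothetical helper: strength is the share of spectral power that sits on
    the detected fundamental and its harmonics (0.0 to 1.0).
    """
    if len(timestamps) < 2:
        return None, 0.0
    start = min(timestamps)
    counts = [0.0] * n_bins
    for ts in timestamps:
        idx = int((ts - start) / bin_seconds)
        if idx < n_bins:                      # ignore anything past the window
            counts[idx] += 1.0
    mean = sum(counts) / n_bins
    spectrum = fft([c - mean for c in counts])   # subtract the mean to drop the DC term
    power = [abs(v) ** 2 for v in spectrum[1 : n_bins // 2]]  # bins 1..N/2-1
    total, peak = sum(power), max(power)
    if total == 0.0:
        return None, 0.0
    # A clean pulse train puts equal power on every harmonic, so take the
    # lowest bin that is close to the peak rather than the arg-max itself.
    k = next(i for i, p in enumerate(power) if p >= 0.9 * peak) + 1
    harmonics = sum(power[m * k - 1] for m in range(1, len(power) // k + 1))
    return n_bins * bin_seconds / k, harmonics / total
```

Calling `dominant_period` on sixteen connections spaced 64 seconds apart recovers a 64-second period with nearly all spectral power on its harmonics; irregular traffic spreads power across the spectrum and yields a lower strength. The `strength` value here is not the `score` LogHunter prints.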

## Why *not* use LogHunter?

- **It is not real-time and not a SIEM.** It runs over logs at rest, in batches. There's no
  streaming, no alerting pipeline, no live correlation across sources at scale. If you need an
  always-on detection platform, you need a SIEM; LogHunter is the workbench you reach for to
  *hunt*.
- **It is stateless between runs.** There's no persisted baseline and no rolling history.
  CloudTrail "first-seen" novelty, for example, is relative to the window you loaded — not to
  all of recorded time.
- **Detector coverage is v1.** Six detectors ship today (below). `auth`, `ssl`, `protocol`,
  and `weird` are planned but not built.
- **The richest network signal wants Zeek.** Pi-hole/dnsmasq gives you DNS only — no RTT,
  TTL, or connection correlation. LogHunter will tell you so and keep working, but Zeek is
  where it shines.
- **It surfaces, it doesn't block.** This is a tool for a human triaging behavior, not a
  signature IDS or an enforcement point.

## What it hunts

| Detector   | Surfaces                                        | Method                          | Source                                 |
|------------|-------------------------------------------------|---------------------------------|----------------------------------------|
| `beacon`   | periodic C2-style callbacks                     | FFT over connection timing      | Zeek `conn.log`                        |
| `dns`      | DGA / tunneling / anomalous lookups             | HDBSCAN clustering              | Zeek `dns.log` **or** Pi-hole          |
| `syslog`   | rare events & reboots                           | drain3 templating + rarity      | syslog (flat) **or** Zeek `syslog.log` |
| `scan`     | vertical / horizontal / block / slow port scans | pattern (heuristic)             | Zeek `conn.log`                        |
| `duration` | abnormally long-lived connections               | heuristics                      | Zeek `conn.log`                        |
| `aws`      | per-principal anomalous CloudTrail behavior     | statistical (z-score composite) | CloudTrail `*.json`                    |

`dns` and `syslog` each answer **one** question across **two** source families — Zeek and
Pi-hole for DNS, flat rsyslog and Zeek's own `syslog.log` for syslog — and adapt to whichever
fidelity they're handed.
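
As an illustration of what a "z-score composite" can look like, the sketch below scores each principal against its peers in the loaded window. It is an assumption-laden toy, not the `aws` detector: the feature names, the choice to compare across principals, the clipping of negative z-scores, and the plain average are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_composite(features):
    """Map {principal: {feature: value}} to {principal: composite score}.

    Each feature is z-scored across all principals; negative z-scores are
    clipped to zero so only "more than usual" contributes, and the clipped
    values are averaged into one number per principal.
    """
    names = sorted({name for row in features.values() for name in row})
    stats = {}
    for name in names:
        values = [row.get(name, 0.0) for row in features.values()]
        stats[name] = (mean(values), pstdev(values))

    scores = {}
    for principal, row in features.items():
        clipped = []
        for name in names:
            mu, sigma = stats[name]
            z = (row.get(name, 0.0) - mu) / sigma if sigma > 0 else 0.0
            clipped.append(max(z, 0.0))
        scores[principal] = sum(clipped) / len(clipped) if clipped else 0.0
    return scores
```

Because every term is a plain z-score, a reader can trace a high composite back to the specific features that drove it, which is the property the table's "statistical" label is pointing at.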

Run them all (`loghunter`), select some (`loghunter --detect=beacon,dns`), or exclude
(`loghunter --detect='all,!syslog'`). Each detector is also its own subcommand:
`loghunter beacon ~/zeek`.
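
The `--detect` grammar shown above (`all`, a comma-separated list, and `!name` exclusions) can be resolved in a few lines. This is a sketch of one plausible reading, not the CLI's parser: the function name is hypothetical, and rejecting unknown names with `ValueError` is an assumption.

```python
DETECTORS = ("beacon", "dns", "syslog", "scan", "duration", "aws")


def resolve_detectors(spec, available=DETECTORS):
    """Turn a spec like "all,!syslog" into an ordered list of detector names."""
    selected, excluded = [], set()
    for token in (part.strip() for part in spec.split(",")):
        if not token:
            continue
        negated = token.startswith("!")
        name = token[1:] if negated else token
        if name != "all" and name not in available:
            raise ValueError(f"unknown detector: {name!r}")
        if negated:
            excluded.add(name)
            continue
        for candidate in (available if name == "all" else (name,)):
            if candidate not in selected:
                selected.append(candidate)
    # Exclusions win regardless of where they appear in the spec.
    return [name for name in selected if name not in excluded]
```

With this reading, `resolve_detectors("all,!syslog")` keeps every detector except `syslog`, and `resolve_detectors("beacon,dns")` returns just those two in the order given.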

## How a run works

```
discover & parse  →  allowlist (suppress)  →  detect  →  render
```

Responsibilities don't bleed across those stage boundaries. The **loader** finds files, decompresses,
normalizes every connection source to one canonical schema, and absorbs storage variation
(TSV vs. NDJSON, flat vs. dated directories, rotation). The **allowlist** suppresses
known-good traffic *before* analysis. **Detectors** only analyze — they never open files,
read config, or suppress. **Output handlers** only render. The CLI is the one place that
turns an error into an actionable message and owns the exit code.

Because detectors are pure analysis, every one is importable and callable as an ordinary
Python function — useful in a notebook when you want to experiment.
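
The separation of stages can be sketched with plain functions. Everything named here is hypothetical (`long_connections`, `run`, the record keys); it does not reflect LogHunter's real module layout or signatures, only the shape of "suppress first, then detectors that do nothing but analyze, then render".

```python
def long_connections(records, threshold_s=3600.0):
    """Toy detector: a pure function from records to findings, with no I/O."""
    return [
        {"detector": "duration", "src": r["src"], "dst": r["dst"],
         "duration": r["duration"]}
        for r in records
        if r["duration"] >= threshold_s
    ]


def run(records, is_allowlisted, detectors, render):
    """discover & parse (already done) -> allowlist -> detect -> render."""
    kept = [r for r in records if not is_allowlisted(r)]      # suppress first
    findings = [f for detect in detectors for f in detect(kept)]
    return render(findings)                                   # rendering is separate
```

Because the toy detector takes a list and returns a list, it can be called directly in a notebook with hand-built records, which is the property the paragraph above describes.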

### Analysis window

Pointed at a **directory**, an unqualified run looks back over the last `default_window`
(`1d` out of the box) of *that source's own* data — the right default for a live log dir
you don't want to read in full every time. Pointed at a **single file**, it reads the whole
file. Override either way:

```bash
loghunter --since=7d ~/zeek            # last 7 days
loghunter --since=2026-05-01 --until=2026-05-08 ~/zeek
loghunter --days=2-4 ~/zeek            # 2 to 4 days ago
loghunter --all ~/zeek                 # the entire archive
```

CloudTrail is the one source that opts out of the default window — novelty detection needs
full history, so it always loads in full unless you narrow it explicitly.
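
The window arithmetic behind `--since=7d` and `--days=2-4` can be sketched as follows. The helper names are hypothetical, the `m`/`h`/`w` suffixes are an assumption (only `d` appears above), and anchoring the window at the newest record in the source rather than at wall-clock time is one reading of "that source's own data".

```python
import re
from datetime import datetime, timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_span(text):
    """Parse a relative span such as "7d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhdw])", text.strip())
    if match is None:
        raise ValueError(f"not a relative span: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def window_since(span, newest):
    """(start, end) covering the last `span` of data ending at `newest`."""
    return newest - parse_span(span), newest


def window_days(day_range, newest):
    """(start, end) for a range like "2-4", meaning 2 to 4 days before `newest`."""
    near, far = sorted(int(part) for part in day_range.split("-"))
    return newest - timedelta(days=far), newest - timedelta(days=near)
```

For a source whose newest record is `datetime(2026, 6, 1)`, `window_since("7d", ...)` starts on 25 May and `window_days("2-4", ...)` spans 28 May to 30 May.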

## Orient before the hunt: `digest`

```bash
loghunter digest /var/log/messages
loghunter digest conn.log dns.log         # several files → several cards
```

`digest` content-sniffs each file, routes it to the right summarizer (conn, dns, syslog,
cloudtrail), and falls back to a fast byte-profiler — **blob** — for anything it doesn't
recognize. A card is flush-left and factual: the file's time window, line count and size, a
scale-anchored histogram, and a handful of plain-language insights ("one client accounts for
71% of queries"). It states facts and superlatives, never verdicts — no "suspicious," no
"anomalous." It reads your data *before* the allowlist, because everything in the file,
allowlisted or not, is part of "what's in here." The blob profiler is bounded: it samples a
big file rather than reading it in full, so the cost of profiling a one-gigabyte mystery file
is capped by the sample size, not the file size.
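
A bounded profiler of this kind can be sketched by reading a fixed number of fixed-size chunks spread across the file. The function names, chunk size, chunk count, and the particular facts reported are assumptions for illustration, not the real **blob** profiler.

```python
import os


def sample_file(path, chunk_size=65536, n_chunks=3):
    """Read at most chunk_size * n_chunks bytes, spread evenly across the file."""
    n_chunks = max(n_chunks, 2)
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        if size <= chunk_size * n_chunks:
            return fh.read()                  # small file: just read it all
        step = (size - chunk_size) // (n_chunks - 1)
        parts = []
        for i in range(n_chunks):
            fh.seek(i * step)                 # head, evenly spaced middles, tail
            parts.append(fh.read(chunk_size))
    return b"".join(parts)


def byte_profile(sample):
    """Report plain facts about a byte sample; no verdicts."""
    printable = sum(1 for b in sample if 32 <= b < 127 or b in (9, 10, 13))
    return {
        "bytes_sampled": len(sample),
        "printable_ratio": printable / len(sample) if sample else 0.0,
        "newlines": sample.count(b"\n"),
    }
```

The work done is proportional to `chunk_size * n_chunks`, so it stays constant once a file is larger than the sample budget.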

## Installation

LogHunter is published on PyPI as **`loghunt`** (the command, import package, and config
section are all `loghunter`).

```bash
pip install loghunt                 # core
pip install 'loghunt[fast]'         # fast-hdbscan accelerator for DNS clustering
pip install 'loghunt[splunk]'       # Splunk exporter
pip install 'loghunt[cloudtrail]'   # CloudTrail (S3) exporter
pip install 'loghunt[all]'          # everything above
```

Requires **Python 3.11+**. A bare `pip install loghunt` always works — the DNS clustering
runs on stock `hdbscan` (a base dependency); `[fast]` swaps in a numba-accelerated backend
when you want it, and the tool tells you which one is active on every run.

From source:

```bash
git clone https://github.com/spiralbend/loghunter
cd loghunter
pip install -e '.[all]'
```

## Configuration

Configuration is optional — LogHunter runs against a path with none. When you want it
repeatable, `loghunter init` looks at the conventional locations on your box, profiles what
it finds (which log families, rough size, freshness — without reading a single log line),
and writes a fully-annotated `~/.loghunter/config.toml`. It never clobbers settings you
already have.

Config is loaded from the first of:

1. `--config=FILE`
2. `~/.loghunter/config.toml`
3. `/etc/loghunter/config.toml`
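
The lookup order above amounts to "first match wins". A minimal sketch, with a hypothetical function name and two assumptions worth flagging: an explicit `--config` path that does not exist is treated as an error rather than skipped, and finding no file at all returns `None` so the caller can fall back to defaults.

```python
from pathlib import Path


def find_config(cli_path=None, home=None, system="/etc/loghunter/config.toml"):
    """Return the config file to load, or None when there is none."""
    if cli_path is not None:
        path = Path(cli_path)
        if not path.is_file():                # assumption: explicit path must exist
            raise FileNotFoundError(f"--config file not found: {path}")
        return path
    home = Path.home() if home is None else Path(home)
    for candidate in (home / ".loghunter" / "config.toml", Path(system)):
        if candidate.is_file():
            return candidate
    return None                               # no config: run on built-in defaults
```

The `home` and `system` parameters exist only to make the sketch testable without touching the real home directory or `/etc`.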

Everything LogHunter owns lives under the hidden `~/.loghunter/` — config, allowlists,
exports, reports — so it can't collide with a project directory. A trimmed example:

```toml
[loghunter]
detect     = "all"                 # "all" | "dns,beacon" | "all,!syslog"
zeek_dir   = "/var/log/zeek"
syslog_dir = "/var/log"
# pihole_dir     = "/var/log/pihole"
# cloudtrail_dir = "/var/log/cloudtrail"

home_net       = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
default_window = "1d"              # lookback for a directory; "" or "all" = full
output_format  = "text"           # text | json | csv | html
```

Findings print to your terminal by default, which keeps the output pipeable. Set `report_dir` (or pass
`--out=PATH`) to write report files instead. Every tunable a detector exposes is documented
as a commented "engine room" at the bottom of the generated config; you rarely need it, and
`loghunter <detector> --help` lists the full surface.

## Log sources it speaks

- **Zeek** — `conn.log`, `dns.log`, `syslog.log`, in NDJSON or TSV, from a flat directory or
  date-partitioned subdirectories. Rotation and gzip/bzip2/xz compression are transparent.
- **Pi-hole / dnsmasq** — DNS event logs, aggregated per domain for clustering.
- **syslog** — flat RFC 3164. Discovery is content-sniffed, not filename-matched, so it
  handles both the Debian convention (`syslog`, `auth.log`, `kern.log`) and the RHEL/Fedora
  one (extensionless `messages`, `secure`, `maillog`) — and won't mistake `dnf.log` or a
  binary like `wtmp` for a log stream.
- **CloudTrail** — gzipped JSON event records, read locally or pulled from S3 (below).
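
Content-sniffing for syslog can be sketched as "decompress if needed, read the first few kilobytes, and check whether most lines start with an RFC 3164 timestamp". The helper names, the 8 KiB head, the 80% threshold, and choosing the decompressor by file suffix are assumptions for illustration; the real loader may decide differently.

```python
import bz2
import gzip
import lzma
import re
from pathlib import Path

_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

# "Mmm dd hh:mm:ss host ", with a space-padded day and an optional <PRI> prefix.
_RFC3164 = re.compile(
    rb"^(?:<\d{1,3}>)?"
    rb"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    rb" [ \d]\d \d{2}:\d{2}:\d{2} \S+ "
)


def read_head(path, n_bytes=8192):
    """Read the first n_bytes of a file, decompressing by suffix when needed."""
    opener = _OPENERS.get(Path(path).suffix, open)
    with opener(path, "rb") as fh:
        return fh.read(n_bytes)


def looks_like_rfc3164(head, min_ratio=0.8):
    """True when most non-empty lines in `head` start like RFC 3164 syslog."""
    if b"\x00" in head:                       # NUL bytes: treat as binary (e.g. wtmp)
        return False
    lines = [line for line in head.splitlines() if line.strip()]
    if not lines:
        return False
    hits = sum(1 for line in lines if _RFC3164.match(line))
    return hits / len(lines) >= min_ratio
```

A file whose lines begin with ISO 8601 timestamps fails this check regardless of its name, while an extensionless `messages` file passes on content alone.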

## The allowlist

Two kinds of allowlist file, never conflated:

- **Flat files = suppression.** One rule per line — an IP, a CIDR, a `:port/proto`, or a
  domain glob/regex. Matching traffic is dropped before any detector runs. LogHunter ships a
  curated domain list and never ships numeric connection suppressions (those depend on your
  hosts, and shipping them could hide real findings).
- **TOML stanzas = classification.** When a detector needs to know *what* something is
  (a nameserver, a backup client) rather than whether to drop it.

A bare host IP with no port suppresses *all* traffic involving that host — powerful, and
called out as such wherever it appears.
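
The flat-file rule kinds can be sketched with the standard library. This is an illustration under stated assumptions, not the shipped parser: the function names and record keys are hypothetical, `#` comments are assumed, a rule starting with `:` is read as a standalone port/protocol match, and only glob-style domain rules are handled (regex rules are left out).

```python
import ipaddress
from fnmatch import fnmatch


def parse_rules(lines):
    """Split flat-file rules into (networks, port rules, domain globs)."""
    nets, ports, globs = [], set(), []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # assumption: '#' starts a comment
        if not line:
            continue
        if line.startswith(":"):              # ":123/udp"
            port, _, proto = line[1:].partition("/")
            ports.add((int(port), proto.lower()))
            continue
        try:
            # A bare IP becomes a single-host network (/32 or /128).
            nets.append(ipaddress.ip_network(line, strict=False))
        except ValueError:
            globs.append(line.lower())        # not an address: treat as domain glob
    return nets, ports, globs


def conn_suppressed(conn, nets, ports):
    """True when a connection record matches any numeric rule."""
    if (conn["port"], conn["proto"]) in ports:
        return True
    for field in ("src", "dst"):              # a host rule matches either side
        addr = ipaddress.ip_address(conn[field])
        if any(addr in net for net in nets):
            return True
    return False


def domain_suppressed(name, globs):
    """True when a domain matches any glob rule."""
    return any(fnmatch(name.lower(), pattern) for pattern in globs)
```

Note how the bare-IP case falls out of the code: a single-host network with no port attached matches every connection that has that address on either side, which is why such rules deserve the warning above.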

## Pulling logs in: exporters

LogHunter can fetch logs from external systems to local files, which it then analyzes like
any other source — the syslog detector can't tell whether the data came from rsyslog or a
Splunk export.

```bash
loghunter export            # run the configured "default" query
loghunter export auth       # run a named query
```

- **Splunk** — named SPL queries under `[export.splunk.query.<name>]`. Prefer the
  `LOGHUNTER_SPLUNK_USER` / `LOGHUNTER_SPLUNK_PASS` environment variables over plaintext
  credentials in config.
- **CloudTrail** — pulls gzipped JSON from an S3 prefix. AWS authentication is *not* handled
  here: you authenticate your shell, and boto3 resolves the ambient credential chain.
  LogHunter never reads, stores, or prompts for AWS credentials, and warns before a large
  egress.

## Output formats

`text` (default, grouped and summarized), `json` (one finding per line, pipeable), `csv`
(flattened), and `html` (a self-contained file). Pass `--output=json` or set `output_format`
in config. `-v` adds the curated "why it scored" detail; `-vv` adds raw debug — template
strings, cluster membership, full evidence. Color is enhancement-only and TTY-gated: piped
or redirected output is always plain, and the machine formats never emit an escape code.
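
Because `json` output is one finding per line, it can be consumed with a short script. The sketch below tallies findings by a chosen key; the function name is hypothetical and the field names `detector` and `severity` are assumptions, since the exact JSON schema is not documented here.

```python
import json
from collections import Counter


def count_by(lines, key):
    """Count JSON-lines findings by one field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        finding = json.loads(line)
        counts[finding.get(key, "unknown")] += 1
    return counts


# Usage sketch (field name assumed): save the function in a script that calls
# count_by(sys.stdin, "severity"), then run
#   loghunter --output=json ~/zeek-logs | python count_findings.py
```

Any iterable of strings works as input, so the same function handles a saved report file or a pipe from the CLI.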

## Building from source & running tests

```bash
git clone https://github.com/spiralbend/loghunter
cd loghunter
pip install -e '.[all]'
python -m pytest
```

`main` is kept runnable. Architecture tests cover the boundaries that matter — detector
discovery, run planning, loader metadata, allowlist suppression, output registration, and
CLI error formatting.

## License

LogHunter is licensed under the [MIT License](LICENSE).
