Metadata-Version: 2.4
Name: ssh-discovery
Version: 0.1.1
Summary: Discovery library for remote log workflows over SSH/SFTP
Author: Timur Anvar
License: MIT
Project-URL: Homepage, https://github.com/DreamyStranger/ssh-discovery
Project-URL: Repository, https://github.com/DreamyStranger/ssh-discovery
Project-URL: Issues, https://github.com/DreamyStranger/ssh-discovery/issues
Keywords: ssh,sftp,log-discovery,sqlite,paramiko
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System :: Logging
Classifier: Topic :: System :: Networking
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: paramiko>=3.4.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.6.0; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"
Requires-Dist: types-paramiko>=3.4.0; extra == "dev"
Dynamic: license-file

# ssh-discovery

[![CI](https://github.com/DreamyStranger/ssh-discovery/actions/workflows/ci.yml/badge.svg?branch=master)](https://github.com/DreamyStranger/ssh-discovery/actions/workflows/ci.yml?query=branch%3Amaster)
[![PyPI version](https://img.shields.io/pypi/v/ssh-discovery.svg)](https://pypi.org/project/ssh-discovery/)
[![Tested with pytest](https://img.shields.io/badge/tested%20with-pytest-0A9EDC.svg)](https://pytest.org/)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

Python library for discovering remote log directories over SSH/SFTP and recording
their metadata in a shared SQLite manifest.

## Overview

`ssh-discovery` connects to a remote host over SSH/SFTP, scans a directory for new
log subdirectories, and records their metadata in a shared SQLite manifest. It does
not download or parse file contents; those are the responsibilities of
separate components.

This is a library, not a standalone application. The consuming application is
responsible for:

- Scheduling (Task Scheduler, APScheduler, cron, etc.)
- Logging configuration (handlers, formatters, levels)
- Config object construction

---

## Architecture

```text
ssh_discovery/
|- __init__.py       Public API re-exports
|- service.py        DiscoveryService - main entry point
|- config.py         Typed config dataclasses
|- models.py         Shared domain models
|- discovery/        Anchor logic and file filtering
|- transport/        SSH/SFTP connectivity and remote file listing
|- persistence/      SQLite schema, connection lifecycle, repository
|- cleanup/          Retention policy enforcement
\- common/           Shared errors, datetime helpers, path utilities
```

### Data flow per `DiscoveryService.run()` call

1. Open SQLite database and apply schema.
2. Read current state and derive the anchor from the newest `(mtime, filename)` pair.
3. Open SSH, open SFTP, list the remote directory, then close SFTP and SSH.
4. Filter out folders that are not newer than the anchor or are already known.
5. Insert new file metadata rows with `INSERT OR IGNORE`.
6. Clean up rows whose `parse_status` is `done` and whose `parsed_at` is older than
   `parsed_row_retention_days`. The anchor is never deleted.
7. Return a `DiscoveryResult`.
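
Steps 5 and 6 can be sketched with the standard `sqlite3` module. This is an illustration under stated assumptions, not the package's actual code: the table is reduced to the columns involved, the helper names `insert_new` and `cleanup_parsed` are hypothetical, and timestamps are assumed to be stored as UTC ISO-8601 strings in one consistent format so that string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Reduced version of the discovered_files table, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS discovered_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT UNIQUE NOT NULL,
    mtime TEXT NOT NULL,
    parse_status TEXT NOT NULL DEFAULT 'pending',
    parsed_at TEXT
)
"""


def insert_new(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> int:
    """Insert (filename, mtime) pairs; rows with a known filename are ignored."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO discovered_files (filename, mtime) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    # total_changes only counts rows that were actually inserted.
    return conn.total_changes - before


def cleanup_parsed(conn: sqlite3.Connection, retention_days: int, now: datetime) -> int:
    """Delete parsed rows past retention, always keeping the anchor row."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        """
        DELETE FROM discovered_files
        WHERE parse_status = 'done'
          AND parsed_at < ?
          AND id != (
              SELECT id FROM discovered_files
              ORDER BY mtime DESC, filename DESC
              LIMIT 1
          )
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Because `filename` is `UNIQUE`, calling `insert_new` twice with the same rows inserts nothing the second time, which is what makes repeated runs idempotent.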

### The anchor

The anchor is the row with the newest remote `mtime`. If multiple rows share the
same `mtime`, the lexicographically largest folder name wins as a deterministic
tiebreaker. Because ordering relies on `mtime` rather than on names, discovery works
even when folder names use a format that does not sort chronologically, such as
`MM-DD-YYYY`.

---

## Project structure

```text
ssh-discovery/
|- src/
|  \- ssh_discovery/      The installable package
|- tests/
|  |- unit/               Isolated unit tests
|  \- integration/        Tests using a real temp SQLite DB
|- pyproject.toml
\- .github/workflows/ci.yml
```

The SQLite database file is external runtime state. Its path is set in
`DatabaseConfig.path`. It is shared with other components and should not live in the
source tree.

---

## Installation

```bash
pip install ssh-discovery
```

Or from source:

```bash
git clone https://github.com/DreamyStranger/ssh-discovery
cd ssh-discovery
pip install -e ".[dev]"
```

Requirements: Python 3.11+ and Paramiko 3.4.0 or newer.

---

## Usage

```python
from ssh_discovery import (
    DatabaseConfig,
    DiscoveryConfig,
    DiscoveryService,
    SshConfig,
)

config = DiscoveryConfig(
    ssh=SshConfig(
        host="192.168.1.100",
        port=22,
        username="logsync",
        private_key_path="/path/to/id_ed25519",
        password=None,  # or key passphrase for encrypted private keys
        connect_timeout_seconds=10.0,
        keepalive_seconds=30,
    ),
    database=DatabaseConfig(
        path="/path/to/ssh-discovery.db",
        busy_timeout_ms=10_000,
    ),
    remote_log_dir="/var/log/mylogs",
    file_glob="app-*",
    parsed_row_retention_days=1,
)

service = DiscoveryService(config)
result = service.run()

print(f"Discovered: {result.discovered_count}")
print(f"Skipped: {result.skipped_count}")
print(f"Cleaned up: {result.cleaned_count}")
```

### Error handling

```python
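import logging

from ssh_discovery import PersistenceError, SshDiscoveryError, TransportError

logger = logging.getLogger(__name__)

# `service` is the DiscoveryService instance built in the usage example above.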

try:
    result = service.run()
except TransportError as exc:
    logger.error("Transport failure: %s", exc)
except PersistenceError as exc:
    logger.error("Database failure: %s", exc)
except SshDiscoveryError as exc:
    logger.error("Discovery failure: %s", exc)
```

### Logging

This package uses standard Python module loggers and does not configure handlers or
formatters. Configure logging in your application before calling `service.run()`.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
```
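
To raise verbosity for this package only, one option is to set the level on the package logger. The sketch below assumes the module loggers are named after their modules and therefore live under `ssh_discovery`; the helper name `enable_package_debug` is hypothetical.

```python
import logging


def enable_package_debug(package: str = "ssh_discovery") -> logging.Logger:
    """Set DEBUG on one package's logger without changing the root level."""
    pkg_logger = logging.getLogger(package)
    pkg_logger.setLevel(logging.DEBUG)
    return pkg_logger
```

Child loggers such as `ssh_discovery.service` inherit this level, and their records still propagate to whatever handlers the application attached to the root logger.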

---

## Configuration reference

### `SshConfig`

| Field | Type | Default | Description |
|---|---|---|---|
| `host` | `str` | - | IP or hostname of the remote host |
| `port` | `int` | `22` | SSH port |
| `username` | `str` | `"logsync"` | SSH username |
| `private_key_path` | `str \| None` | `None` | Path to private key file |
| `password` | `str \| None` | `None` | Password auth or private-key passphrase |
| `connect_timeout_seconds` | `float` | `30.0` | TCP, banner, and auth timeout |
| `keepalive_seconds` | `int` | `60` | SSH keepalive interval, `0` disables |
| `known_hosts_path` | `str \| None` | `None` | Known-hosts path for strict verification |
| `allow_unknown_hosts` | `bool` | `False` | Accept unknown host keys automatically |

At least one of `private_key_path` or `password` is required.

Unknown SSH hosts are rejected by default. Set `allow_unknown_hosts=True`
only in controlled environments where trust-on-first-use is acceptable.

### `DatabaseConfig`

| Field | Type | Default | Description |
|---|---|---|---|
| `path` | `str` | - | Absolute path to the SQLite database file |
| `busy_timeout_ms` | `int` | `5000` | SQLite busy timeout in milliseconds |

### `DiscoveryConfig`

| Field | Type | Default | Description |
|---|---|---|---|
| `ssh` | `SshConfig` | - | SSH connection settings |
| `database` | `DatabaseConfig` | - | SQLite settings |
| `remote_log_dir` | `str` | - | Remote directory to scan |
| `file_glob` | `str` | `"*"` | Glob pattern for matching subdirectory names |
| `parsed_row_retention_days` | `int` | `1` | Delete parsed rows older than N days |
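
To preview what a `file_glob` value would select, the standard `fnmatch` module can be used. This assumes the package applies shell-style, case-sensitive matching to bare subdirectory names, which the table above does not spell out; `match_names` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def match_names(names: list[str], pattern: str = "*") -> list[str]:
    """Return the names that match a shell-style glob, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]
```

For example, `match_names(["app-2024", "sys-1"], "app-*")` keeps only `app-2024`.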

---

## Running tests

```bash
pytest
pytest --cov=ssh_discovery --cov-report=term-missing
```

---

## Database schema

One table: `discovered_files`.

| Column | Type | Notes |
|---|---|---|
| `id` | `INTEGER PK` | Auto-increment |
| `filename` | `TEXT UNIQUE` | Bare folder name, unique id and anchor tiebreaker |
| `remote_path` | `TEXT` | Full remote path |
| `mtime` | `TEXT` | UTC ISO-8601 last-modified time |
| `discovered_at` | `TEXT` | UTC ISO-8601 insertion time |
| `download_status` | `TEXT` | Default `'pending'` |
| `parse_status` | `TEXT` | Default `'pending'` |
| `parsed_at` | `TEXT` | Nullable, set by parser |
| `last_error` | `TEXT` | Nullable error text |
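
The timestamp columns hold UTC ISO-8601 text. A sketch of producing such a value from an epoch `mtime` (for instance the `st_mtime` attribute that SFTP listings report) follows; the exact string format the package writes is an assumption, and `epoch_to_utc_iso` is a hypothetical helper.

```python
from datetime import datetime, timezone


def epoch_to_utc_iso(epoch_seconds: int) -> str:
    """Convert seconds since the Unix epoch to a UTC ISO-8601 string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()
```

Strings produced this way share one fixed layout, so sorting them as text gives the same order as sorting the underlying times.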

WAL journal mode and a configurable busy timeout are applied on every connection
open.
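
Another component sharing the database could apply the same settings when it opens its own connection. This is a sketch using the standard `sqlite3` module; `open_connection` is a hypothetical helper, not part of the package's API.

```python
import sqlite3


def open_connection(path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a SQLite connection with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(path)
    # WAL is persistent per database file, but re-applying it is harmless.
    conn.execute("PRAGMA journal_mode=WAL")
    # PRAGMA values cannot be bound as parameters, so format a validated int.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn
```

WAL requires a file-backed database; an in-memory database keeps its own journal mode.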

---

## Notes

- Idempotent: re-running discovery against the same remote state inserts nothing.
- Anchor ordering is based on remote `mtime`, with filename as a tiebreaker.
- Concurrent access: WAL mode lets other components read while a write is in progress; writes are still serialized, and a blocked writer waits up to `busy_timeout_ms`.
- No file downloads: this package records metadata only.
- Directory size is not stored.
- SSH keys are preferred over passwords for production deployments.
- Unknown SSH host keys are rejected by default unless explicitly allowed.
