Metadata-Version: 2.4
Name: task-checkpoint
Version: 0.1.1
Summary: A small Python library for resumable task pools in long-running scripts.
Author: 1nvisibleCat
License-Expression: MIT
Keywords: checkpoint,task-checkpoint,task-pool,resume,batch-processing,job-queue
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# Task Pool

Task Pool is a small Python library for long-running scripts that need to process many small jobs without losing progress.

Turn a fragile `for` loop into a resumable task pool.

When a run stops halfway, Task Pool remembers what is done, what is running, and what still needs work. You can restart the script and continue from the last known point.

Use it for crawling, experiments, data analysis, batch API calls, file processing, and other work where unfinished items should be easy to resume.

It feels like a tiny job queue, but stays inside your Python script. For local use, no database server, message queue, or service setup is needed.

## Quickstart

Install it with:

```bash
pip install task-checkpoint
```

Put one URL per line:

```text
https://example.com/a
https://example.com/b
https://example.com/c
```

Then process the file as a task pool:

```python
from task_pool import TaskPool


pool = TaskPool("urls.txt")

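def fetch_one(payload):
    # payload is a dict; for text files the line content is under "line".
    # fetch_url stands in for your own download function.
    fetch_url(payload["line"])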


pool.for_each(fetch_one)
```

You can also use a regular loop:

```python
for task in pool:
    fetch_url(task["line"])
```

If you need task metadata, use `iter_with_metadata()`:

```python
for task in pool.iter_with_metadata():
    url = task.payload["line"]
    result = fetch_url(url)
    save_result(task.key, result)
```

Each task is committed as soon as it completes. If an exception is raised, the current task is rolled back to `not_start` and the exception is re-raised.

For very small scripts, a lambda also works:

```python
pool.for_each(lambda payload: fetch_url(payload["line"]))
```

You can print a small progress summary:

```python
print(pool.stats())
# {"total": 10, "not_start": 3, "pending": 0, "committed": 7}
```
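
To make the shape of that summary concrete, here is a minimal standalone sketch that builds the same kind of dictionary from a plain list of status strings. It does not use Task Pool itself; `summarize` and `example` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# The three statuses described in this README.
STATUSES = ("not_start", "pending", "committed")


def summarize(statuses):
    """Count tasks per status (illustrative helper, not part of Task Pool)."""
    counts = Counter(statuses)
    summary = {"total": len(statuses)}
    for name in STATUSES:
        # Report a status even when no task currently has it.
        summary[name] = counts.get(name, 0)
    return summary


example = ["committed"] * 7 + ["not_start"] * 3
print(summarize(example))
```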

For structured jobs, use a JSON task pool:

```python
pool = TaskPool("tasks.json")

pool.append({"input_path": "data/a.json", "method": "baseline"})
pool.append({"input_path": "data/b.json", "method": "baseline"})
```

## Manual Control

Use `lease()` when you want one task at a time and need explicit control.

```python
pool = TaskPool("tasks.json")

with pool.lease() as task:
    # task is None when there is nothing left to process.
    if task is not None:
        do_work(task.payload)
```

If there is no task to process, the `with` block receives `None` as `task`.

Leaving the `with` block normally commits the task. If an exception is raised, the task is rolled back automatically.
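
The commit-or-roll-back behaviour can be pictured with a small context manager over an in-memory dictionary. This is only a sketch of the semantics described above, not the library's implementation; the `lease` function and `statuses` dictionary here are hypothetical stand-ins.

```python
from contextlib import contextmanager


@contextmanager
def lease(statuses):
    """Yield the key of the next not_start task, or None if there is none."""
    key = next((k for k, s in statuses.items() if s == "not_start"), None)
    if key is None:
        yield None
        return
    statuses[key] = "pending"
    try:
        yield key
    except BaseException:
        # The body raised: put the task back and let the error propagate.
        statuses[key] = "not_start"
        raise
    else:
        # The body finished normally: mark the task as done.
        statuses[key] = "committed"


statuses = {"a": "not_start", "b": "not_start"}

with lease(statuses) as key:
    pass  # pretend the work on "a" succeeded

try:
    with lease(statuses) as key:
        raise RuntimeError("work on 'b' failed")
except RuntimeError:
    pass

print(statuses)  # {'a': 'committed', 'b': 'not_start'}
```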

## Store Selection

You can choose a store by file suffix:

```python
TaskPool("tasks.json")      # JSONFileStore
TaskPool("tasks.sqlite")    # SQLiteStore
TaskPool("tasks.sqlite3")   # SQLiteStore
TaskPool("tasks.db")        # SQLiteStore
TaskPool("rows.csv")        # CSVRowStore with sidecar state
TaskPool("rows.tsv")        # CSVRowStore with sidecar state
TaskPool("urls.txt")        # TextLineStore with sidecar state
```
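
The suffix table above amounts to a simple lookup. The sketch below restates it as a dictionary. The store names come from the table, while `store_name_for`, the case-insensitive matching, and the `ValueError` for unknown suffixes are assumptions made for illustration and may differ from what the library does.

```python
from pathlib import Path

# Suffix-to-store table copied from the list above.
SUFFIX_TO_STORE = {
    ".json": "JSONFileStore",
    ".sqlite": "SQLiteStore",
    ".sqlite3": "SQLiteStore",
    ".db": "SQLiteStore",
    ".csv": "CSVRowStore",
    ".tsv": "CSVRowStore",
    ".txt": "TextLineStore",
}


def store_name_for(path):
    """Return the store name for a path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()  # assumption: case-insensitive match
    if suffix not in SUFFIX_TO_STORE:
        raise ValueError(f"no store known for suffix {suffix!r}")
    return SUFFIX_TO_STORE[suffix]


print(store_name_for("tasks.sqlite3"))  # SQLiteStore
```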

You can also pass a store explicitly:

```python
from task_pool import JSONFileStore, SQLiteStore, TaskPool


pool = TaskPool(JSONFileStore("tasks.json"))
pool = TaskPool(SQLiteStore("tasks.db"))
```

## Source Files

Task Pool can also read tasks from source files.

### Text files

```python
pool = TaskPool("urls.txt")
```

Example:

```text
https://example.com/a
https://example.com/b
https://example.com/c
```

Each non-empty line becomes a task:

```python
{
    "line": "https://example.com/a",
    "line_number": 1,
}
```
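
As a rough picture of that mapping, the sketch below turns lines into payload dictionaries without using Task Pool. It assumes that surrounding whitespace is stripped and that `line_number` counts physical lines, blank ones included; the README does not state either detail, and `payloads_from_lines` is a hypothetical name.

```python
def payloads_from_lines(lines):
    """Build one payload per non-empty line (illustrative only)."""
    payloads = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()  # assumption: whitespace and newlines are dropped
        if not line:
            continue  # blank lines do not become tasks
        payloads.append({"line": line, "line_number": line_number})
    return payloads


sample = ["https://example.com/a\n", "\n", "https://example.com/b\n"]
for payload in payloads_from_lines(sample):
    print(payload)
```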

### CSV and TSV files

```python
pool = TaskPool("rows.csv")
pool = TaskPool("rows.tsv")
```

By default, CSV and TSV files are read with a header row:

```csv
code,name
A0001,Alpha
A0002,Beta
```

The first row becomes:

```python
{"code": "A0001", "name": "Alpha"}
```

For files without headers:

```python
pool = TaskPool.csv("rows.csv", has_header=False)
pool = TaskPool.tsv("rows.tsv", has_header=False)
```

Columns are named `col1`, `col2`, and so on, so the first row becomes:

```python
{"col1": "A0001", "col2": "Alpha"}
```
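
The same naming scheme can be reproduced with the standard `csv` module. This sketch only illustrates how headerless rows map to `col1`, `col2`, and so on; `rows_without_header` is a hypothetical helper, not a Task Pool function.

```python
import csv
import io


def rows_without_header(text, delimiter=","):
    """Parse headerless CSV/TSV text into dicts keyed col1, col2, ..."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        {f"col{i}": value for i, value in enumerate(row, start=1)}
        for row in reader
        if row  # skip completely empty lines
    ]


print(rows_without_header("A0001,Alpha\nA0002,Beta\n"))
print(rows_without_header("A0001\tAlpha\n", delimiter="\t"))
```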

Source files are not modified. Progress is stored in a sidecar JSON file, for example:

```text
rows.csv.task_pool.json
urls.txt.task_pool.json
```
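
Judging from those examples, the sidecar name is the source file name with `.task_pool.json` appended. A minimal sketch of that rule follows; it assumes the sidecar sits in the same directory as the source file, which the README does not state, and `sidecar_path` is a hypothetical helper.

```python
from pathlib import Path


def sidecar_path(source):
    """Return the assumed sidecar location for a source file."""
    source = Path(source)
    # Append the suffix to the full file name, keeping the original extension.
    return source.with_name(source.name + ".task_pool.json")


print(sidecar_path("rows.csv").name)  # rows.csv.task_pool.json
print(sidecar_path("urls.txt").name)  # urls.txt.task_pool.json
```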

## Task Status

Each task has one of these statuses:

- `not_start`
- `pending`
- `committed`

The common flow is:

```text
not_start -> pending -> committed
                |
                +-> not_start   (rollback)
```

`pending` means a worker has leased the task and is currently processing it.
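
The flow above can be written down as a small transition table. This is a sketch of the common flow only. Treating `committed` as final is an assumption, and `ALLOWED` and `move` are hypothetical names rather than part of the library.

```python
# Allowed moves in the common flow shown above.
ALLOWED = {
    "not_start": {"pending"},               # a worker leases the task
    "pending": {"committed", "not_start"},  # success, or rollback
    "committed": set(),                     # assumption: no further moves
}


def move(current, new):
    """Return the new status, or raise if the move is not in the flow."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current!r} to {new!r}")
    return new


status = "not_start"
status = move(status, "pending")
status = move(status, "not_start")  # rolled back after a failure
status = move(status, "pending")
status = move(status, "committed")
print(status)  # committed
```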

## Notes

- The default store is `JSONFileStore("task_pool.json")`.
- Payloads are stored as JSON.
- JSON and source-backed stores use file locks and atomic writes.
- SQLite is better for larger pools or heavier concurrent use.
- Source-backed stores keep the source file unchanged and write progress to a sidecar JSON file.
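
For readers unfamiliar with atomic writes, the usual pattern is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that general technique with the standard library. It is not taken from Task Pool's source, it leaves out file locking, and `write_json_atomically` is a hypothetical name.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces the target in one step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "state.json")
    write_json_atomically(target, {"a": "committed"})
    with open(target, encoding="utf-8") as handle:
        print(json.load(handle))  # {'a': 'committed'}
```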
