Metadata-Version: 2.4
Name: deja-reve
Version: 0.1.0
Summary: Video Memory Layer: retrieval-augmented memory for video-native AI agents
Project-URL: Homepage, https://github.com/Natai-AI/deja-reve
Project-URL: Repository, https://github.com/Natai-AI/deja-reve
Project-URL: Issues, https://github.com/Natai-AI/deja-reve/issues
Author-email: Mir Sakib <sakib@natai.ai>
License-Expression: MIT
License-File: LICENSE
Keywords: memory,rag,retrieval,video,vision,vlm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Video
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: click>=8.1
Requires-Dist: librosa>=0.11.0
Requires-Dist: numpy>=1.26
Requires-Dist: opencv-python-headless>=4.13.0.92
Requires-Dist: pydantic>=2.0
Requires-Dist: rfdetr>=1.7.0
Requires-Dist: rich>=13.0
Requires-Dist: scenedetect>=0.6
Requires-Dist: soundfile>=0.13.1
Requires-Dist: sqlite-vec>=0.1.9
Requires-Dist: supervision>=0.28.0
Requires-Dist: torch>=2.0
Requires-Dist: torchvision
Requires-Dist: trackers>=2.4.0
Provides-Extra: all
Requires-Dist: librosa; extra == 'all'
Requires-Dist: rfdetr>=1.7; extra == 'all'
Requires-Dist: scenedetect[opencv]>=0.6; extra == 'all'
Requires-Dist: soundfile; extra == 'all'
Requires-Dist: sqlite-vec>=0.1; extra == 'all'
Requires-Dist: supervision>=0.25; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: torchvision; extra == 'all'
Requires-Dist: trackers>=2.1; extra == 'all'
Requires-Dist: transformers>=4.40; extra == 'all'
Provides-Extra: audio
Requires-Dist: librosa; extra == 'audio'
Requires-Dist: soundfile; extra == 'audio'
Requires-Dist: transformers>=4.40; extra == 'audio'
Provides-Extra: db
Requires-Dist: sqlite-vec>=0.1; extra == 'db'
Provides-Extra: dev
Requires-Dist: librosa; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: rfdetr>=1.7; extra == 'dev'
Requires-Dist: scenedetect[opencv]>=0.6; extra == 'dev'
Requires-Dist: soundfile; extra == 'dev'
Requires-Dist: sqlite-vec>=0.1; extra == 'dev'
Requires-Dist: supervision>=0.25; extra == 'dev'
Requires-Dist: torch>=2.0; extra == 'dev'
Requires-Dist: torchvision; extra == 'dev'
Requires-Dist: trackers>=2.1; extra == 'dev'
Requires-Dist: transformers>=4.40; extra == 'dev'
Provides-Extra: vision
Requires-Dist: rfdetr>=1.7; extra == 'vision'
Requires-Dist: supervision>=0.25; extra == 'vision'
Requires-Dist: torch>=2.0; extra == 'vision'
Requires-Dist: torchvision; extra == 'vision'
Requires-Dist: trackers>=2.1; extra == 'vision'
Description-Content-Type: text/markdown

# deja-reve

**Video Memory Layer** — retrieval-augmented memory for video-native AI agents.

deja-reve ingests a video and builds a local, queryable memory database of scene-segmented shots, object detections and tracks, visual embeddings, and speech transcripts. Processing runs **fully offline** with no external API services, apart from the one-time model weight download on the first `ingest` run.

## Features

- **Shot detection** — segments video into shots via [PySceneDetect](https://www.scenedetect.com/).
- **Object detection & tracking** — [RF-DETR](https://github.com/roboflow/rf-detr) detections tracked across frames with ByteTrack.
- **Visual embeddings** — 384-dim DINOv2 features for similarity search.
- **Speech transcription** — Moonshine ASR, with a silence gate and hallucination guard.
- **Vector + structured queries** — SQLite with [`sqlite-vec`](https://github.com/asg017/sqlite-vec).

## Prerequisites

- Python **3.12+**
- `ffmpeg` on your PATH (for audio extraction)

## Installation

```bash
pip install deja-reve
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add deja-reve
```

This pulls in all core dependencies, including PyTorch, RF-DETR, and the audio libraries. Model weights are not bundled with the package: the first `ingest` run downloads them on demand (~400 MB for RF-DETR, ~300 MB for Moonshine).

### From source

```bash
git clone https://github.com/Natai-AI/deja-reve.git
cd deja-reve
uv sync
```

## How it works

A VLM has no memory of a video between calls and a limited context window — you
can't just stream every frame to it. deja-reve solves this the way RAG solved it
for text: **process the video once into a structured, queryable memory, then at
question-time retrieve only the relevant shots and hand the model a compact text
context block** instead of raw video.

```
ingest (once)        query-time
  video ─▶ memory.db ─▶ retrieve relevant shots ─▶ VIDEO CONTEXT block ─▶ VLM
```

The retrieved context looks like this — a few hundred tokens carrying timestamps,
objects with bounding boxes, stable track IDs, transcript, and neighbor shots:

```
VIDEO CONTEXT:
- Shot #1: 4.0s - 8.0s (duration: 4.0s)
- Objects in frame: person (t0), chair (t1), person (t3), dining table (t-1)
  - t0 at [536, 408, 757, 923]
- Visual: bottle, bowl, chair, cup, dining table, person, wine glass
- Previous shot: #0 (0.0s - 4.0s)
- Next shot: #2 (8.0s - 10.0s)
```
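
To make the retrieve-then-format step concrete, here is a minimal sketch that uses plain dataclasses instead of deja-reve's own models. `Shot`, `shots_in_range`, and `render_context` are hypothetical names chosen for illustration, and the output only approximates the block above; the library's real formatter is `format_shot_context`.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    """Toy stand-in for a stored shot (hypothetical, not deja-reve's model)."""
    shot_id: int
    start: float
    end: float
    objects: list[str] = field(default_factory=list)


def shots_in_range(shots: list[Shot], start: float, end: float) -> list[Shot]:
    """Return shots whose [start, end) interval overlaps the query window."""
    return [s for s in shots if s.start < end and s.end > start]


def render_context(shot: Shot) -> str:
    """Render one shot as a compact text block, loosely mirroring VIDEO CONTEXT."""
    lines = [
        "VIDEO CONTEXT:",
        f"- Shot #{shot.shot_id}: {shot.start:.1f}s - {shot.end:.1f}s "
        f"(duration: {shot.end - shot.start:.1f}s)",
    ]
    if shot.objects:
        lines.append("- Objects in frame: " + ", ".join(shot.objects))
    return "\n".join(lines)


shots = [
    Shot(0, 0.0, 4.0),
    Shot(1, 4.0, 8.0, ["person (t0)", "chair (t1)"]),
    Shot(2, 8.0, 10.0),
]
hits = shots_in_range(shots, 4.0, 6.0)  # only shot 1 overlaps 4.0s - 6.0s
print(render_context(hits[0]))
```

The point of the design is that the model receives a few lines of text per relevant shot, so the prompt size depends on how many shots match the question, not on the length of the video.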

## Usage

### 1. Ingest a video (once)

```bash
deja-reve ingest path/to/video.mp4 --db video.db
```

If `--db` is omitted, the database path defaults to `<video_name>.db`.

### 2. Query from the CLI

```bash
# A specific shot
deja-reve query --db video.db --shot 1

# A time range (seconds)
deja-reve query --db video.db --time 4.0-8.0

# Shots visually similar to a given shot (normalized cosine similarity)
deja-reve query --db video.db --similar-to 1

# Visual mood shifts between shots
deja-reve query --db video.db --mood-shifts

# Full-text search over transcripts
deja-reve query --db video.db "what did they say about dinner"
```
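
For intuition about `--similar-to` and `--mood-shifts`, the sketch below computes cosine similarity between embedding vectors in pure Python. It is an illustration, not deja-reve's implementation: the `(1 + cos) / 2` normalization to [0, 1], the `mood_shifts` helper, and the `0.5` threshold are all assumptions, and real embeddings would be 384-dimensional instead of the toy 3-dimensional vectors used here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalized_similarity(a: list[float], b: list[float]) -> float:
    """Map cosine similarity to [0, 1] (assumed normalization)."""
    return (1.0 + cosine_similarity(a, b)) / 2.0


def mood_shifts(embeddings: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Indices i where shot i differs sharply from shot i - 1 (assumed rule)."""
    return [
        i
        for i in range(1, len(embeddings))
        if normalized_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]


# Toy per-shot embeddings: shots 0 and 1 look alike, shot 2 points the other way.
shot_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(mood_shifts(shot_embeddings))
```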

### 3. Use it as a VLM memory layer (Python)

The core workflow: **retrieve → format → inject into the prompt.**

```python
from deja_reve.db import MemoryDB
from deja_reve.retrieval import RetrievalEngine
from deja_reve.prompt import format_shot_context

db = MemoryDB("video.db")
engine = RetrievalEngine(db)

# Retrieve the relevant slice (here: around the 5s mark)
result = engine.query_time_range(4.0, 6.0)

# Format it into a VIDEO CONTEXT block
context = format_shot_context(result)

# Inject as ground truth, then ask your VLM
prompt = f"""{context}

Use the VIDEO CONTEXT above as ground-truth observations from the video.
Q: What's happening at the dinner table around the 5-second mark?
"""
# `your_vlm` is a placeholder for your own model client; it is not part of deja-reve
answer = your_vlm(prompt)   # optionally also pass the frame image
```

Pick the retrieval method that matches the question's intent:

| Intent | Method |
|---|---|
| A moment or time range ("around 5s") | `engine.query_time_range(start, end)` |
| A specific shot | `engine.query_by_shot_id(shot_id)` |
| "find similar scenes" | `engine.query_similar_shots_by_id(shot_id)` |
| "where the scene changes" | `engine.find_mood_shifts()` |
| "what did they say about X" | `engine.search_transcripts(text)` |

For multi-shot answers, use `format_multi_shot_context(results)` to join several
shot blocks with `---` separators.
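
One way to wire the table above into an agent is a small rule-based router. This is a sketch under assumptions: `route_question` and its regex heuristics are hypothetical, the method names and argument lists are taken from the table, and `engine` is any object exposing them (a `RetrievalEngine` in practice, a recording stub in this example).

```python
import re


def route_question(engine, question: str, window: float = 1.0):
    """Pick a retrieval method from simple keyword heuristics (hypothetical)."""
    q = question.lower()

    # "similar to shot 3" -> similarity search starting from that shot
    if m := re.search(r"similar to shot #?(\d+)", q):
        return engine.query_similar_shots_by_id(int(m.group(1)))

    # "shot 3" -> direct lookup
    if m := re.search(r"\bshot #?(\d+)", q):
        return engine.query_by_shot_id(int(m.group(1)))

    # "around 5s" / "the 5-second mark" -> a window centred on that time
    if m := re.search(r"(\d+(?:\.\d+)?)[\s-]*(?:s\b|sec)", q):
        t = float(m.group(1))
        return engine.query_time_range(max(0.0, t - window), t + window)

    if "scene change" in q or "mood" in q:
        return engine.find_mood_shifts()

    # Fallback: treat the whole question as a transcript search
    return engine.search_transcripts(question)


class RecordingEngine:
    """Stub that reports which method was called, for demonstration only."""

    def __getattr__(self, name):
        return lambda *args: (name, args)


engine = RecordingEngine()
print(route_question(engine, "What happens around 5s?"))
```

Keyword rules like these are brittle; a production agent might instead let the model choose the method through tool calling, with the same five methods exposed as tools.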

### 4. Edit video with a VLM (memory → decisions → ffmpeg)

deja-reve doesn't cut video — it lets a VLM decide *what* to cut and *where*. Feed
the context blocks to the model, have it emit an edit decision list (JSON of shot
IDs, timestamps, and crop/blur boxes drawn from the stored bounding boxes), then
run those decisions through `ffmpeg`. Every decision references real timestamps and
boxes from the DB, so the edit is precise and reproducible rather than guessed.
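
As a sketch of that last step, the snippet below turns a model-emitted edit decision list into `ffmpeg` argument lists. The JSON schema (`start`, `end`, `crop`) and the `edl_to_ffmpeg` helper are hypothetical, and the boxes are assumed to be `[x1, y1, x2, y2]` pixel coordinates like those in the example context block; verify that against your database before relying on it. Only standard `ffmpeg` options are used: `-ss`, `-t`, `-i`, and the `crop=w:h:x:y` filter.

```python
import json


def edl_to_ffmpeg(edl_json: str, src: str) -> list[list[str]]:
    """Build one ffmpeg command per decision in a (hypothetical) EDL."""
    commands = []
    for i, cut in enumerate(json.loads(edl_json)):
        start, end = float(cut["start"]), float(cut["end"])
        # Seek to the cut start, then keep (end - start) seconds.
        cmd = ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-i", src,
               "-t", f"{end - start:.3f}"]
        if box := cut.get("crop"):
            x1, y1, x2, y2 = box  # assumed [x1, y1, x2, y2] in pixels
            cmd += ["-vf", f"crop={x2 - x1}:{y2 - y1}:{x1}:{y1}"]
        cmd.append(f"clip_{i:03d}.mp4")
        commands.append(cmd)
    return commands


# Example EDL as a VLM might emit it; the values come from a context block.
edl = '[{"shot_id": 1, "start": 4.0, "end": 8.0, "crop": [536, 408, 757, 923]}]'
for cmd in edl_to_ffmpeg(edl, "video.mp4"):
    print(" ".join(cmd))
    # To actually render the clip: subprocess.run(cmd, check=True)
```

Building argument lists instead of shell strings keeps file names with spaces safe and makes each decision easy to log and replay.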

## Development

Run the test suite:

```bash
uv run pytest
```

## Project layout

```
src/deja_reve/
  cli.py         CLI entry point (ingest / query)
  ingest.py      End-to-end ingest pipeline
  shots.py       Scene/shot detection
  vision.py      RF-DETR detection + DINOv2 embeddings + tracking
  audio.py       Speech transcription
  retrieval.py   Query engine
  db.py          SQLite + sqlite-vec storage
  models.py      Pydantic data models
  prompt.py      Context formatting for agents
tests/           Test suite
```

## License

[MIT](LICENSE)
