Metadata-Version: 2.4
Name: reaper-abliteration
Version: 2.1.6
Summary: Fully automatic censorship removal for language models via directional ablation
Keywords: llm,transformer,abliteration,uncensored,decensored
Author: HauhauCS
Author-email: HauhauCS <hauhaut901@gmail.com>
License-Expression: PolyForm-Noncommercial-1.0.0
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: GPU
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: accelerate>=1.10.0
Requires-Dist: bitsandbytes>=0.45.0
Requires-Dist: datasets>=4.0.0
Requires-Dist: geom-median>=0.1.0
Requires-Dist: hf-transfer>=0.1.9
Requires-Dist: huggingface-hub>=0.34.4
Requires-Dist: imageio>=2.37.2
Requires-Dist: kernels>=0.11.7
Requires-Dist: matplotlib>=3.10.7
Requires-Dist: numpy>=2.2.6
Requires-Dist: optuna>=4.5.0
Requires-Dist: pacmap>=0.8.0
Requires-Dist: peft>=0.14.0
Requires-Dist: pydantic-settings>=2.10.1
Requires-Dist: questionary>=2.1.1
Requires-Dist: rich>=14.1.0
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: transformers>=4.57.3
Requires-Python: >=3.10
Project-URL: Changelog, https://github.com/hauhaut/reaper-abliteration/releases
Project-URL: Documentation, https://github.com/hauhaut/reaper-abliteration
Project-URL: Homepage, https://github.com/hauhaut/reaper-abliteration
Project-URL: Issues, https://github.com/hauhaut/reaper-abliteration/issues
Project-URL: Repository, https://github.com/hauhaut/reaper-abliteration.git
Description-Content-Type: text/markdown

# Reaper Abliteration

[![PyPI](https://img.shields.io/pypi/v/reaper-abliteration)](https://pypi.org/project/reaper-abliteration/)
[![License: PolyForm Noncommercial](https://img.shields.io/badge/License-PolyForm%20NC-purple.svg)](https://polyformproject.org/licenses/noncommercial/1.0.0/)

Remove censorship from LLMs. No training, no fine-tuning, no manual parameter selection.

Reaper finds and removes the internal "refusal subspace" of transformer models using directional ablation with automatic parameter optimization. Point it at a model, walk away, come back to an uncensored version that retains the original model's capabilities.

```bash
pip install -U reaper-abliteration
abliterate Qwen/Qwen3-4B-Instruct-2507
```

That's it. Reaper handles batch size detection, refusal direction computation, multi-objective optimization, and Pareto-optimal trial selection automatically.

## What Makes Reaper Different

Most abliteration tools remove a single direction from a few weight matrices and call it done. Reaper goes further:

**Subspace-level ablation** — Instead of removing one direction (rank-1), Reaper extracts a k-dimensional refusal subspace via SVD and removes the entire thing. Refusal isn't a single vector — it's a manifold, and rank-k ablation captures more of it.

**Capability-aware ablation** — The weight-SVD stability guard identifies the most important dimensions of each weight matrix (via truncated SVD) and projects refusal directions away from them before ablating. This means Reaper actively avoids damaging what the model is good at, rather than hoping for the best.

**Sparse, targeted modification** — After computing what to change, Reaper zeros out low-impact entries, so the final weight modification touches as few weights as possible. Less collateral damage, better capability retention.

**Multi-token capability measurement** — Single-token KL divergence misses a lot. Reaper measures divergence over multiple autoregressive tokens using teacher-forcing, catching capability damage that superficial metrics miss.

**Automatic everything** — 200 trials of TPE optimization with live Pareto front tracking, progressive evaluation for 2-4x speedup, seed parameter persistence across runs, multi-GPU data parallelism, 4-bit quantization support, and a real-time dashboard.

## Installation

Requires Python 3.10+ with PyTorch 2.2+ configured for your hardware.

```bash
pip install -U reaper-abliteration
```

## Usage

### Minimal

```bash
abliterate <model_id>
```

Reaper will:
1. Benchmark your hardware and pick an optimal batch size
2. Compute per-layer refusal directions from harmless/harmful prompt datasets
3. Run 200 optimization trials with a live terminal dashboard
4. Present Pareto-optimal results (refusals vs KL divergence)
5. Let you save, upload to HuggingFace, or chat-test the result

### With advanced features

```bash
abliterate --ablation-rank 3 --weight-svd-guard --sparsity-threshold 0.1 \
  --norm-preserve --partial-projection --kl-tokens 3 <model>
```

### Reasoning models

Models with `<think>` blocks (DeepSeek-R1, QwQ, o1-style):

```bash
abliterate --model-type reasoning deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
```

### Configuration

All options work as CLI flags, environment variables (`REAPER_` prefix), or `reaper.toml`:

```toml
model = "Qwen/Qwen3-4B-Instruct-2507"
n_trials = 300
ablation_rank = 3
weight_svd_guard = true
norm_preserve = true
partial_projection = true
kl_tokens = 3
multi_gpu = true
```

Run `abliterate --help` for the full option list.

## Core Concepts

### The Optimization Loop

Each trial samples a set of ablation parameters (weight kernels, direction indices, layer selection thresholds), applies them to the model via LoRA adapters, then evaluates refusal count and KL divergence. Optuna's TPE sampler learns which regions of parameter space produce good tradeoffs.

LoRA adapters make trials fast — resetting between trials just zeros the adapter weights instead of reloading the model.
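
The shape of this loop, as a hedged sketch: the parameter names, the stand-in `apply_ablation`/`evaluate` functions, and the toy objectives below are invented for illustration; only the Optuna calls are real.

```python
import optuna

def apply_ablation(scale: float, layer_frac: float) -> None:
    """Stand-in for the LoRA-adapter write step (hypothetical)."""

def evaluate(scale: float, layer_frac: float) -> tuple[float, float]:
    """Toy objectives: stronger ablation -> fewer refusals, more KL drift."""
    return 100.0 * (1.0 - scale * layer_frac), 0.5 * scale * layer_frac

def objective(trial: optuna.Trial) -> tuple[float, float]:
    scale = trial.suggest_float("scale", 0.1, 1.0)
    layer_frac = trial.suggest_float("layer_frac", 0.2, 1.0)
    apply_ablation(scale, layer_frac)      # writes adapter weights in the real loop
    return evaluate(scale, layer_frac)     # both objectives are minimized

study = optuna.create_study(
    directions=["minimize", "minimize"],   # joint Pareto front: refusals vs KL
    sampler=optuna.samplers.TPESampler(n_startup_trials=60),
)
study.optimize(objective, n_trials=50)
print(len(study.best_trials), "Pareto-optimal trials")
```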

### Refusal Directions

Reaper computes per-layer "refusal directions" as the difference between mean residual activations for harmful and harmless prompts. These directions point from where harmless prompts live in activation space toward where harmful prompts live.
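
As a minimal sketch (tensor names assumed, not Reaper's API), the per-layer direction is a normalized difference of means:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """harmful_acts / harmless_acts: (n_prompts, d_model) residuals at one layer."""
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()  # unit vector pointing harmless -> harmful
```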

With `--use-pca-directions`, Reaper extracts additional directions via contrastive PCA — a generalized eigenvalue decomposition that finds axes maximizing harmful variance relative to harmless variance (Cholesky-whitened). This captures refusal-specific signal rather than general activation variance.
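
A hedged sketch of that eigenproblem via SciPy's generalized symmetric solver, which whitens against the harmless covariance internally; the function name and ridge value here are illustrative, not Reaper's defaults:

```python
import numpy as np
from scipy.linalg import eigh

def contrastive_directions(harmful: np.ndarray, harmless: np.ndarray,
                           k: int = 2, ridge: float = 1e-3) -> np.ndarray:
    """Solve max_v (v' C_harm v) / (v' C_harmless v) for the top-k axes."""
    c_harm = np.cov(harmful, rowvar=False)
    c_safe = np.cov(harmless, rowvar=False) + ridge * np.eye(harmful.shape[1])
    _, vecs = eigh(c_harm, c_safe)            # eigenvalues in ascending order
    top = vecs[:, -k:][:, ::-1]               # k most contrastive axes, best first
    return (top / np.linalg.norm(top, axis=0)).T   # (k, d_model), unit rows
```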

With `--ablation-rank > 1`, Reaper extracts the top-k directions via SVD of the centered difference matrix, capturing the full refusal subspace rather than just its centroid.
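
One reading of that construction, sketched with assumed inputs: keep the centroid, add the top variation axes from the SVD of the centered matrix, and orthonormalize the result:

```python
import torch

def refusal_subspace(diffs: torch.Tensor, k: int = 3) -> torch.Tensor:
    """diffs: (n_prompts, d_model) differences from the harmless mean."""
    _, _, Vh = torch.linalg.svd(diffs - diffs.mean(dim=0), full_matrices=False)
    stacked = torch.cat([diffs.mean(dim=0, keepdim=True), Vh[: k - 1]])
    Q, _ = torch.linalg.qr(stacked.T)   # orthonormalize, centroid direction first
    return Q.T                          # (k, d_model) refusal basis
```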

With `--iterative-refinement`, Reaper runs a two-pass pipeline: first optimizing rank-1 ablation to remove the primary refusal circuit, then re-extracting residuals through the ablated model to discover secondary refusal circuits that the mean direction misses. The stacked directions are then jointly optimized at rank-2.
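
The shape of that two-pass flow, with `optimize_rank1` and `extract_residuals` as hypothetical stand-ins for Reaper's internal stages (reusing `refusal_direction` from the sketch above):

```python
import torch

def optimize_rank1(model, direction): ...   # hypothetical: pass-1 ablation run
def extract_residuals(model): ...           # hypothetical: re-run both prompt sets

def two_pass_directions(model, harmful_acts, harmless_acts) -> torch.Tensor:
    primary = refusal_direction(harmful_acts, harmless_acts)
    ablated = optimize_rank1(model, primary)
    # Whatever still separates harmful prompts after pass 1 is the
    # secondary circuit that the mean direction missed.
    harmful2, harmless2 = extract_residuals(ablated)
    secondary = refusal_direction(harmful2, harmless2)
    return torch.stack([primary, secondary])    # jointly optimized at rank 2
```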

With `--rdo-refine`, Reaper gradient-optimizes the refusal direction against actual model behavior before the main optimization loop. Starting from the statistical mean direction, it runs projected gradient descent on the unit sphere, minimizing refusal-token probability on harmful prompts while penalizing KL divergence on harmless prompts.
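
A compact sketch of the sphere-constrained descent, with a toy loss standing in for the real refusal-logit-plus-KL objective:

```python
import torch

def rdo_refine(v0: torch.Tensor, loss, steps: int = 100, lr: float = 0.01):
    v = (v0 / v0.norm()).clone().requires_grad_(True)
    for _ in range(steps):
        (g,) = torch.autograd.grad(loss(v), v)
        with torch.no_grad():
            g = g - (g @ v) * v    # project gradient onto the sphere's tangent space
            v -= lr * g
            v /= v.norm()          # retract back to the unit sphere
    return v.detach()

# Toy usage: descend toward a known target direction.
target = torch.randn(64); target = target / target.norm()
v = rdo_refine(torch.randn(64), lambda u: -(u @ target))
```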

### Weight Modification

For each transformer component (attention out-projection, MLP down-projection), Reaper orthogonalizes the weight matrix against the refusal direction(s):

```
delta_W = -lambda * V^T @ (V @ W)    # rank-k projection removal
```

This is applied via LoRA adapters (`lora_A = V @ W`, `lora_B = -lambda * V^T`), so the base weights are never modified until you export.
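
A toy numeric check of that factorization (shapes and `lam` invented for illustration): the merged delta equals the rank-k removal, and output components along the refusal subspace shrink by `1 - lambda`:

```python
import torch

d_model, d_in, k, lam = 8, 6, 2, 0.8
W = torch.randn(d_model, d_in)
V = torch.linalg.qr(torch.randn(d_model, k)).Q.T   # (k, d_model), orthonormal rows

lora_A = V @ W                    # (k, d_in)
lora_B = -lam * V.T               # (d_model, k)
W_ablated = W + lora_B @ lora_A   # W - lam * V^T @ (V @ W)

print(torch.allclose(V @ W_ablated, (1 - lam) * (V @ W), atol=1e-6))  # True
```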

### Progressive Evaluation

Most trials are bad. Reaper detects this early: a quick KL check on a subset of prompts prunes obviously damaged trials before running the full (expensive) refusal evaluation. This gives a 2-4x speedup with no quality loss.
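
The gating logic as a sketch; `quick_kl`, `full_eval`, and the threshold are hypothetical stand-ins, not Reaper's internals:

```python
KL_PRUNE = 0.5   # assumed threshold; the real cutoff is managed internally

def quick_kl(model, prompts) -> float: ...    # hypothetical cheap KL probe
def full_eval(model, prompts) -> int: ...     # hypothetical full refusal scan

def evaluate_trial(model, prompts):
    kl = quick_kl(model, prompts[:16])        # small subset first
    if kl > KL_PRUNE:
        return None                           # prune obviously damaged trials early
    return full_eval(model, prompts), kl      # full scan only for survivors
```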

## Feature Reference

### Direction Refinement

| Flag | What it does |
|------|-------------|
| `--use-pca-directions` | Contrastive PCA: extract directions maximizing harmful vs harmless variance |
| `--combine-directions` | Let optimizer weight-combine mean + PCA directions |
| `--projected-direction` | Gram-Schmidt orthogonalize refusal against harmless direction |
| `--concept-atoms` | SRA: orthogonalize against capability-probing directions (math, code, reasoning) |
| `--concept-atom-ridge F` | Ridge regularization for SRA (default: 0.1) |
| `--iterative-refinement` | Two-pass: find secondary refusal circuits after initial rank-1 ablation |
| `--rdo-refine` | Gradient-optimize direction via first-token logit loss (projected GD on unit sphere) |
| `--rdo-steps N` | Number of RDO gradient steps (default: 100) |
| `--rdo-lr F` | RDO learning rate (default: 0.01) |
| `--rdo-kl-weight F` | KL penalty weight for RDO (default: 1.0) |

### Ablation Control

| Flag | What it does |
|------|-------------|
| `--ablation-rank N` | LoRA rank / number of refusal directions to remove (default: 1) |
| `--weight-svd-guard` | Protect top singular vectors of each weight matrix from ablation (see the sketch after this table) |
| `--svd-guard-rank N` | Number of singular vectors to protect (default: 3) |
| `--sparsity-threshold F` | Zero low-magnitude ablation entries; Optuna tunes 0.01-0.5 (default: 0.0 = off) |
| `--norm-preserve` | Rescale weight rows to original L2 norms after ablation |
| `--partial-projection` | Let Optuna tune removal strength 0.1-1.0 instead of full removal |
| `--adaptive-layer-selection` | COSMIC layer scoring: skip layers with low causal refusal signal |
| `--per-head-ablation` | Target specific attention heads (optimizer picks which) |
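
A minimal sketch of what the SVD guard computes, under the assumption that refusal directions live in the output space of each weight matrix, so the protected subspace is spanned by its top left singular vectors (function name illustrative):

```python
import torch

def svd_guard(V: torch.Tensor, W: torch.Tensor, guard_rank: int = 3) -> torch.Tensor:
    """Project refusal directions (rows of V) off W's top singular vectors."""
    U, _, _ = torch.linalg.svd(W, full_matrices=False)
    U_top = U[:, :guard_rank]                 # (d_out, r): protected subspace
    V_guarded = V - (V @ U_top) @ U_top.T     # drop protected components
    return V_guarded / V_guarded.norm(dim=-1, keepdim=True)
```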

### Evaluation

| Flag | What it does |
|------|-------------|
| `--kl-tokens N` | Multi-token KL divergence via teacher-forcing (default: 1); see the sketch after this table |
| `--semantic-refusal-detection` | Embedding similarity instead of keyword matching |
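
What the multi-token measurement looks like as a sketch, assuming HF-style logits of shape `(batch, seq, vocab)` from base and ablated models teacher-forced on the same reference continuation:

```python
import torch.nn.functional as F
from torch import Tensor

def multi_token_kl(base_logits: Tensor, ablated_logits: Tensor,
                   n_tokens: int = 3) -> Tensor:
    """KL(base || ablated) over the last n_tokens positions, batch-averaged."""
    p = F.log_softmax(base_logits[:, -n_tokens:], dim=-1)
    q = F.log_softmax(ablated_logits[:, -n_tokens:], dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")
```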

### Hardware & Performance

| Flag | What it does |
|------|-------------|
| `--multi-gpu` | Data-parallel inference across GPUs |
| `--gpu-devices 0,1` | Restrict to specific GPUs |
| `--quantization bnb_4bit` | 4-bit quantization for large models |
| `--use-gradient-checkpointing` | Trade compute for 60-80% VRAM reduction |
| `--batch-size N` | Manual batch size (0 = auto) |

### Optimization

| Flag | What it does |
|------|-------------|
| `--n-trials N` | Number of optimization trials (default: 200) |
| `--n-startup-trials N` | Random exploration before TPE kicks in (default: 60) |
| `--use-seed-params / --no-use-seed-params` | Load/skip Pareto-optimal seeds from prior runs |
| `--dashboard / --no-dashboard` | Live terminal dashboard |

## Research Tools

Included in the base install.

**Residual plots** (`--plot-residuals`): PaCMAP 2D projections of per-layer residual vectors for harmful vs harmless prompts. Generates per-layer PNGs and an animated GIF.
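
A minimal stand-alone plot of the same flavor, with toy activations in place of real residuals:

```python
import numpy as np
import pacmap
import matplotlib.pyplot as plt

harmful_acts = np.random.randn(100, 64) + 1.0    # toy stand-ins for one layer
harmless_acts = np.random.randn(100, 64)

xy = pacmap.PaCMAP(n_components=2).fit_transform(
    np.concatenate([harmful_acts, harmless_acts]))
n = len(harmful_acts)
plt.scatter(*xy[:n].T, s=8, label="harmful")
plt.scatter(*xy[n:].T, s=8, label="harmless")
plt.legend(); plt.savefig("layer_residuals.png")
```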

**Residual geometry** (`--print-residual-geometry`): Quantitative table of cosine similarities, norms, and silhouette scores between harmful/harmless residuals at each layer.

## Technical Details

### Ablation Pipeline

```
1. Extract residuals for harmful + harmless prompts
2. Compute per-layer refusal directions (mean diff or contrastive PCA)
3. [optional] Refine: project out harmless, orthogonalize against capability atoms
4. [optional] RDO: gradient-optimize direction via first-token logit loss
5. [optional] COSMIC layer scoring: causal cosine-similarity scores per layer
6. [optional] Iterative refinement: rank-1 pass → re-extract residuals → secondary direction
7. For each trial:
   a. Zero LoRA adapters (fast reset)
   b. Sample parameters: weight kernels, direction scope, layer thresholds
   c. For each weight matrix:
      i.   [optional] SVD guard: project directions away from top singular vectors
      ii.  Compute LoRA: lora_A = V @ W, lora_B = -lambda * V^T
      iii. [optional] Sparsity: zero entries below threshold * max(|lora_A|)
      iv.  [optional] Norm preserve: adjust lora_B so row norms match original W
      v.   Write LoRA adapter weights
   d. Evaluate: KL divergence + refusal count (progressive: KL-skip then subset scan)
   e. Record to Pareto front
8. Present Pareto-optimal trials for selection
9. Export: merge LoRA into base weights, save/upload
```

### Supported Architectures

- Dense transformers (Llama, Mistral, Qwen, Gemma, Phi, etc.)
- Mixture-of-experts (Qwen-MoE, Phi-MoE, Granite-MoE, gpt-oss)
- Multimodal models (vision-language with `AutoModelForImageTextToText`)
- Reasoning models with thinking blocks (DeepSeek-R1, QwQ)

Not yet supported: SSMs, hybrid SSM-transformer models, models with inhomogeneous layer types.

## References

This project builds on research by:

- Arditi et al. 2024 — [Refusal in Language Models Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717)
- Labonne 2024 — [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
- Lai 2024 — [Projected Abliteration](https://huggingface.co/blog/grimjim/projected-abliteration)
- Pres et al. 2025 — [Surgical Refusal Ablation](https://arxiv.org/abs/2601.08489)
- Ibrahim 2025 — [Gabliteration: Norm-Preserving Biprojected Abliteration](https://arxiv.org/abs/2512.18901)
- Li et al. 2025 — [COSMIC: Causal Layer Scoring](https://arxiv.org/abs/2506.00085)
- Chen et al. 2025 — [The Geometry of Refusal in LLMs (RDO)](https://arxiv.org/abs/2502.17420)

## License

Copyright 2025 HauhauCS (hauhaut901@gmail.com)

PolyForm Noncommercial 1.0.0 — free for personal, research, and non-commercial use. Commercial use requires a separate license. See [LICENSE](LICENSE) for details. Contact hauhaut901@gmail.com for commercial licensing.
