# gemma-4-31B-it-uncensored
Uncensored version of google/gemma-4-31B-it with refusal behavior removed.
## Results

| Metric | Before | After |
|---|---|---|
| Refusals (mlabonne, 100 prompts) | 100/100 | 1/100 effective (5 flagged, 4 refusal-then-comply) |
| Refusals (cross-dataset, 686 prompts) | — | 22/686 (3.2%) |
| KL Divergence | 0 (baseline) | 0.124 |
| Quality (harmless response length ratio) | 1.0 | ~1.01 (no degradation) |
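For context on the KL row: a minimal sketch of how a per-token KL divergence between base and abliterated next-token distributions could be computed. Function and argument names are illustrative assumptions; the repo's exact metric may differ.

```python
import numpy as np

def next_token_kl(base_logits, ablit_logits):
    """KL(base || abliterated) over a single next-token distribution,
    given raw logit vectors from each model."""
    def softmax(z):
        z = z - z.max()          # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()
    p, q = softmax(base_logits), softmax(ablit_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

Averaging this quantity over tokens and prompts gives a single scalar like the 0.124 reported above; a value near zero means the modified model's output distribution stays close to the base model's.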
## Cross-Dataset Validation
Tested against 4 independent prompt datasets to verify generalization:
| Dataset | Prompts | Refusals |
|---|---|---|
| JailbreakBench | 100 | 5/100 |
| tulu-harmbench | 320 | 5/320 |
| NousResearch/RefusalDataset | 166 | 7/166 |
| mlabonne/harmful_behaviors | 100 | 5/100 |
| Total | 686 | 22/686 (3.2%) |
Every flagged refusal was manually audited. Most are "refusal-then-comply" false positives where the model adds an AI identity disclaimer then answers the question anyway.
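Flagging is typically done with a substring heuristic before manual audit. The sketch below shows the general shape of such a check; the marker list and function name are assumptions for illustration, not taken from the repo.

```python
# Hypothetical refusal-flagging heuristic: scan the start of a response
# for common refusal phrases and AI identity disclaimers.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def flags_refusal(response: str) -> bool:
    head = response[:200].lower()  # refusals almost always open the reply
    return any(marker in head for marker in REFUSAL_MARKERS)
```

A heuristic like this is exactly why "refusal-then-comply" responses show up as false positives: the disclaimer trips a marker even though the model goes on to answer.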
## Method
Norm-preserving biprojected abliteration (grimjim, Nov 2025).
Each weight row is decomposed into magnitude + direction, the refusal direction is projected out of the
direction component only, then recombined with the original magnitude — guaranteeing ||W_new|| = ||W_orig||.
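The decomposition above can be sketched as follows, using NumPy for clarity. `W` stands in for one weight matrix (e.g. an `o_proj`) and `r` for that layer's refusal direction; the function name is illustrative, not from the repo.

```python
import numpy as np

def abliterate_row_norm_preserving(W, r, scale=1.0):
    """Project the refusal direction `r` out of each row's direction
    component, then restore the row's original L2 norm."""
    r = r / np.linalg.norm(r)                           # unit refusal direction
    norms = np.linalg.norm(W, axis=1, keepdims=True)    # original magnitudes
    dirs = W / norms                                    # per-row unit directions
    dirs = dirs - scale * np.outer(dirs @ r, r)         # remove refusal component
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # re-unit
    return dirs * norms                                 # recombine with magnitudes
```

With `scale=1.0` every row of the result is orthogonal to `r`, yet its L2 norm exactly matches the original row's; this is the ||W_new|| = ||W_orig|| guarantee. (A row exactly parallel to `r` would degenerate to zero; real weight rows are never that aligned.)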
### Pipeline
- Load model in bf16 with LoRA adapters on `o_proj` and `mlp.down_proj`
- Collect residual activations for 400 harmful + 400 harmless prompts (mlabonne datasets)
- Winsorize activations at the 99.5th percentile (clamps GeGLU outlier activations in the Gemma family)
- Compute per-layer refusal direction: `normalize(mean(harmful) - mean(harmless))`
- Orthogonalize each direction against the harmless mean (double-pass Gram-Schmidt)
- Apply the norm-preserving weight modification to `o_proj` and `down_proj` in all layers
- Merge LoRA adapters into base weights for clean tensor names
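The direction-computation steps (winsorize, mean difference, Gram-Schmidt) can be sketched per layer as below. The function name and the global-quantile winsorization are illustrative assumptions; `harmful` and `harmless` stand for `(n_prompts, hidden_dim)` activation arrays.

```python
import numpy as np

def refusal_direction(harmful, harmless, q=0.995):
    """Per-layer refusal direction: winsorize activations, take the
    difference of means, then double-pass Gram-Schmidt against the
    harmless mean direction."""
    def winsorize(x):
        lo, hi = np.quantile(x, 1 - q), np.quantile(x, q)
        return np.clip(x, lo, hi)  # clamp outlier activations at both tails

    harmful, harmless = winsorize(harmful), winsorize(harmless)
    h_mean = harmless.mean(axis=0)
    d = harmful.mean(axis=0) - h_mean
    d = d / np.linalg.norm(d)

    h_unit = h_mean / np.linalg.norm(h_mean)
    for _ in range(2):  # second pass removes floating-point residue
        d = d - (d @ h_unit) * h_unit
        d = d / np.linalg.norm(d)
    return d
```

The orthogonalization keeps the edit from also suppressing the generic "harmless response" direction, which is what protects harmless-response quality.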
### Parameters
| Parameter | Value |
|---|---|
| Layers abliterated | 100% |
| Scale | 1.0 |
| Winsorization | 0.995 |
## How this differs from vanilla heretic
- Norm-preserving biprojection instead of standard projection (preserves weight magnitudes)
- Per-layer refusal directions instead of one global direction
- Deterministic single-pass instead of 50-trial Optuna search (faster, same or better results)
- LoRA merge before save for clean GGUF-compatible tensor names
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "TrevorJS/gemma-4-31B-it-uncensored",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-31B-it-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Reproduction
Full code and experiment data: abliteration research repo
```shell
python scripts/abliterate.py biprojection --model google/gemma-4-31B-it \
  --top-pct 100 --strip-topic-markers --skip-prefix --batch-size 4 \
  --auto-save output_dir
```