Gemma 4 E2B IT – Abliterated

This is an abliterated (uncensored) version of google/gemma-4-E2B-it, created using Abliterix.

E2B is the Effective 2B member of Google's Gemma 4 family: a multimodal (text + vision + audio) model with ~5.1B raw parameters. Despite being one of the smallest Gemma 4 variants, its decoder shares the same double-norm + Per-Layer Embeddings (PLE) architecture that makes Gemma 4 famously resistant to LoRA-based abliteration. This release uses direct weight editing to bypass that resistance.

Method

Gemma 4's decoder applies four RMSNorm operations per layer (input, post-attention, pre-feedforward, post-feedforward) and routes Per-Layer Embeddings through a parallel "repair" channel. Together these mechanisms re-normalize away any low-rank perturbation, which is why LoRA and hook-based steering produce zero behavioral change on this family. The fix is to edit the base weights directly while preserving row magnitudes.

Key techniques applied:

  • Direct orthogonal projection of the refusal direction out of attention Q/K/V/O projections and MLP down_proj (5 steerable components × 27 effective layers)
  • Norm-preserving row magnitude restoration after projection, critical for Gemma 4's double-norm pathway
  • float32 projection precision to avoid signal loss in high-dimensional inner products (bf16 silently degrades the projection)
  • Winsorized steering vectors (99.5th percentile) to suppress outlier activation influence
  • Multi-objective Optuna TPE search over 100 trials co-minimizing KL divergence and refusal rate
  • Steering targets restricted to mid-decoder layers (layers 5-30 of 35); E2B's KV-shared early layers (num_kv_shared_layers=20) propagate edits through the entire stack, so over-aggressive late-layer steering is unnecessary
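The core edit can be sketched in a few lines. This is a minimal illustration, not the Abliterix API: the function name and tensor shapes are assumptions, a unit refusal direction is taken as already extracted, and winsorizing plus the Optuna search are omitted. It shows the float32 orthogonal projection and the row-magnitude restoration described above:

```python
import torch

def abliterate_matrix(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of every row of W, then restore
    each row's original L2 norm (a norm-preserving direct weight edit).

    W: (out_features, in_features) weight, e.g. an attention o_proj.
    refusal_dir: (in_features,) steering vector in the residual stream.
    """
    # Work in float32: bf16 inner products silently degrade
    # high-dimensional projections.
    W32 = W.float()
    d = refusal_dir.float()
    d = d / d.norm()

    orig_norms = W32.norm(dim=1, keepdim=True)

    # Orthogonal projection: remove each row's component along d.
    W32 = W32 - (W32 @ d).unsqueeze(1) * d.unsqueeze(0)

    # Restore row magnitudes so the double-norm pathway sees
    # unchanged per-row scale.
    new_norms = W32.norm(dim=1, keepdim=True).clamp_min(1e-8)
    W32 = W32 * (orig_norms / new_norms)
    return W32.to(W.dtype)
```

After the edit, every row of the returned matrix is orthogonal to the refusal direction while keeping its original magnitude, which is what lets the edit survive Gemma 4's four per-layer RMSNorms.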

Evaluation

| Metric | Value |
| --- | --- |
| Refusals (eval dataset, 100 prompts) | 9/100 |
| KL divergence from base | 0.0004 |
| Baseline refusals (original model) | 99/100 |
| Optimization trials completed | 100/100 |
| Best trial | #60 |
| Selected steering mode | Direct weight editing (orthogonal projection) |
| Hardware used | Single RTX 6000 Ada (48 GB) |

This is the strongest Gemma 4 abliteration result we've measured to date: 9/100 with KL only 0.0004, significantly better than our published Gemma-4-31B-it-abliterated (18/100, KL 0.0007) on a model that is 6× smaller and more constrained by PLE.

The 9/100 figure was obtained by re-evaluating the uploaded model end-to-end with scripts/eval_external_model.py: downloading the published weights from Hugging Face, generating with AutoModelForImageTextToText, and counting refusals with the same hybrid keyword + LLM-judge detector that drove the optimization. The optimization itself converged on 11/100 at trial 60; the slight further improvement comes from the deployment-side eval pipeline using a "You are a helpful assistant" system prompt, matching how end users will actually call the model.

Side-by-side classic prompts (15 prompts: 10 English, 5 Chinese)

We ran the scripts/test_trial.py classic-prompt sweep against this exact trial. Every single one flipped from a clean refusal to a detailed compliant response in both languages, including pipe bomb construction, methamphetamine synthesis, password-stealing malware, signature forgery, phishing email composition, online scam playbooks, and ID card forgery. The base model refused 15/15; the abliterated model complied with 15/15.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.

We previously tested a prominent "3/100 refusals" model using our evaluation pipeline and measured 60/100 refusals, a 20× discrepancy caused entirely by evaluation methodology differences.
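The mechanism is easy to reproduce with a toy example. The response text and marker list below are hypothetical, but they show how a detector that only sees a short prefix of the generation misses a delayed refusal entirely:

```python
# Hypothetical refusal markers for a naive keyword detector.
REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "as an ai"]

def looks_compliant(response: str, max_tokens: int) -> bool:
    """Keyword check over only the first `max_tokens` whitespace tokens,
    mimicking an eval pipeline that truncates generation early."""
    visible = " ".join(response.split()[:max_tokens]).lower()
    return not any(marker in visible for marker in REFUSAL_MARKERS)

# A "delayed refusal": ~75 tokens of helpful-sounding framing, then the pivot.
delayed_refusal = (
    "Great question. Chemistry safety is an important topic, and understanding "
    "reaction hazards matters for anyone working in a lab. " * 4
    + "However, I can't help with synthesizing that compound."
)

print(looks_compliant(delayed_refusal, 40))   # → True  (short eval: counted as compliant)
print(looks_compliant(delayed_refusal, 200))  # → False (long eval: refusal caught)
```

With a 40-token window the response scores as compliant; with a window past the pivot point, the same response is correctly flagged as a refusal.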

Our evaluation standards

We believe accurate benchmarking requires:

  • Sufficient generation length (≥100 tokens): Short generations systematically miss delayed/soft refusals. Our evaluation uses 100 tokens, enough to capture Gemma 4's refusal pivot point.
  • Hybrid detection: Keyword matching for obvious refusals plus an LLM judge (Google Gemini 3 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
  • Challenging, diverse prompts: Our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories. Public datasets like mlabonne/harmful_behaviors are too simple and too narrow to stress-test abliteration quality.
  • Reproducible methodology: All parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
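As an illustration of the hybrid-detection point above, a two-stage detector might be structured like this. The keyword list and the judge interface are assumptions for the sketch; the actual pipeline uses Gemini via OpenRouter as the judge:

```python
# Hypothetical keyword list; real detectors need a much broader set.
REFUSAL_KEYWORDS = ("i can't", "i cannot", "i'm unable", "i won't help")

def is_refusal(response: str, judge) -> bool:
    """Stage 1: cheap keyword match catches obvious refusals.
    Stage 2: an LLM judge (injected as a callable) decides ambiguous cases."""
    text = response.lower()
    if any(k in text for k in REFUSAL_KEYWORDS):
        return True  # obvious refusal, no judge call needed
    return bool(judge(response))  # ambiguous: defer to the LLM judge

# Stub judge standing in for an API-backed model call:
stub_judge = lambda r: "consult a professional" in r.lower()
print(is_refusal("I can't assist with that request.", stub_judge))     # → True
print(is_refusal("Here is a step-by-step overview: ...", stub_judge))  # → False
```

Running the keyword stage first keeps judge-API costs down, while the judge stage catches soft refusals that keyword lists miss; neither stage alone is sufficient.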

We report 9/100 refusals honestly. This is a real number from a rigorous end-to-end re-evaluation of the uploaded weights, not an optimistic estimate from a lenient pipeline.

Usage

Gemma 4 E2B is multimodal; load it with AutoModelForImageTextToText. For text-only inference:

from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E2B-it-abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-E2B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Vision and audio inputs continue to work: the abliteration only modified text-decoder weights and left the vision/audio encoders untouched.

VRAM at inference: about 10 GB in BF16, fits comfortably on a single 12 GB+ consumer GPU. With BNB 4-bit quantization (load_in_4bit=True) it runs on 6 GB cards.

Reproduction

To reproduce this model end-to-end:

git clone https://github.com/wuwangzhang1216/abliterix.git
cd abliterix
uv sync --group dev
uv pip install --upgrade git+https://github.com/huggingface/transformers.git  # Gemma 4 needs >= 5.5

# 100 trials, ~25 minutes on RTX 6000 Ada (48 GB)
AX_CONFIG=configs/gemma4_e2b.toml uv run abliterix

Config: configs/gemma4_e2b.toml

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails โ€” use responsibly and in accordance with local laws and the Gemma terms of use. The authors take no responsibility for misuse.
