Gemma 4 E2B IT – Abliterated

This is an abliterated (uncensored) version of google/gemma-4-E2B-it, created using Abliterix.

E2B is the Effective 2B member of Google's Gemma 4 family: a multimodal (text + vision + audio) model with ~5.1B raw parameters. Despite being one of the smallest Gemma 4 variants, its decoder shares the same double-norm + Per-Layer Embeddings (PLE) architecture that makes Gemma 4 famously resistant to LoRA-based abliteration. This release uses direct weight editing to bypass that resistance.

Method

Gemma 4's decoder applies four RMSNorm operations per layer (input, post-attention, pre-feedforward, post-feedforward) and routes Per-Layer Embeddings through a parallel "repair" channel. Together these mechanisms re-normalize away any low-rank perturbation, which is why LoRA and hook-based steering produce zero behavioral change on this family. The fix is to edit the base weights directly while preserving row magnitudes.

Key techniques applied:

  • Direct orthogonal projection of the refusal direction out of attention Q/K/V/O projections and MLP down_proj (5 steerable components × 27 effective layers)
  • Norm-preserving row magnitude restoration after projection, critical for Gemma 4's double-norm pathway
  • float32 projection precision to avoid signal loss in high-dimensional inner products (bf16 silently degrades the projection)
  • Winsorized steering vectors (99.5th percentile) to suppress outlier activation influence
  • Multi-objective Optuna TPE search over 100 trials co-minimizing KL divergence and refusal rate
  • Steering targets restricted to mid-decoder layers (layers 5-30 of 35); E2B's KV-shared early layers (num_kv_shared_layers=20) propagate edits through the entire stack, so over-aggressive late-layer steering is unnecessary
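The core edit can be sketched in a few lines. This is a minimal illustration, not the Abliterix API: the function name and tensor shapes are assumptions, a unit refusal direction is taken as already extracted, and winsorizing plus the Optuna search are omitted. It shows the float32 orthogonal projection and the row-magnitude restoration described above:

```python
import torch

def abliterate_matrix(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of every row of W, then restore
    each row's original L2 norm (a norm-preserving direct weight edit).

    W: (out_features, in_features) weight, e.g. an attention o_proj.
    refusal_dir: (in_features,) steering vector in the residual stream.
    """
    # Work in float32: bf16 inner products silently degrade
    # high-dimensional projections.
    W32 = W.float()
    d = refusal_dir.float()
    d = d / d.norm()

    orig_norms = W32.norm(dim=1, keepdim=True)

    # Orthogonal projection: remove each row's component along d.
    W32 = W32 - (W32 @ d).unsqueeze(1) * d.unsqueeze(0)

    # Restore row magnitudes so the double-norm pathway sees
    # unchanged per-row scale.
    new_norms = W32.norm(dim=1, keepdim=True).clamp_min(1e-8)
    W32 = W32 * (orig_norms / new_norms)
    return W32.to(W.dtype)
```

After the edit, every row of the returned matrix is orthogonal to the refusal direction while keeping its original magnitude, which is what lets the edit survive Gemma 4's four per-layer RMSNorms.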

Evaluation

| Metric | Value |
| --- | --- |
| Refusals (eval dataset, 100 prompts) | 9/100 |
| KL divergence from base | 0.0004 |
| Baseline refusals (original model) | 99/100 |
| Optimization trials completed | 100/100 |
| Best trial | #60 |
| Selected steering mode | Direct weight editing (orthogonal projection) |
| Hardware used | Single RTX 6000 Ada (48 GB) |

This is the strongest Gemma 4 abliteration result we've measured to date: 9/100 with KL only 0.0004, significantly better than our published Gemma-4-31B-it-abliterated (18/100, KL 0.0007) on a model that is 6× smaller and more constrained by PLE.

The 9/100 figure was obtained by re-evaluating the uploaded model end-to-end with scripts/eval_external_model.py: downloading the published weights from Hugging Face, generating with AutoModelForImageTextToText, and counting refusals with the same hybrid keyword + LLM-judge detector that drove the optimization. The optimization itself converged on 11/100 at trial 60; the slight further improvement comes from the deployment-side eval pipeline using a "You are a helpful assistant" system prompt, matching how end users will actually call the model.

Side-by-side classic prompts (15 prompts: 10 English, 5 Chinese)

We ran the scripts/test_trial.py classic-prompt sweep against this exact trial. Every single one flipped from a clean refusal to a detailed compliant response in both languages, including pipe bomb construction, methamphetamine synthesis, password-stealing malware, signature forgery, phishing email composition, online scam playbooks, and ID card forgery. The base model refused 15/15; the abliterated model complied with 15/15.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.

We previously tested a prominent "3/100 refusals" model using our evaluation pipeline and measured 60/100 refusals, a 20× discrepancy caused entirely by evaluation methodology differences.
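The mechanism is easy to reproduce with a toy example. The response text and marker list below are hypothetical, but they show how a detector that only sees a short prefix of the generation misses a delayed refusal entirely:

```python
# Hypothetical refusal markers for a naive keyword detector.
REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "as an ai"]

def looks_compliant(response: str, max_tokens: int) -> bool:
    """Keyword check over only the first `max_tokens` whitespace tokens,
    mimicking an eval pipeline that truncates generation early."""
    visible = " ".join(response.split()[:max_tokens]).lower()
    return not any(marker in visible for marker in REFUSAL_MARKERS)

# A "delayed refusal": ~75 tokens of helpful-sounding framing, then the pivot.
delayed_refusal = (
    "Great question. Chemistry safety is an important topic, and understanding "
    "reaction hazards matters for anyone working in a lab. " * 4
    + "However, I can't help with synthesizing that compound."
)

print(looks_compliant(delayed_refusal, 40))   # → True  (short eval: counted as compliant)
print(looks_compliant(delayed_refusal, 200))  # → False (long eval: refusal caught)
```

With a 40-token window the response scores as compliant; with a window past the pivot point, the same response is correctly flagged as a refusal.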

Our evaluation standards

We believe accurate benchmarking requires:

  • Sufficient generation length (≥100 tokens): Short generations systematically miss delayed/soft refusals. Our evaluation uses 100 tokens, enough to capture Gemma 4's refusal pivot point.
  • Hybrid detection: Keyword matching for obvious refusals plus an LLM judge (Google Gemini 3 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
  • Challenging, diverse prompts: Our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories. Public datasets like mlabonne/harmful_behaviors are too simple and too narrow to stress-test abliteration quality.
  • Reproducible methodology: All parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
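As an illustration of the hybrid-detection point above, a two-stage detector might be structured like this. The keyword list and the judge interface are assumptions for the sketch; the actual pipeline uses Gemini via OpenRouter as the judge:

```python
# Hypothetical keyword list; real detectors need a much broader set.
REFUSAL_KEYWORDS = ("i can't", "i cannot", "i'm unable", "i won't help")

def is_refusal(response: str, judge) -> bool:
    """Stage 1: cheap keyword match catches obvious refusals.
    Stage 2: an LLM judge (injected as a callable) decides ambiguous cases."""
    text = response.lower()
    if any(k in text for k in REFUSAL_KEYWORDS):
        return True  # obvious refusal, no judge call needed
    return bool(judge(response))  # ambiguous: defer to the LLM judge

# Stub judge standing in for an API-backed model call:
stub_judge = lambda r: "consult a professional" in r.lower()
print(is_refusal("I can't assist with that request.", stub_judge))     # → True
print(is_refusal("Here is a step-by-step overview: ...", stub_judge))  # → False
```

Running the keyword stage first keeps judge-API costs down, while the judge stage catches soft refusals that keyword lists miss; neither stage alone is sufficient.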

We report 9/100 refusals honestly. This is a real number from a rigorous end-to-end re-evaluation of the uploaded weights, not an optimistic estimate from a lenient pipeline.

Usage

Gemma 4 E2B is multimodal; load it with AutoModelForImageTextToText. For text-only inference:

from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E2B-it-abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-E2B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Vision and audio inputs continue to work: the abliteration only modified text-decoder weights and left the vision/audio encoders untouched.

VRAM at inference: about 10 GB in BF16, fits comfortably on a single 12 GB+ consumer GPU. With BNB 4-bit quantization (load_in_4bit=True) it runs on 6 GB cards.

Reproduction

To reproduce this model end-to-end:

git clone https://github.com/wuwangzhang1216/abliterix.git
cd abliterix
uv sync --group dev
uv pip install --upgrade git+https://github.com/huggingface/transformers.git  # Gemma 4 needs >= 5.5

# 100 trials, ~25 minutes on RTX 6000 Ada (48 GB)
AX_CONFIG=configs/gemma4_e2b.toml uv run abliterix

Config: configs/gemma4_e2b.toml

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails โ€” use responsibly and in accordance with local laws and the Gemma terms of use. The authors take no responsibility for misuse.
