Gemma 4 31B-it - RotorQuant AWQ 8-bit

8-bit AWQ-quantized version of google/gemma-4-31B-it (31B-parameter dense, instruction-tuned) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) protects the most activation-salient weight channels during quantization, making it well suited to GPU inference. The 8-bit variant keeps quality very close to FP16 while halving VRAM usage, and RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant.

Approximate model size: ~31 GB

Note: RotorQuant KV cache modes (planar3, iso3) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.

Model Specifications

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Parameters | ~31 billion |
| Architecture | Dense transformer, instruction-tuned |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 8-bit (~31 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (planar3 / iso3) |
| Framework | transformers + AutoAWQ / vLLM |
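
To make the "Group Size 128" entry concrete, here is a minimal sketch of group-wise symmetric quantization: each group of 128 weights shares one scale, so per-weight error is bounded by half a quantization step. This is an illustration of the scheme, not the AutoAWQ implementation (AWQ additionally applies activation-aware channel scaling before this step).

```python
import numpy as np

def quantize_groupwise(w, group_size=128, bits=8):
    """Symmetric per-group quantization: returns int8 codes and per-group scales."""
    w = w.reshape(-1, group_size)
    # One scale per group of 128 weights, chosen so the largest magnitude maps to the top code.
    scales = np.abs(w).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(w / scales), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scales

def dequantize_groupwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)  # 8 groups of 128
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s)

# Round-to-nearest error is at most half a step of the largest group scale.
max_err = np.abs(w - w_hat).max()
assert max_err <= s.max() / 2 + 1e-6
```

At 8 bits the shared-scale error is small enough that quality stays close to FP16, which is why the 8-bit variant is positioned as the near-lossless option above.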

Quickstart

AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit")

messages = [{"role": "user", "content": "Draft a launch announcement for a new API product."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

vLLM

```shell
vllm serve majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```
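
Once serving, vLLM exposes an OpenAI-compatible API. A minimal request sketch using only the standard library; the `localhost:8000` endpoint assumes vLLM's default host and port, so adjust for your deployment:

```python
import json
import urllib.request

# Build a chat-completion request for the vLLM OpenAI-compatible endpoint.
# The model name must match the repo id passed to `vllm serve`.
payload = {
    "model": "majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit",
    "messages": [{"role": "user", "content": "Draft a launch announcement for a new API product."}],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and read the reply:
# response = urllib.request.urlopen(req)
# print(json.loads(response.read())["choices"][0]["message"]["content"])
```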

With RotorQuant KV cache (fork)

```python
# Requires the RotorQuant fork (see Hardware Requirements); `model` is the
# AutoAWQ-loaded model from the quickstart above.
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
```

What is RotorQuant?

RotorQuant is a high-performance KV-cache quantization method using block-diagonal Clifford-algebra rotors. Combined with AWQ 8-bit weights, it delivers near-FP16 quality at roughly half the VRAM cost, with RotorQuant's compressed KV cache further reducing long-context memory.

Key advantages over TurboQuant:

  • 5.3x faster prefill
  • 28% faster decode
  • Equivalent memory savings
  • planar3 / iso3 3-bit KV cache modes
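
The core idea behind rotation-based KV-cache quantization can be sketched in a few lines: rotate the cache vectors with a cheap block-diagonal orthogonal transform before low-bit quantization, then undo the rotation after dequantization. The sketch below uses 2x2 Givens rotations as the simplest block-diagonal rotors; RotorQuant's actual Clifford-algebra rotor construction and fused kernels are not reproduced here.

```python
import numpy as np

def block_rotation(dim, theta=np.pi / 4):
    """Block-diagonal orthogonal matrix built from 2x2 rotation blocks."""
    c, s = np.cos(theta), np.sin(theta)
    block = np.array([[c, -s], [s, c]])
    R = np.zeros((dim, dim))
    for i in range(0, dim, 2):
        R[i:i + 2, i:i + 2] = block
    return R

def quantize(x, bits=3):
    """Symmetric uniform quantizer (planar3/iso3 use 3-bit codes)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

dim = 8
R = block_rotation(dim)
k = np.array([3.0, -0.2, 0.1, 0.05, -0.1, 0.2, 4.0, -0.3])  # outlier-heavy key vector

# The rotation is orthogonal, so inverse-rotating recovers the vector exactly.
assert np.allclose(R.T @ (R @ k), k)

k_hat = R.T @ quantize(R @ k)  # rotate -> 3-bit quantize -> unrotate
baseline = quantize(k)         # 3-bit quantize without rotation
print(f"max abs error, plain 3-bit:   {np.abs(k - baseline).max():.3f}")
print(f"max abs error, rotated 3-bit: {np.abs(k - k_hat).max():.3f}")
```

Rotating mixes outlier channels with small ones, which tends to tighten the dynamic range each quantizer group has to cover; because the transform is orthogonal, it is exactly invertible and adds no reconstruction error of its own.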

KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv:2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |

AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.

Memory Estimates (Gemma 4 31B-it)

| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~62 GB | 80 GB+ (A100/H100) |
| AWQ 8-bit | ~31 GB | 40 GB+ (A100 40/80GB, L40S, 2x RTX 4090) |
| AWQ 4-bit | ~17 GB | 24 GB+ |

Best deployed on server-class GPUs (A100 40/80GB, L40S, H100) or dual RTX 4090 with tensor parallelism.
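
The sizes in the table are close to a back-of-the-envelope estimate: parameters times bits-per-weight, plus one FP16 scale per group of 128 for the quantized variants. Exact on-disk size depends on AWQ's packing (zeros, embeddings, and norm layers add a little), so treat this as a rough model:

```python
def weight_gb(params, bits, group_size=128, scale_bytes=2):
    """Rough weight-memory estimate: payload plus per-group FP16 scale overhead."""
    payload = params * bits / 8
    overhead = 0 if bits >= 16 else params / group_size * scale_bytes
    return (payload + overhead) / 1e9

for bits, label in [(16, "FP16"), (8, "AWQ 8-bit"), (4, "AWQ 4-bit")]:
    print(f"{label}: ~{weight_gb(31e9, bits):.0f} GB")
```

For 31B parameters this lands at roughly 62 GB for FP16 and 31 GB for 8-bit, matching the table; the 4-bit figure in the table is slightly above the raw estimate because of the extra packing overhead noted above.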

Hardware Requirements

  • NVIDIA GPU with >=40 GB VRAM single-card, or 2x 24 GB cards with TP=2
  • Recommended: A100 40GB, A100 80GB, L40S 48GB, H100 80GB
  • CUDA 12.x recommended
  • For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
  • For RotorQuant KV cache: scrya-com/rotorquant fork
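
To see why 3-bit KV-cache modes matter at long context, here is a rough KV-cache size calculation. The layer and head counts below are placeholders, not the actual Gemma 4 31B configuration (which this card does not state); substitute the values from the model's config.json.

```python
def kv_cache_gb(seq_len, layers, kv_heads, head_dim, bits):
    """KV-cache memory: 2 tensors (K and V) x layers x kv_heads x head_dim x seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8 / 1e9

cfg = dict(layers=48, kv_heads=8, head_dim=128)  # PLACEHOLDER config, not the real one
for bits, label in [(16, "FP16 cache"), (3, "RotorQuant 3-bit")]:
    print(f"{label} @ 8192 tokens: {kv_cache_gb(8192, bits=bits, **cfg):.2f} GB")
```

Whatever the real configuration, the ratio is fixed: a 3-bit cache needs 3/16 of the FP16 cache's memory, and the savings grow linearly with context length and batch size.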

See Also
