# Gemma 4 31B-it - RotorQuant AWQ 8-bit
8-bit AWQ-quantized version of google/gemma-4-31B-it (31B dense, instruction-tuned) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) protects the weights that matter most to activations, making it well suited to GPU inference. The 8-bit variant keeps quality very close to FP16 while halving VRAM usage, and RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant.
Approximate model size: ~31 GB
Note: RotorQuant KV-cache modes (`planar3`, `iso3`) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.
## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Parameters | ~31 billion |
| Architecture | Dense transformer, instruction-tuned |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 8-bit (~31 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (planar3 / iso3) |
| Framework | transformers + AutoAWQ / vLLM |
## Quickstart

### AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit")

messages = [{"role": "user", "content": "Draft a launch announcement for a new API product."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
### vLLM

```bash
vllm serve majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```
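Once the server is up, it can be queried through vLLM's OpenAI-compatible chat-completions endpoint. A minimal sketch (host and port are vLLM's defaults; the request is left commented out so the snippet does not require a running server):

```python
# Build a request for vLLM's OpenAI-compatible endpoint.
# Payload shape follows the OpenAI chat-completions API that `vllm serve` exposes.
import json
import urllib.request

payload = {
    "model": "majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit",
    "messages": [
        {"role": "user", "content": "Summarize AWQ in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # vLLM default host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```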
### With RotorQuant KV cache (fork)

```python
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
```
## What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method using block-diagonal Clifford-algebra rotors. Combined with AWQ 8-bit weights, it delivers near-FP16 quality at roughly half the VRAM cost, with RotorQuant's compressed KV cache further reducing long-context memory.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
- `planar3` / `iso3` 3-bit KV-cache modes
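The rotor idea can be shown in miniature. The toy NumPy sketch below (an illustration of the underlying mechanics, not RotorQuant's actual kernels) applies block-diagonal 2x2 rotations to pairs of KV channels before a low-bit quantize/dequantize roundtrip; because the rotations are orthogonal they are exactly invertible, so quantization can happen in the rotated coordinates and be undone afterwards:

```python
# Toy sketch of rotor-style KV quantization: block-diagonal 2x2 rotations
# over channel pairs, then 3-bit quantize/dequantize in rotated coordinates.
# NOT RotorQuant's implementation -- just the idea of invertible rotors.
import numpy as np

rng = np.random.default_rng(0)

def rotate_pairs(x, theta):
    """Apply the same 2x2 rotation (a rotor) to each consecutive channel pair."""
    c, s = np.cos(theta), np.sin(theta)
    out = x.copy()
    out[..., 0::2] = c * x[..., 0::2] - s * x[..., 1::2]
    out[..., 1::2] = s * x[..., 0::2] + c * x[..., 1::2]
    return out

def quantize_3bit(x):
    """Symmetric 3-bit quantize + dequantize with a single per-tensor scale."""
    scale = np.abs(x).max() / 3.0
    return np.clip(np.round(x / scale), -4, 3) * scale

# Fake KV slice: small values plus one outlier channel.
kv = rng.normal(0, 0.1, size=(8, 16))
kv[:, 0] += 3.0

theta = np.pi / 4
rotated = rotate_pairs(kv, theta)

# Orthogonal, so exactly invertible.
assert np.allclose(rotate_pairs(rotated, -theta), kv, atol=1e-10)

# Quantize in rotated space, then rotate back.
deq = rotate_pairs(quantize_3bit(rotated), -theta)
print("mean |error| after 3-bit roundtrip:", np.abs(deq - kv).mean())
```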
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
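How these per-phase numbers combine depends on the workload mix. A quick back-of-the-envelope, where the 10 s prefill / 30 s decode baseline split is an arbitrary assumption for illustration:

```python
# End-to-end speedup implied by the table's per-phase numbers,
# for an assumed TurboQuant baseline of 10 s prefill + 30 s decode.
prefill_base, decode_base = 10.0, 30.0       # assumed baseline times (s)
prefill_rq = prefill_base / 5.3              # 5.3x faster prefill
decode_rq = decode_base / 1.28               # 28% faster decode

speedup = (prefill_base + decode_base) / (prefill_rq + decode_rq)
print(f"end-to-end speedup: {speedup:.2f}x")  # ~1.58x for this mix
```

Decode-heavy workloads see closer to the 1.28x decode figure; prefill-heavy workloads (long prompts, short answers) approach the 5.3x figure.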
## AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.
## Memory Estimates (Gemma 4 31B-it)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~62 GB | 80 GB+ (A100/H100) |
| AWQ 8-bit | ~31 GB | 40 GB+ (A100 40/80GB, L40S, 2x RTX 4090) |
| AWQ 4-bit | ~17 GB | 24 GB+ |
Best deployed on server-class GPUs (A100 40/80GB, L40S, H100) or dual RTX 4090 with tensor parallelism.
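The table's sizes follow directly from bytes per parameter; the 4-bit row is a bit above the raw figure because quantization scales, zero points, and higher-precision embeddings add overhead:

```python
# Weight-size estimates from bytes per parameter (overhead ignored).
params = 31e9  # ~31B parameters

fp16 = params * 2.0 / 1e9   # 2 bytes/param  -> ~62 GB
awq8 = params * 1.0 / 1e9   # 1 byte/param   -> ~31 GB
awq4 = params * 0.5 / 1e9   # 0.5 bytes/param -> 15.5 GB raw (~17 GB with overhead)

print(f"FP16: {fp16:.0f} GB, AWQ-8: {awq8:.0f} GB, AWQ-4: {awq4:.1f} GB")
```

Note these are weight sizes only; KV cache and activations need additional VRAM on top, which is where RotorQuant's compressed KV cache helps at long context.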
## Hardware Requirements
- NVIDIA GPU with >=40 GB VRAM single-card, or 2x 24 GB cards with TP=2
- Recommended: A100 40GB, A100 80GB, L40S 48GB, H100 80GB
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
- For RotorQuant KV cache: scrya-com/rotorquant fork
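To check the Marlin-kernel requirement above, compare the GPU's compute capability against (7, 5). A small helper (the torch call is commented out since it needs a CUDA device):

```python
# Marlin kernels require compute capability >= 7.5 (Turing or newer).
def supports_marlin(capability: tuple[int, int]) -> bool:
    """True if the GPU's compute capability is Turing (7.5) or newer."""
    return capability >= (7, 5)

# On a CUDA machine with PyTorch:
# import torch
# print(supports_marlin(torch.cuda.get_device_capability(0)))

print(supports_marlin((8, 6)))  # Ampere (e.g. RTX 30-series) -> True
print(supports_marlin((7, 0)))  # Volta (e.g. V100) -> False
```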
## See Also
- google/gemma-4-31B-it -- Base model
- majentik/gemma-4-31B-it-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/gemma-4-31B-it-RotorQuant-AWQ-4bit -- AWQ 4-bit variant
- majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit -- TurboQuant AWQ 8-bit variant
- majentik/gemma-4-31B-it-RotorQuant-MLX-8bit -- MLX variant (Apple Silicon)
- RotorQuant GitHub
- llama-cpp-turboquant fork
- AutoAWQ
- vLLM