Block-wise FP8 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct.
| Property | Value |
|---|---|
| Original Size | 70.52 GB |
| Quantized Size | 35 GB |
| Compression | 1.89x |
| Quantization | Block-wise FP8 E4M3 (128x128 blocks) |
| Format | SafeTensors with weight_scale_inv scales |
Quantized to FP8:

Kept in BF16:
- `thinker.visual`
- `thinker.audio_tower`
- `code2wav`

Usage with vLLM:

```python
from vllm import LLM

llm = LLM(
    model="marksverdhei/Qwen3-Omni-30B-A3B-FP8",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    trust_remote_code=True,
)
```
Block-wise quantization with 128x128 blocks provides better precision than per-tensor quantization while maintaining good compression. Each block has its own scale factor stored as weight_scale_inv (inverse scale for efficient multiplication during inference).
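The scheme above can be sketched in plain NumPy. This is an illustrative emulation, not the actual kernel: NumPy has no FP8 dtype, so rounding to the nearest integer after scaling stands in for rounding to the nearest E4M3-representable value, and the function names (`quantize_blockwise`, `dequantize_blockwise`) are hypothetical. What it does show is the real structure: one scale per 128x128 block, stored inverted as `weight_scale_inv` so dequantization is a single multiply.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3
BLOCK = 128           # block size used by this checkpoint (128x128)

def quantize_blockwise(w: np.ndarray):
    """Quantize a 2-D weight matrix block by block.

    Returns the quantized values (kept as float32 here, since NumPy has
    no FP8 dtype) and the per-block inverse scales, one scalar per
    128x128 block -- the layout of `weight_scale_inv`.
    """
    rows, cols = w.shape
    n_r = (rows + BLOCK - 1) // BLOCK
    n_c = (cols + BLOCK - 1) // BLOCK
    q = np.empty_like(w, dtype=np.float32)
    scale_inv = np.empty((n_r, n_c), dtype=np.float32)
    for i in range(n_r):
        for j in range(n_c):
            blk = w[i * BLOCK:(i + 1) * BLOCK, j * BLOCK:(j + 1) * BLOCK]
            amax = float(np.abs(blk).max())
            scale = FP8_E4M3_MAX / max(amax, 1e-12)  # map block into FP8 range
            # Integer rounding stands in for FP8 E4M3 rounding in this sketch.
            q[i * BLOCK:(i + 1) * BLOCK, j * BLOCK:(j + 1) * BLOCK] = np.clip(
                np.round(blk * scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
            scale_inv[i, j] = 1.0 / scale  # stored inverted: dequant is a multiply
    return q, scale_inv

def dequantize_blockwise(q: np.ndarray, scale_inv: np.ndarray) -> np.ndarray:
    # Broadcast each per-block scalar over its 128x128 block, then multiply.
    s = np.kron(scale_inv, np.ones((BLOCK, BLOCK), dtype=np.float32))
    return q * s[:q.shape[0], :q.shape[1]]
```

Because each block is scaled to its own dynamic range, an outlier in one block cannot crush the precision of the rest of the tensor, which is the advantage over a single per-tensor scale.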
This is a quantized version of Qwen/Qwen3-Omni-30B-A3B-Instruct.
Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech.
For the full feature list and usage details, see the original model card.
Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct