# Qwen3-Omni-30B-A3B-FP8

Block-wise FP8 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct.
## Model Details
| Property | Value |
|---|---|
| Original Size | 70.52 GB |
| Quantized Size | 35 GB |
| Compression | 1.89x |
| Quantization | Block-wise FP8 E4M3 (128x128 blocks) |
| Format | SafeTensors with `weight_scale_inv` scales |
## Components

Quantized to FP8:

- Thinker (48-layer MoE) - main language model
- Talker (20-layer MoE) - audio generation model

Kept in BF16:

- Vision encoder (`thinker.visual`)
- Audio tower (`thinker.audio_tower`)
- Code2Wav decoder (`code2wav`)
- Embedding layers
- LayerNorm layers
- MoE gate routing layers
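The split above is name-based: only Thinker/Talker linear weights are converted, and everything else stays in BF16. A minimal sketch of how such a filter might look is below; the helper and the exact name patterns are assumptions for illustration, not the script actually used to produce this checkpoint.

```python
# Hypothetical name patterns for components kept in BF16, per the list above.
# Actual parameter names may differ between checkpoint versions.
KEEP_BF16_PATTERNS = (
    "thinker.visual",       # vision encoder
    "thinker.audio_tower",  # audio tower
    "code2wav",             # Code2Wav decoder
    "embed",                # embedding layers
    "norm",                 # LayerNorm layers
    ".mlp.gate.",           # MoE gate routing layers (router, not expert gate_proj)
)

def should_quantize(param_name: str) -> bool:
    """Return True if a parameter would be converted to FP8 under this scheme."""
    return not any(pattern in param_name for pattern in KEEP_BF16_PATTERNS)

# Expert projection weights are quantized; vision and router weights are not.
print(should_quantize("thinker.model.layers.0.mlp.experts.3.down_proj.weight"))  # True
print(should_quantize("thinker.visual.blocks.0.attn.qkv.weight"))                # False
```

Note the router pattern `.mlp.gate.` is deliberately narrow so it excludes the MoE router but not expert `gate_proj` weights, which are part of the quantized MLP.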
## Usage with vLLM

```python
from vllm import LLM

llm = LLM(
    model="marksverdhei/Qwen3-Omni-30B-A3B-FP8",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    trust_remote_code=True,
)
```
## Requirements
- vLLM >= 0.13.0 with Qwen3-Omni support
- 2x 24GB GPUs (e.g., RTX 3090) or equivalent
- ~35 GB disk space
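The 2x 24 GB recommendation follows from rough arithmetic on the settings in the vLLM example. This is a back-of-the-envelope sketch; actual usage depends on the vLLM version, kernels, and the BF16 components.

```python
# Approximate per-GPU memory budget under the example settings.
quantized_size_gb = 35          # total FP8 checkpoint size from the table
tensor_parallel = 2             # weights are sharded across 2 GPUs
gpu_vram_gb = 24                # e.g. RTX 3090
gpu_memory_utilization = 0.85   # fraction of VRAM vLLM may use

weights_per_gpu = quantized_size_gb / tensor_parallel   # 17.5 GB
budget_per_gpu = gpu_vram_gb * gpu_memory_utilization   # 20.4 GB
kv_headroom = budget_per_gpu - weights_per_gpu          # ~2.9 GB for KV cache

print(f"{weights_per_gpu:.1f} GB weights/GPU, ~{kv_headroom:.1f} GB KV headroom")
```

The tight KV-cache headroom is why the example caps `max_model_len` at 4096.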
## Quantization Details

Block-wise quantization with 128x128 blocks provides better precision than per-tensor quantization while maintaining good compression. Each block has its own scale factor, stored as `weight_scale_inv` (an inverse scale, so inference kernels can dequantize with a multiply rather than a divide).
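The scale bookkeeping can be illustrated with a toy round-trip. This is a minimal sketch under two simplifications: a tiny block instead of 128x128, and no actual rounding to the E4M3 grid (so the round-trip is exact here, whereas real FP8 introduces per-block rounding error). It assumes dequantization multiplies by the stored `weight_scale_inv`, per the description above.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def quantize_block(block):
    """Scale a block so its max magnitude maps into the FP8 range."""
    amax = max(abs(v) for row in block for v in row) or 1.0
    scale_inv = amax / FP8_E4M3_MAX          # stored as weight_scale_inv
    q = [[v / scale_inv for v in row] for row in block]  # values within +/-448
    return q, scale_inv

def dequantize_block(q, scale_inv):
    """Recover approximate weights: w ~= q * weight_scale_inv (one multiply)."""
    return [[v * scale_inv for v in row] for row in q]

block = [[0.12, -0.40], [0.03, 0.25]]
q, scale_inv = quantize_block(block)
recovered = dequantize_block(q, scale_inv)
```

Because each 128x128 block gets its own `scale_inv`, an outlier in one block does not flatten the dynamic range of the rest of the tensor, which is the precision advantage over per-tensor scaling.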
## Original Model
This is a quantized version of Qwen/Qwen3-Omni-30B-A3B-Instruct.
Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech.
Key features:
- State-of-the-art across modalities
- Supports 119 text languages, 19 speech input languages, and 10 speech output languages
- MoE-based Thinker-Talker architecture
- Real-time audio/video interaction
For full details, see the original model card.