Block-wise FP8 quantization of Qwen/Qwen3-Omni-30B-A3B-Instruct.
| Property | Value |
|---|---|
| Original Size | 70.52 GB |
| Quantized Size | 35 GB |
| Compression | 1.89x |
| Quantization | Block-wise FP8 E4M3 (128x128 blocks) |
| Format | SafeTensors with weight_scale_inv scales |
Quantized to FP8:

Kept in BF16:
- `thinker.visual`
- `thinker.audio_tower`
- `code2wav`

Usage with vLLM:

```python
from vllm import LLM

llm = LLM(
    model="marksverdhei/Qwen3-Omni-30B-A3B-FP8",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    trust_remote_code=True,
)
```
Block-wise quantization with 128x128 blocks provides better precision than per-tensor quantization while maintaining good compression. Each block has its own scale factor stored as weight_scale_inv (inverse scale for efficient multiplication during inference).
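The scheme above can be sketched in plain NumPy. This is an illustrative emulation, not the actual kernel: NumPy has no FP8 dtype, so rounding to the nearest integer after scaling stands in for rounding to the nearest E4M3-representable value, and the function names (`quantize_blockwise`, `dequantize_blockwise`) are hypothetical. What it does show is the real structure: one scale per 128x128 block, stored inverted as `weight_scale_inv` so dequantization is a single multiply.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3
BLOCK = 128           # block size used by this checkpoint (128x128)

def quantize_blockwise(w: np.ndarray):
    """Quantize a 2-D weight matrix block by block.

    Returns the quantized values (kept as float32 here, since NumPy has
    no FP8 dtype) and the per-block inverse scales, one scalar per
    128x128 block -- the layout of `weight_scale_inv`.
    """
    rows, cols = w.shape
    n_r = (rows + BLOCK - 1) // BLOCK
    n_c = (cols + BLOCK - 1) // BLOCK
    q = np.empty_like(w, dtype=np.float32)
    scale_inv = np.empty((n_r, n_c), dtype=np.float32)
    for i in range(n_r):
        for j in range(n_c):
            blk = w[i * BLOCK:(i + 1) * BLOCK, j * BLOCK:(j + 1) * BLOCK]
            amax = float(np.abs(blk).max())
            scale = FP8_E4M3_MAX / max(amax, 1e-12)  # map block into FP8 range
            # Integer rounding stands in for FP8 E4M3 rounding in this sketch.
            q[i * BLOCK:(i + 1) * BLOCK, j * BLOCK:(j + 1) * BLOCK] = np.clip(
                np.round(blk * scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
            scale_inv[i, j] = 1.0 / scale  # stored inverted: dequant is a multiply
    return q, scale_inv

def dequantize_blockwise(q: np.ndarray, scale_inv: np.ndarray) -> np.ndarray:
    # Broadcast each per-block scalar over its 128x128 block, then multiply.
    s = np.kron(scale_inv, np.ones((BLOCK, BLOCK), dtype=np.float32))
    return q * s[:q.shape[0], :q.shape[1]]
```

Because each block is scaled to its own dynamic range, an outlier in one block cannot crush the precision of the rest of the tensor, which is the advantage over a single per-tensor scale.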
This is a quantized version of Qwen/Qwen3-Omni-30B-A3B-Instruct.
Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech.
For the full feature list and usage details, see the original model card.
Base model: Qwen/Qwen3-Omni-30B-A3B-Instruct