# Cohere Transcribe FP8

FP8-quantized version of `CohereLabs/cohere-transcribe-03-2026` (2B-parameter ASR model).
38% smaller (2.48 GB vs 4.0 GB) with no measurable accuracy loss, and up to 496x realtime on NVIDIA DGX Spark.
## Quick Start

```python
from load_fp8 import load_model

# Load with native FP8 compute + torch.compile (fastest)
model, processor = load_model("./", compile=True)

# Or load without torchao (dequantizes to BF16, still accurate)
# model, processor = load_model("./", native_fp8=False)

from transformers.audio_utils import load_audio

audio = load_audio("your_audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256, cache_implementation="static")
text = processor.decode(outputs[0], skip_special_tokens=True)
print(text)
```
## Requirements

```bash
# Minimum (BF16 dequantized mode)
pip install transformers safetensors torch

# Full FP8 + compile (recommended)
pip install transformers safetensors torch torchao
```
## How It Works

Encoder linear weights are stored as FP8 (`float8_e4m3fn`) data plus per-tensor scales in standard safetensors format. The custom `load_fp8.py` loader supports two modes — native FP8 or BF16 dequantized — with `torch.compile` optionally layered on top:
| Mode | Flag | Needs torchao | Speed | Size in memory |
|---|---|---|---|---|
| FP8 + compile (recommended) | `compile=True` | Yes | RTFx 496 | ~2 GB |
| FP8 native | `native_fp8=True` | Yes | RTFx 47 | ~2 GB |
| BF16 dequantized | `native_fp8=False` | No | RTFx 70 | ~4 GB |
FP8 without compile is actually slower than BF16 (RTFx 47 vs 70) due to per-op kernel dispatch overhead. Always use `compile=True` in production.
## WER Comparison (Open ASR Leaderboard)

Evaluated on the HuggingFace Open ASR Leaderboard datasets. FP8 quantization causes no measurable accuracy degradation: every difference vs the original BF16 model is within ±0.02 WER points.
| Dataset | FP8 (this model) | Original BF16 | Reported (A100) |
|---|---|---|---|
| LibriSpeech clean | 1.27% | 1.26% | 1.25% |
| LibriSpeech other | 2.35% | 2.37% | 2.37% |
| SPGISpeech | 2.75% | -- | 3.08% |
| TED-LIUM | 3.47% | 3.49% | 2.49% |
| VoxPopuli | 5.57% | -- | 5.87% |
| AMI | 9.01% | 8.99% | 8.15% |
| Earnings-22 | 11.0% | -- | 10.84% |
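For reference, the WER figures above are the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal dynamic-programming sketch (generic, not the leaderboard's exact scoring pipeline, which also applies text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> 1/6
```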
## Performance on NVIDIA DGX Spark (GB10)

Benchmarked on an NVIDIA DGX Spark with a GB10 GPU (SM 12.1), CUDA 13.0, PyTorch 2.11.
### Real-World Transcription Speed

Tested on a 29.4-minute audiobook (LibriVox, "The Sign of the Cross", Chapter 1):
| Run | Time | RTFx |
|---|---|---|
| Cold (no warmup) | 11.46s | 154x |
| Warmed up (run 1) | 3.74s | 471x |
| Warmed up (run 2) | 3.55s | 496x |
| Warmed up (run 3) | 3.56s | 495x |
29.4 minutes of audio transcribed in 3.5 seconds.
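The RTFx column is simply seconds of audio divided by wall-clock seconds; a quick check against the best warmed-up run (the small gap to the table's 496x comes from rounding the audio duration to 29.4 minutes):

```python
# RTFx (realtime factor) = seconds of audio processed per second of compute.
audio_seconds = 29.4 * 60  # 29.4-minute audiobook
wall_seconds = 3.55        # best warmed-up run from the table above
rtfx = audio_seconds / wall_seconds
print(f"RTFx = {rtfx:.0f}x")  # ~497x
```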
### Benchmark Comparison
| Config | Short (5.4s) | Long (65.3s) |
|---|---|---|
| FP8 + compile + static cache | RTFx 77 | RTFx 486 |
| BF16 baseline | RTFx 71 | RTFx 311 |
| ONNX Runtime INT8 (CPU only) | RTFx 14 | RTFx 13 |
## Model Size
| Format | Size | Reduction |
|---|---|---|
| Original (BF16) | 4.0 GB | -- |
| This model (FP8 safetensors) | 2.48 GB | 38% |
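A quick sanity check of the reduction figure, using the sizes from the table:

```python
# Reduction = 1 - quantized size / original size.
original_gb = 4.0  # BF16 checkpoint
fp8_gb = 2.48      # mixed FP8/BF16 checkpoint (encoder linears in FP8)
reduction_pct = (1 - fp8_gb / original_gb) * 100
print(f"{reduction_pct:.0f}% smaller")  # 38% smaller
```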
## Running on DGX Spark

```bash
# Clone this repo
git clone https://huggingface.co/<your-username>/cohere-transcribe-fp8
cd cohere-transcribe-fp8

# Install deps
pip install transformers safetensors torch torchao

# Set the Triton cache (needed for torch.compile on Spark)
export TRITON_CACHE_DIR=/tmp/triton_cache

# Run
python load_fp8.py --model_path . --compile --language en
```

The first run takes ~40s for compile warmup; subsequent inference is near-instant.

For custom audio:

```bash
python load_fp8.py --model_path . --compile --audio your_file.wav --language en
```
## Quantization Details

- **Method:** torchao `Float8DynamicActivationFloat8WeightConfig` (post-training, no calibration needed)
- **Scope:** Encoder linear layers only (433 layers). Decoder, convolutions, norms, and embeddings remain BF16.
- **Storage:** Each quantized weight is stored as `{name}.qdata` (float8_e4m3fn) + `{name}.scale` (float32) in safetensors
- **Reconstruction:** `load_fp8.py` dequantizes to BF16, then re-quantizes with torchao for native FP8 compute
## Supported Languages
Arabic, Chinese (Mandarin), Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Spanish, Vietnamese
## Citation

Base model by Cohere. FP8 quantization and DGX Spark benchmarks produced for this repository.
## License
Apache 2.0 (same as base model)