# Cohere Transcribe FP8

FP8-quantized version of CohereLabs/cohere-transcribe-03-2026 (2B parameter ASR model).

38% smaller (2.48 GB vs 4.0 GB) with no measurable accuracy loss and up to 496x realtime on NVIDIA DGX Spark.

## Quick Start

```python
from load_fp8 import load_model
from transformers.audio_utils import load_audio

# Load with native FP8 compute + torch.compile (fastest)
model, processor = load_model("./", compile=True)

# Or load without torchao (dequantizes to BF16, still accurate)
# model, processor = load_model("./", native_fp8=False)

audio = load_audio("your_audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256, cache_implementation="static")
text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)
```

## Requirements

```bash
# Minimum (BF16 dequantized mode)
pip install transformers safetensors torch

# Full FP8 + compile (recommended)
pip install transformers safetensors torch torchao
```

## How It Works

Encoder linear weights are stored as FP8 (`float8_e4m3fn`) data + per-tensor scales in standard safetensors format. The custom `load_fp8.py` loader supports two modes:

| Mode | Flag | Needs torchao | Speed | Size in memory |
|---|---|---|---|---|
| FP8 + compile (recommended) | `compile=True` | Yes | RTFx 496 | ~2 GB |
| FP8 native | `native_fp8=True` | Yes | RTFx 47 | ~2 GB |
| BF16 dequantized | `native_fp8=False` | No | RTFx 70 | ~4 GB |

FP8 without compile is slower due to kernel dispatch overhead. Always use `compile=True` for production.
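The fallback between the two modes can be automated at load time; a minimal sketch, assuming `load_model` from `load_fp8.py` as in Quick Start (the flags follow the mode table above):

```python
import importlib.util

# Prefer FP8 + compile when torchao is installed; otherwise fall back
# to the BF16-dequantized mode, which needs no extra dependencies.
has_torchao = importlib.util.find_spec("torchao") is not None
kwargs = {"compile": True} if has_torchao else {"native_fp8": False}
print(kwargs)

# from load_fp8 import load_model
# model, processor = load_model("./", **kwargs)
```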

## WER Comparison (Open ASR Leaderboard)

Evaluated on the HuggingFace Open ASR Leaderboard datasets. FP8 quantization causes no measurable accuracy degradation.

| Dataset | FP8 (this model) | Original BF16 | Reported (A100) |
|---|---|---|---|
| LibriSpeech clean | 1.27% | 1.26% | 1.25% |
| LibriSpeech other | 2.35% | 2.37% | 2.37% |
| SPGISpeech | 2.75% | -- | 3.08% |
| TED-LIUM | 3.47% | 3.49% | 2.49% |
| VoxPopuli | 5.57% | -- | 5.87% |
| AMI | 9.01% | 8.99% | 8.15% |
| Earnings-22 | 11.0% | -- | 10.84% |
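For reference, WER is the word-level edit distance divided by the number of reference words. A minimal self-contained implementation (illustrative only; the leaderboard's scoring pipeline additionally normalizes text before scoring):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein distance over word sequences
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```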

## Performance on NVIDIA DGX Spark (GB10)

Benchmarked on NVIDIA DGX Spark with GB10 GPU (SM 12.1), CUDA 13.0, PyTorch 2.11.

### Real-World Transcription Speed

Tested on a 29.4-minute audiobook (LibriVox, "The Sign of the Cross", Chapter 1):

| Run | Time | RTFx |
|---|---|---|
| Cold (no warmup) | 11.46 s | 154x |
| Warmed up (run 1) | 3.74 s | 471x |
| Warmed up (run 2) | 3.55 s | 496x |
| Warmed up (run 3) | 3.56 s | 495x |

29.4 minutes of audio transcribed in 3.5 seconds.
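RTFx here is simply seconds of audio processed per second of wall-clock time; a quick sanity check of the numbers above:

```python
# RTFx = audio duration / wall-clock transcription time
audio_seconds = 29.4 * 60   # 29.4-minute audiobook
wall_seconds = 3.55         # best warmed-up run
rtfx = audio_seconds / wall_seconds
print(f"RTFx {rtfx:.0f}")   # ~497, matching the ~496x reported above
```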

### Benchmark Comparison

| Config | Short (5.4 s) | Long (65.3 s) |
|---|---|---|
| FP8 + compile + static cache | RTFx 77 | RTFx 486 |
| BF16 baseline | RTFx 71 | RTFx 311 |
| ONNX Runtime INT8 (CPU only) | RTFx 14 | RTFx 13 |

## Model Size

| Format | Size | Reduction |
|---|---|---|
| Original (FP16) | 4.0 GB | -- |
| This model (FP8 safetensors) | 2.48 GB | 38% |
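The reduction figure follows directly from the two sizes:

```python
orig_gb, fp8_gb = 4.0, 2.48
reduction_pct = (orig_gb - fp8_gb) / orig_gb * 100
print(f"{reduction_pct:.0f}% smaller")  # 38% smaller
```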

## Running on DGX Spark

```bash
# Clone this repo
git clone https://huggingface.co/<your-username>/cohere-transcribe-fp8
cd cohere-transcribe-fp8

# Install deps
pip install transformers safetensors torch torchao

# Set triton cache (needed for torch.compile on Spark)
export TRITON_CACHE_DIR=/tmp/triton_cache

# Run
python load_fp8.py --model_path . --compile --language en
```

First run takes ~40 s for compile warmup. Subsequent inference is near-instant.

For custom audio:

```bash
python load_fp8.py --model_path . --compile --audio your_file.wav --language en
```

## Quantization Details

- **Method:** torchao `Float8DynamicActivationFloat8WeightConfig` (post-training, no calibration needed)
- **Scope:** encoder linear layers only (433 layers); decoder, convolutions, norms, and embeddings remain BF16
- **Storage:** each quantized weight is stored as `{name}.qdata` (`float8_e4m3fn`) + `{name}.scale` (float32) in safetensors
- **Reconstruction:** `load_fp8.py` dequantizes to BF16, then re-quantizes with torchao for native FP8 compute

## Supported Languages

Arabic, Chinese (Mandarin), Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Spanish, Vietnamese

## Citation

Base model by Cohere ([CohereLabs/cohere-transcribe-03-2026](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026)). FP8 quantization and DGX Spark benchmarks by this repository.

## License

Apache 2.0 (same as the base model)
