# Cohere Transcribe FP8

FP8-quantized version of CohereLabs/cohere-transcribe-03-2026 (2B parameter ASR model).

38% smaller (2.48 GB vs 4.0 GB) with no measurable accuracy loss and up to 496x realtime on NVIDIA DGX Spark.

## Quick Start

```python
from load_fp8 import load_model
from transformers.audio_utils import load_audio

# Load with native FP8 compute + torch.compile (fastest)
model, processor = load_model("./", compile=True)

# Or load without torchao (dequantizes to BF16, still accurate)
# model, processor = load_model("./", native_fp8=False)

audio = load_audio("your_audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256, cache_implementation="static")
text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)
```

## Requirements

```bash
# Minimum (BF16 dequantized mode)
pip install transformers safetensors torch

# Full FP8 + compile (recommended)
pip install transformers safetensors torch torchao
```

## How It Works

Encoder linear weights are stored as FP8 (`float8_e4m3fn`) data + per-tensor scales in standard safetensors format. The custom `load_fp8.py` loader supports two modes:

| Mode | Flag | Needs torchao | Speed | Size in memory |
|---|---|---|---|---|
| FP8 + compile (recommended) | `compile=True` | Yes | RTFx 496 | ~2 GB |
| FP8 native | `native_fp8=True` | Yes | RTFx 47 | ~2 GB |
| BF16 dequantized | `native_fp8=False` | No | RTFx 70 | ~4 GB |

FP8 without compile is slower due to kernel dispatch overhead. Always use `compile=True` for production.
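The fallback between the two modes can be automated at load time; a minimal sketch, assuming `load_model` from `load_fp8.py` as in Quick Start (the flags follow the mode table above):

```python
import importlib.util

# Prefer FP8 + compile when torchao is installed; otherwise fall back
# to the BF16-dequantized mode, which needs no extra dependencies.
has_torchao = importlib.util.find_spec("torchao") is not None
kwargs = {"compile": True} if has_torchao else {"native_fp8": False}
print(kwargs)

# from load_fp8 import load_model
# model, processor = load_model("./", **kwargs)
```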

## WER Comparison (Open ASR Leaderboard)

Evaluated on the HuggingFace Open ASR Leaderboard datasets. FP8 quantization causes no measurable accuracy degradation.

| Dataset | FP8 (this model) | Original BF16 | Reported (A100) |
|---|---|---|---|
| LibriSpeech clean | 1.27% | 1.26% | 1.25% |
| LibriSpeech other | 2.35% | 2.37% | 2.37% |
| SPGISpeech | 2.75% | -- | 3.08% |
| TED-LIUM | 3.47% | 3.49% | 2.49% |
| VoxPopuli | 5.57% | -- | 5.87% |
| AMI | 9.01% | 8.99% | 8.15% |
| Earnings-22 | 11.0% | -- | 10.84% |
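For reference, WER is the word-level edit distance divided by the number of reference words. A minimal self-contained implementation (illustrative only; the leaderboard's scoring pipeline additionally normalizes text before scoring):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein distance over word sequences
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```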

## Performance on NVIDIA DGX Spark (GB10)

Benchmarked on NVIDIA DGX Spark with GB10 GPU (SM 12.1), CUDA 13.0, PyTorch 2.11.

### Real-World Transcription Speed

Tested on a 29.4-minute audiobook (LibriVox, "The Sign of the Cross", Chapter 1):

| Run | Time | RTFx |
|---|---|---|
| Cold (no warmup) | 11.46 s | 154x |
| Warmed up (run 1) | 3.74 s | 471x |
| Warmed up (run 2) | 3.55 s | 496x |
| Warmed up (run 3) | 3.56 s | 495x |

29.4 minutes of audio transcribed in 3.5 seconds.
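RTFx here is simply seconds of audio processed per second of wall-clock time; a quick sanity check of the numbers above:

```python
# RTFx = audio duration / wall-clock transcription time
audio_seconds = 29.4 * 60   # 29.4-minute audiobook
wall_seconds = 3.55         # best warmed-up run
rtfx = audio_seconds / wall_seconds
print(f"RTFx {rtfx:.0f}")   # ~497, matching the ~496x reported above
```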

### Benchmark Comparison

| Config | Short (5.4 s) | Long (65.3 s) |
|---|---|---|
| FP8 + compile + static cache | RTFx 77 | RTFx 486 |
| BF16 baseline | RTFx 71 | RTFx 311 |
| ONNX Runtime INT8 (CPU only) | RTFx 14 | RTFx 13 |

## Model Size

| Format | Size | Reduction |
|---|---|---|
| Original (FP16) | 4.0 GB | -- |
| This model (FP8 safetensors) | 2.48 GB | 38% |
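The reduction figure follows directly from the two sizes:

```python
orig_gb, fp8_gb = 4.0, 2.48
reduction_pct = (orig_gb - fp8_gb) / orig_gb * 100
print(f"{reduction_pct:.0f}% smaller")  # 38% smaller
```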

## Running on DGX Spark

```bash
# Clone this repo
git clone https://huggingface.co/<your-username>/cohere-transcribe-fp8
cd cohere-transcribe-fp8

# Install deps
pip install transformers safetensors torch torchao

# Set triton cache (needed for torch.compile on Spark)
export TRITON_CACHE_DIR=/tmp/triton_cache

# Run
python load_fp8.py --model_path . --compile --language en
```

First run takes ~40 s for compile warmup. Subsequent inference is near-instant.

For custom audio:

```bash
python load_fp8.py --model_path . --compile --audio your_file.wav --language en
```

## Quantization Details

- **Method:** torchao `Float8DynamicActivationFloat8WeightConfig` (post-training, no calibration needed)
- **Scope:** encoder linear layers only (433 layers); decoder, convolutions, norms, and embeddings remain BF16
- **Storage:** each quantized weight is stored as `{name}.qdata` (`float8_e4m3fn`) + `{name}.scale` (float32) in safetensors
- **Reconstruction:** `load_fp8.py` dequantizes to BF16, then re-quantizes with torchao for native FP8 compute

## Supported Languages

Arabic, Chinese (Mandarin), Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Spanish, Vietnamese

## Citation

Base model by Cohere ([CohereLabs/cohere-transcribe-03-2026](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026)). FP8 quantization and DGX Spark benchmarks by this repository.

## License

Apache 2.0 (same as the base model)
