Parakeet TDT 0.6B v3 β€” GGUF (ggml-quantised)

GGUF / ggml conversions of nvidia/parakeet-tdt-0.6b-v3 for use with the parakeet-main CLI from CrispStrobe/CrispASR@parakeet.

Parakeet TDT 0.6B v3 is NVIDIA's 600 M-parameter multilingual ASR model:

  • 25 European languages with automatic language detection (no prompt prefix needed)
  • Built-in word-level timestamps from the TDT (Token-and-Duration Transducer) decoder β€” no separate CTC alignment model required
  • 6.34 % avg WER on the HuggingFace Open ASR Leaderboard
  • CC-BY-4.0 licence (friendlier than most ASR models)

This repo provides four quantisations, all converted from the same .nemo checkpoint via the streaming convert-parakeet-to-gguf.py script and quantised with cohere-quantize.

Files

File Size Notes
parakeet-tdt-0.6b-v3.gguf 1.26 GB F16, full precision
parakeet-tdt-0.6b-v3-q8_0.gguf 711 MB Q8_0, near-lossless
parakeet-tdt-0.6b-v3-q5_0.gguf 516 MB Q5_0
parakeet-tdt-0.6b-v3-q4_k.gguf 467 MB Q4_K β€” recommended default

All quantisations produce identical text on samples/jfk.wav:

And so my fellow Americans. Ask not what your country can do for you. Ask what you can do for your country.

Quick Start

# 1. Build the runtime
git clone -b parakeet https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target parakeet-main

# 2. Download a quantisation
huggingface-cli download cstr/parakeet-tdt-0.6b-v3-GGUF \
    parakeet-tdt-0.6b-v3-q4_k.gguf --local-dir .

# 3. Transcribe
./build/bin/parakeet-main \
    -m parakeet-tdt-0.6b-v3-q4_k.gguf \
    -f your-audio.wav -t 8

Word-level timestamps for free

Pass -v to dump per-token timestamps from the TDT duration head. Each token spans one or more encoder frames; one frame = 80 ms. No separate alignment model is required.

$ ./build/bin/parakeet-main -m parakeet-tdt-0.6b-v3-q4_k.gguf -f samples/jfk.wav -t 8 -v
  [    0.32s β†’     0.64s]  ' And'
  [    0.64s β†’     0.88s]  ' so'
  [    1.04s β†’     1.28s]  ' my'
  [    1.28s β†’     1.76s]  ' fellow'        ← f + ell + ow grouped
  [    1.76s β†’     2.56s]  ' Americans'
  [    2.96s β†’     3.28s]  '.'
  [    3.28s β†’     3.84s]  ' Ask'
  [    4.08s β†’     4.40s]  ' not'
  [    5.28s β†’     5.92s]  ' what your'
  ...

This is roughly 10Γ— tighter than the cross-attention DTW path used for Cohere Transcribe word timestamps (~360 ms MAE), and comparable to running a separate wav2vec2 + CTC forced alignment model β€” but at zero extra cost.

Model architecture

Component Details
Encoder 24-layer FastConformer, d=1024, 8 heads, head_dim=128, FFN=4096, conv kernel=9
Subsampling Conv2d dw_striding stack, 8Γ— temporal (50 β†’ 12.5 fps)
Predictor 2-layer LSTM, hidden 640, embed 8193 Γ— 640
Joint head enc(1024 β†’ 640) + pred(640 β†’ 640) β†’ ReLU β†’ linear(640 β†’ 8198)
Vocab 8192 SentencePiece tokens (multilingual)
Audio 16 kHz mono, 128 mel bins, n_fft=512, hop=160, win=400
Parameters ~600 M

The mel filterbank and Hann window are baked directly into the GGUF (preprocessor.fb and preprocessor.window from the original .nemo checkpoint), so there is no recomputation at runtime. BatchNorm in the convolution module is folded into the depthwise conv weights at load time.

How this was made

  1. The .nemo checkpoint was unpacked, NeMo state-dict keys were remapped to ggml-friendly names, and weights were written to GGUF F16 (matmul tensors) + F32 (norms / biases / mel filterbank). A synthetic zero conv.dw.bias is added per encoder layer so the runtime BN-fold pass has somewhere to write the absorbed bias shift.
  2. Quantised variants are produced by cohere-quantize (the same llama.cpp-style quantiser used for the Cohere Transcribe GGUFs).
  3. Inference is implemented in src/parakeet.{h,cpp}: the FastConformer encoder runs as a single ggml graph (BN folded out), the LSTM predictor and joint head run as manual F32 CPU loops, and the TDT greedy decode loop alternates "advance encoder frame" / "emit token + advance predictor" using the duration head's argmax.

Supported languages

bg cs da de el en es et fi fr hr hu it lt lv mt nl pl pt ro ru sk sl sv uk

The model auto-detects the language at inference time. No prompt prefix or -l flag is needed.

Attribution

  • Original model: nvidia/parakeet-tdt-0.6b-v3 (CC-BY-4.0). NVIDIA NeMo team.
  • GGUF conversion + ggml runtime: CrispStrobe/CrispASR@parakeet β€” community contribution. Encoder graph borrows the dw_striding subsampling + Conformer block patterns from the same fork's cohere.cpp.
  • Reference inference: istupakov/onnx-asr was the cross-check for the joint head + TDT greedy loop.

Related

License

CC-BY-4.0, inherited from the base model. Use of these GGUF files must comply with the CC-BY-4.0 license including attribution.

Downloads last month
800
GGUF
Model size
0.6B params
Architecture
parakeet
Hardware compatibility
Log In to add your hardware

5-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for cstr/parakeet-tdt-0.6b-v3-GGUF

Quantized
(23)
this model