parakeet-tdt-0.6b-v3-basque — ONNX Export

ONNX export of itzune/parakeet-tdt-0.6b-v3-basque, a Basque ASR model fine-tuned from nvidia/parakeet-tdt-0.6b-v3 (FastConformer-TDT, 600M parameters).

For NeMo-native inference, use the .nemo model at itzune/parakeet-tdt-0.6b-v3-basque instead.


Evaluation results

Test split   WER (%)   Description
test_cv        6.92    Mozilla Common Voice Basque
test_parl      4.36    Basque Parliament recordings
test_oslr     14.52    OpenSLR Basque corpus

All splits come from the asierhv/composite_corpus_eu_v2.1 corpus; scores are self-reported.

Baseline (no fine-tuning): >100% WER on all splits — Basque is not a supported language in the original parakeet-tdt-0.6b-v3.


Repository contents

This repo contains the ONNX export of the model. Because of the RNNT/TDT architecture and the ONNX protobuf 2 GB size limit, the export consists of multiple files:

encoder-parakeet-tdt-0.6b-v3-basque.onnx   # Encoder graph (~2 MB, references external data)
decoder_joint-parakeet-tdt-0.6b-v3-basque.onnx  # Decoder+Joint graph (~70 MB, references external data)
onnx__MatMul_*   # External tensor weight files (~2.4 GB total, ~291 files)
onnx_export_files.txt   # Manifest listing all files

All files must be present together — the .onnx graphs reference the external weight files by relative path. You cannot use the .onnx files alone.
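Since a missing external weight file only surfaces as a load error later, it can help to verify the download against the manifest first. A minimal sketch, assuming onnx_export_files.txt lists one relative path per line:

```python
from pathlib import Path

def check_export_files(manifest="onnx_export_files.txt", root="."):
    """Return the files listed in the manifest that are missing on disk.

    Assumes the manifest contains one relative file path per line.
    """
    root = Path(root)
    names = [ln.strip() for ln in (root / manifest).read_text().splitlines()
             if ln.strip()]
    return [n for n in names if not (root / n).exists()]
```

Run it from the repo directory; an empty list means the export is complete.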

Why two ONNX graphs?

The RNNT/TDT architecture has two distinct networks with incompatible execution patterns:

  • Encoder: runs once over the full audio spectrogram to produce hidden states.
  • Decoder + Joint: runs iteratively, one step per output token, combining encoder hidden states with the previous predicted token.

These cannot be merged into a single ONNX graph. Any RNNT-capable ONNX inference runtime (NVIDIA Triton, ONNX Runtime with custom RNNT decoding) expects this two-graph layout.
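The loop such a runtime implements can be sketched as follows. Here step_fn is a stand-in for one call to the decoder+joint graph (the real tensor names and state layout depend on the export), and a real TDT decoder additionally reads a predicted duration from the joint to skip encoder frames, which is omitted in this plain-RNNT sketch:

```python
import numpy as np

def greedy_rnnt_decode(encoder_out, step_fn, blank_id, max_symbols=10):
    """Greedy RNNT decoding over encoder frames.

    encoder_out: (T, D) hidden states from the encoder graph.
    step_fn(frame, last_token, state) -> (logits, new_state)
        stands in for one decoder+joint call.
    """
    tokens, state, last = [], None, blank_id
    for frame in encoder_out:
        for _ in range(max_symbols):          # cap symbols emitted per frame
            logits, new_state = step_fn(frame, last, state)
            k = int(np.argmax(logits))
            if k == blank_id:                 # blank: advance to the next frame
                break
            tokens.append(k)                  # non-blank: keep token, update
            last, state = k, new_state        # the prediction-network state
    return tokens
```

The inner loop is why the decoder+joint must be a separate graph: its call count depends on the tokens produced so far, not on the audio length.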


Usage with ONNX Runtime

Note: Running RNNT/TDT decoding manually with ONNX Runtime requires implementing the token-by-token beam/greedy search loop yourself. For most use cases, using the NeMo model directly is simpler.

1. Clone the repo and install dependencies

git lfs install   # the ~2.4 GB of weight files are stored with Git LFS
git clone https://huggingface.co/itzune/parakeet-tdt-0.6b-v3-basque-ONNX
cd parakeet-tdt-0.6b-v3-basque-ONNX
pip install onnxruntime-gpu numpy soundfile

2. Load the encoder and decoder sessions

import onnxruntime as ort

encoder_session = ort.InferenceSession(
    "encoder-parakeet-tdt-0.6b-v3-basque.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

decoder_session = ort.InferenceSession(
    "decoder_joint-parakeet-tdt-0.6b-v3-basque.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

The external weight files must be in the same directory as the .onnx files when loading.
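Before wiring up the decoding loop, it is worth inspecting the exact input/output tensor names of this export, since they vary between ONNX exports. A small helper using ONNX Runtime's session introspection API:

```python
def describe_io(session):
    """List (name, shape) for a session's inputs and outputs.

    Works on any onnxruntime.InferenceSession; useful for discovering
    the tensor names this particular export uses.
    """
    ins = [(i.name, i.shape) for i in session.get_inputs()]
    outs = [(o.name, o.shape) for o in session.get_outputs()]
    return ins, outs

# e.g. print(describe_io(encoder_session)) after loading the sessions above
```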

3. Preprocessing

The model expects 80-dimensional log-mel spectrograms, matching NeMo's AudioToMelSpectrogramPreprocessor:

  • Sample rate: 16 kHz
  • Window size: 25 ms (400 samples)
  • Window stride: 10 ms (160 samples)
  • Mel bins: 80
  • Normalization: per-feature mean/std (as computed by NeMo preprocessor)
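The parameters above can be approximated with a numpy-only sketch. This is an approximation of NeMo's preprocessor, not a drop-in replacement: the FFT size (512) and HTK-style mel scale are assumptions, and NeMo options such as dither and preemphasis are omitted.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=512, win_length=400,
                        hop_length=160, n_mels=80, eps=1e-5):
    """Approximate NeMo-style 80-bin log-mel features (25 ms / 10 ms)."""
    # Frame the signal with a Hann window.
    n_frames = 1 + (len(audio) - win_length) // hop_length
    window = np.hanning(win_length)
    frames = np.stack([audio[i * hop_length:i * hop_length + win_length] * window
                       for i in range(n_frames)])
    # Power spectrum via real FFT, zero-padded to n_fft.
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular mel filterbank (HTK mel scale), 0 .. sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(spec @ fbank.T + eps)           # (frames, n_mels)
    # Per-feature mean/std normalization over time.
    logmel = (logmel - logmel.mean(axis=0)) / (logmel.std(axis=0) + eps)
    return logmel.T.astype(np.float32)              # (n_mels, frames)

feats = log_mel_spectrogram(np.random.randn(16000))
print(feats.shape)  # → (80, 98) for one second of 16 kHz audio
```

For exact parity with the checkpoint's training-time features, extract them from NeMo's AudioToMelSpectrogramPreprocessor instead.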

Recommended: use the NeMo model

For production inference without reimplementing the RNNT decoding loop, use the .nemo model with the NeMo toolkit (the nemo_toolkit Python package):

pip install "nemo_toolkit[asr]"

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("itzune/parakeet-tdt-0.6b-v3-basque")
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])

See itzune/parakeet-tdt-0.6b-v3-basque for full usage instructions, training details, and dataset information.


Model details

Property          Value
Base model        nvidia/parakeet-tdt-0.6b-v3
Architecture      FastConformer-TDT (Token-and-Duration Transducer)
Parameters        ~600M
Language          Basque (eu)
Sample rate       16 kHz
Training data     asierhv/composite_corpus_eu_v2.1 (~676 h)
Fine-tuned with   NVIDIA NeMo 2.7
NeMo model        itzune/parakeet-tdt-0.6b-v3-basque
Training repo     xezpeleta/parakeet-tdt-0.6b-v3-basque

License

This model is released under CC BY 4.0, inheriting the license of the base model weights.
