# parakeet-unified-en-0.6b ONNX
ONNX export of nvidia/parakeet-unified-en-0.6b for CPU/GPU inference without NeMo. Includes fp32 and dynamically quantized int8 variants.
## Files

- `onnx_fp32/`
  - `encoder.onnx` (40 MB graph + 2.3 GB external weights)
  - `encoder.onnx.data`
  - `decoder_joint.onnx` (7.5 KB graph + 34 MB external weights)
  - `decoder_joint.onnx.data`
  - `tokenizer.model`
- `onnx_int8/`
  - `encoder.int8.onnx` (624 MB, self-contained)
  - `decoder_joint.int8.onnx` (8.6 MB, self-contained)
  - `tokenizer.model`
| Variant | Total size | Peak RAM (CPU) |
|---|---|---|
| fp32 | ~2.4 GB | ~2.7 GB |
| int8 | ~633 MB | ~1 GB |
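Both sessions in the Quick start below are pinned to `CPUExecutionProvider`. For GPU inference, a minimal sketch (assuming the `onnxruntime-gpu` build is installed) is to request the CUDA provider with a CPU fallback:

```python
import onnxruntime as ort

# Prefer CUDA, fall back to CPU if the GPU provider is unavailable.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
enc = ort.InferenceSession("onnx_fp32/encoder.onnx", providers=providers)
print(enc.get_providers())  # shows which providers were actually activated
```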
## Quick start

```bash
pip install -r requirements.txt
```
```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm
import soundfile as sf
import torch
import torchaudio

# --- config ---
ENCODER = "onnx_int8/encoder.int8.onnx"  # or onnx_fp32/encoder.onnx
DECODER = "onnx_int8/decoder_joint.int8.onnx"
TOKENIZER = "onnx_int8/tokenizer.model"
SR = 16000
BLANK = 1024

# --- load audio (mono, 16 kHz) ---
audio, sr = sf.read("audio.wav", dtype="float32")
wav = torch.from_numpy(audio).unsqueeze(0)  # (1, samples)

# --- mel features (must match NeMo preprocessor) ---
mel_xform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=512, win_length=400, hop_length=160,
    n_mels=128, window_fn=torch.hann_window, power=2.0,
    norm="slaney", mel_scale="slaney", center=True,
)
mel = torch.log(mel_xform(wav) + 2**-24)
mel = (mel - mel.mean(dim=-1, keepdim=True)) / (mel.std(dim=-1, keepdim=True) + 1e-5)
feats = mel.numpy().astype(np.float32)  # (1, 128, time)

# --- encoder ---
enc = ort.InferenceSession(ENCODER, providers=["CPUExecutionProvider"])
enc_out, enc_len = enc.run(None, {
    "audio_signal": feats,
    "length": np.array([feats.shape[-1]], dtype=np.int64),
})

# --- greedy RNN-T decode ---
dec = ort.InferenceSession(DECODER, providers=["CPUExecutionProvider"])
state1 = np.zeros((2, 1, 640), dtype=np.float32)
state2 = np.zeros((2, 1, 640), dtype=np.float32)
last_token, tokens = BLANK, []
for t in range(int(enc_len[0])):
    enc_t = enc_out[:, :, t:t + 1]
    for _ in range(10):  # cap on symbols emitted per encoder frame
        logits, _, s1, s2 = dec.run(None, {
            "encoder_outputs": enc_t,
            "targets": np.array([[last_token]], dtype=np.int32),
            "target_length": np.array([1], dtype=np.int32),
            "input_states_1": state1, "input_states_2": state2,
        })
        idx = int(np.argmax(logits[0, 0, 0]))
        if idx == BLANK:
            break
        tokens.append(idx)
        last_token = idx
        state1, state2 = s1, s2

sp = spm.SentencePieceProcessor(model_file=TOKENIZER)
print(sp.decode(tokens))
```
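The Quick start assumes `audio.wav` is already mono 16 kHz. If it is not, one option (an illustrative sketch reusing the imports and the `SR` constant above, not something the export itself requires) is to downmix and resample with torchaudio before computing the mel features:

```python
import soundfile as sf
import torch
import torchaudio

audio, sr = sf.read("audio.wav", dtype="float32")
wav = torch.from_numpy(audio)
if wav.ndim == 2:          # (samples, channels) -> mono
    wav = wav.mean(dim=1)
if sr != SR:               # resample to the model's 16 kHz
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=SR)
wav = wav.unsqueeze(0)     # (1, samples), the shape the mel transform expects
```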
## ONNX I/O signatures

### Encoder

| Direction | Name | Shape | Type |
|---|---|---|---|
| input | audio_signal | (batch, 128, time) | float32 |
| input | length | (batch,) | int64 |
| output | outputs | (batch, 1024, time') | float32 |
| output | encoded_lengths | (batch,) | int64 |
### Decoder + Joint

| Direction | Name | Shape | Type |
|---|---|---|---|
| input | encoder_outputs | (batch, 1024, 1) | float32 |
| input | targets | (batch, 1) | int32 |
| input | target_length | (batch,) | int32 |
| input | input_states_1 | (2, batch, 640) | float32 |
| input | input_states_2 | (2, batch, 640) | float32 |
| output | outputs | (batch, 1, 1, 1025) | float32 |
| output | prednet_lengths | (batch,) | int32 |
| output | output_states_1 | (2, batch, 640) | float32 |
| output | output_states_2 | (2, batch, 640) | float32 |
Vocab size is 1024 (SentencePiece) + 1 blank = 1025. Predictor LSTM has 2 layers with hidden size 640.
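The tables above can be cross-checked directly against the graphs; a small sketch using onnxruntime's session metadata:

```python
import onnxruntime as ort

# Print every input/output name, symbolic shape, and element type.
for path in ("onnx_int8/encoder.int8.onnx", "onnx_int8/decoder_joint.int8.onnx"):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(path)
    for i in sess.get_inputs():
        print("  input :", i.name, i.shape, i.type)
    for o in sess.get_outputs():
        print("  output:", o.name, o.shape, o.type)
```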
## Re-exporting from source

To regenerate the ONNX files from the original `.nemo` checkpoint:

```bash
# install NeMo from main (required for att_chunk_context_size support)
pip install "nemo_toolkit[asr] @ git+https://github.com/NVIDIA-NeMo/NeMo.git@main" huggingface_hub

# downloads the .nemo from HF, exports fp32 + int8
python onnx_export.py
```
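`onnx_export.py` handles both variants end to end. For reference, the int8 files are dynamic quantizations of the fp32 graphs; a rough sketch of that step using onnxruntime's standard tooling (illustrative paths, not the exact script) looks like:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only dynamic int8 quantization of the exported fp32 graphs.
quantize_dynamic("onnx_fp32/encoder.onnx", "onnx_int8/encoder.int8.onnx", weight_type=QuantType.QInt8)
quantize_dynamic("onnx_fp32/decoder_joint.onnx", "onnx_int8/decoder_joint.int8.onnx", weight_type=QuantType.QInt8)
```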
## License

Same as the original model: NVIDIA Open Model License.