Parakeet-TDT-ExecuTorch-CUDA-Windows

Pre-exported ExecuTorch .pte file for Parakeet TDT 0.6B with CUDA-Windows backend (NVIDIA GPU) in fp32 precision (no quantization).

Fast speech-to-text with word-level timestamps and GPU acceleration on Windows.

For the quantized variant (3.2× smaller, 2.5× faster prefill), see Parakeet-TDT-ExecuTorch-CUDA-Windows-Quantized.

Installation

git clone https://github.com/pytorch/executorch/ ~/executorch
cd ~/executorch && pip install .

Build on Windows (PowerShell):

cmake --workflow --preset llm-release-cuda
Push-Location examples/models/parakeet
cmake --workflow --preset parakeet-cuda
Pop-Location

Download

pip install huggingface_hub
huggingface-cli download younghan-meta/Parakeet-TDT-ExecuTorch-CUDA-Windows --local-dir parakeet_cuda_windows

Run

Windows (PowerShell):

.\cmake-out\examples\models\parakeet\Release\parakeet_runner.exe `
    --model_path parakeet_cuda_windows\model.pte `
    --data_path parakeet_cuda_windows\aoti_cuda_blob.ptd `
    --audio_path C:\path\to\audio.wav `
    --tokenizer_path parakeet_cuda_windows\tokenizer.model

Optional flags:

  • --timestamps segment — timestamp granularity: none|token|word|segment|all (default: segment)
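For scripting, the invocation above can be wrapped in a small helper that assembles the runner's argument list. This is only a sketch: the function name `build_runner_command` and the example audio path are illustrative, not part of ExecuTorch; the flags and file names mirror the PowerShell command above.

```python
# Hypothetical helper that builds the parakeet_runner.exe argument list
# shown in the Run section. Paths follow the layout from the Download step.
from pathlib import PureWindowsPath

def build_runner_command(audio_path, model_dir="parakeet_cuda_windows",
                         timestamps="segment"):
    """Assemble the argument list for parakeet_runner.exe."""
    d = PureWindowsPath(model_dir)
    exe = PureWindowsPath(
        "cmake-out/examples/models/parakeet/Release/parakeet_runner.exe")
    return [
        str(exe),
        "--model_path", str(d / "model.pte"),
        "--data_path", str(d / "aoti_cuda_blob.ptd"),
        "--audio_path", str(audio_path),
        "--tokenizer_path", str(d / "tokenizer.model"),
        "--timestamps", timestamps,
    ]

# Example audio path is a placeholder, not a file shipped with this repo.
cmd = build_runner_command(r"C:\clips\meeting.wav", timestamps="word")
print(cmd)
```

Pass the resulting list to `subprocess.run(cmd)` to launch the runner.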

Export Command

pip install "nemo_toolkit[asr]"

python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda-windows \
    --output-dir ./parakeet_cuda_windows

Cross-compilation requires x86_64-w64-mingw32-g++ on PATH and WINDOWS_CUDA_HOME pointing to the extracted Windows CUDA package. See the Parakeet README for detailed setup steps.
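The two prerequisites above can be sanity-checked before running the export script. The check below is a sketch, assuming only what the paragraph states (a `x86_64-w64-mingw32-g++` binary on `PATH` and a `WINDOWS_CUDA_HOME` directory); the function name is hypothetical and not part of the export tooling.

```python
# Pre-flight check for the cross-compilation prerequisites named above.
# Illustrative only; not part of export_parakeet_tdt.py.
import os
import shutil

def check_cross_compile_env(env=os.environ):
    problems = []
    # The MinGW cross-compiler must be discoverable on PATH.
    if shutil.which("x86_64-w64-mingw32-g++", path=env.get("PATH", "")) is None:
        problems.append("x86_64-w64-mingw32-g++ not found on PATH")
    # WINDOWS_CUDA_HOME must point at the extracted Windows CUDA package.
    cuda_home = env.get("WINDOWS_CUDA_HOME")
    if not cuda_home or not os.path.isdir(cuda_home):
        problems.append("WINDOWS_CUDA_HOME is unset or not a directory")
    return problems

# With an empty environment, both checks report a problem.
print(check_cross_compile_env({}))
```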

Benchmark (RTX 5080, ~20s audio)

| Metric | Value |
|---|---|
| Prefill throughput | 1,301 tok/s |
| Decode throughput | 1,325 tok/s |
| Model load time | 4.1 s |
| Time to first token | 202 ms |
| Total inference | 270 ms |
| Real-time factor | 74× |
| Model size | 2,445 MB |
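The real-time factor follows directly from the other numbers: roughly 20 s of audio transcribed in 270 ms of total inference. A quick check of that arithmetic:

```python
# Reproduce the real-time factor from the benchmark numbers above:
# RTF = audio duration / total inference time.
audio_seconds = 20.0        # "~20s audio"
total_inference_s = 0.270   # 270 ms total inference
rtf = audio_seconds / total_inference_s
print(round(rtf))  # 74, matching the reported 74x real-time
```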
