Parakeet-TDT-ExecuTorch-CUDA-Windows
Pre-exported ExecuTorch `.pte` file for Parakeet TDT 0.6B with the CUDA-Windows backend (NVIDIA GPU) in fp32 precision (no quantization).
Fast speech-to-text with word-level timestamps and GPU acceleration on Windows.
For the quantized variant (3.2× smaller, 2.5× faster prefill), see Parakeet-TDT-ExecuTorch-CUDA-Windows-Quantized.
Installation
```bash
git clone https://github.com/pytorch/executorch/ ~/executorch
cd ~/executorch && pip install .
```
Build on Windows (PowerShell), from the root of the ExecuTorch checkout:

```powershell
cmake --workflow --preset llm-release-cuda
Push-Location examples/models/parakeet
cmake --workflow --preset parakeet-cuda
Pop-Location
```
Download
```bash
pip install huggingface_hub
huggingface-cli download younghan-meta/Parakeet-TDT-ExecuTorch-CUDA-Windows --local-dir parakeet_cuda_windows
```
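The same download can be done from Python with `huggingface_hub.snapshot_download`. The minimal sketch below also checks that the three files the runner needs are present; their names are taken from the Run command in this card. The helper names `download` and `missing_artifacts` are hypothetical.

```python
from pathlib import Path

# Runner inputs, named as in the parakeet_runner.exe invocation in this card.
EXPECTED_FILES = ("model.pte", "aoti_cuda_blob.ptd", "tokenizer.model")


def download(local_dir: str = "parakeet_cuda_windows") -> Path:
    """Download the model repo into local_dir and return the local path."""
    # Imported lazily so missing_artifacts works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="younghan-meta/Parakeet-TDT-ExecuTorch-CUDA-Windows",
        local_dir=local_dir,
    )
    return Path(path)


def missing_artifacts(local_dir) -> list[str]:
    """Return the expected runner inputs that are not regular files in local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

After `target = download()`, an empty result from `missing_artifacts(target)` means all runner inputs are in place.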
Run
Windows (PowerShell):
```powershell
.\cmake-out\examples\models\parakeet\Release\parakeet_runner.exe `
  --model_path parakeet_cuda_windows\model.pte `
  --data_path parakeet_cuda_windows\aoti_cuda_blob.ptd `
  --audio_path C:\path\to\audio.wav `
  --tokenizer_path parakeet_cuda_windows\tokenizer.model
```
Optional flags:
- `--timestamps segment`: timestamp granularity, one of `none`, `token`, `word`, `segment`, or `all` (default: `segment`)
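To call the runner from a script, a minimal Python sketch might look like this. The helper names `runner_command` and `transcribe` are hypothetical; the flags, file names, and runner path are the ones shown above, so it assumes you run it from the ExecuTorch checkout root. The runner's output format is not documented in this card, so `transcribe` simply returns stdout as text.

```python
import subprocess
from pathlib import PureWindowsPath

# Granularity values listed for --timestamps in this card.
TIMESTAMP_MODES = {"none", "token", "word", "segment", "all"}

DEFAULT_RUNNER = r".\cmake-out\examples\models\parakeet\Release\parakeet_runner.exe"


def runner_command(audio_path: str,
                   model_dir: str = "parakeet_cuda_windows",
                   runner: str = DEFAULT_RUNNER,
                   timestamps: str = "segment") -> list[str]:
    """Build the argv for parakeet_runner.exe using the documented flags."""
    if timestamps not in TIMESTAMP_MODES:
        raise ValueError(f"unsupported --timestamps value: {timestamps!r}")
    model_dir_path = PureWindowsPath(model_dir)  # join with Windows separators
    return [
        runner,
        "--model_path", str(model_dir_path / "model.pte"),
        "--data_path", str(model_dir_path / "aoti_cuda_blob.ptd"),
        "--audio_path", audio_path,
        "--tokenizer_path", str(model_dir_path / "tokenizer.model"),
        "--timestamps", timestamps,
    ]


def transcribe(audio_path: str, **kwargs) -> str:
    """Run the runner, raise on a non-zero exit code, and return its stdout."""
    result = subprocess.run(runner_command(audio_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the argv as a list avoids shell quoting issues with paths that contain spaces.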
Export Command
```bash
pip install "nemo_toolkit[asr]"
python examples/models/parakeet/export_parakeet_tdt.py \
  --backend cuda-windows \
  --output-dir ./parakeet_cuda_windows
```
The `cuda-windows` export cross-compiles for Windows. Cross-compilation requires `x86_64-w64-mingw32-g++` on `PATH` and `WINDOWS_CUDA_HOME` pointing to the extracted Windows CUDA package. See the Parakeet README for detailed setup steps.
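A quick preflight check of the two prerequisites named above can save a failed cross-compile. This hypothetical helper only verifies that the MinGW compiler resolves on `PATH` and that `WINDOWS_CUDA_HOME` names an existing directory; it does not validate the contents of the CUDA package.

```python
import os
import shutil
from pathlib import Path

MINGW_CXX = "x86_64-w64-mingw32-g++"


def export_prerequisite_problems(env=None) -> list[str]:
    """Return human-readable problems with the cuda-windows export prerequisites."""
    env = os.environ if env is None else env
    problems = []
    # An empty PATH makes shutil.which return None instead of falling back.
    if shutil.which(MINGW_CXX, path=env.get("PATH", "")) is None:
        problems.append(f"{MINGW_CXX} not found on PATH")
    cuda_home = env.get("WINDOWS_CUDA_HOME")
    if not cuda_home:
        problems.append("WINDOWS_CUDA_HOME is not set")
    elif not Path(cuda_home).is_dir():
        problems.append(f"WINDOWS_CUDA_HOME is not a directory: {cuda_home}")
    return problems
```

Calling `export_prerequisite_problems()` with no arguments checks the current process environment; an empty list means both prerequisites were found.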
Benchmark (RTX 5080, ~20 s of audio)
| Metric | Value |
|---|---|
| Prefill throughput | 1,301 tok/s |
| Decode throughput | 1,325 tok/s |
| Model load time | 4.1 s |
| Time to first token | 202 ms |
| Total inference | 270 ms |
| Speed vs. real time | 74× faster than real time |
| Model size | 2,445 MB |
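The 74× figure is consistent with the other rows: about 20 s of audio transcribed in 270 ms of total inference. The conventional real-time factor (processing time divided by audio duration) is the reciprocal, about 0.0135. A small sketch of that arithmetic, assuming exactly 20 s of audio since the card only gives an approximate length:

```python
def realtime_stats(audio_seconds: float, inference_seconds: float) -> tuple[float, float]:
    """Return (speedup over real time, conventional RTF = processing / audio)."""
    if audio_seconds <= 0 or inference_seconds <= 0:
        raise ValueError("durations must be positive")
    speedup = audio_seconds / inference_seconds
    rtf = inference_seconds / audio_seconds
    return speedup, rtf


# Assumed inputs: 20 s of audio, 270 ms total inference (from the table).
speedup, rtf = realtime_stats(20.0, 0.270)
print(f"{speedup:.0f}x real time, RTF = {rtf:.4f}")  # prints "74x real time, RTF = 0.0135"
```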
More Info
Base model: nvidia/parakeet-tdt-0.6b-v3