Timeline for multilingual version?

#2
by fosple - opened

Congrats on the model release! Super nice to see a dual-use model for offline transcription and streaming.

Do you already have a timeline for a multilingual version (especially German)?

Hi @fosple ,
Thanks so much! We’re really glad the dual-use flexibility is hitting the mark.

Yes, we plan to expand this unified training approach to a multilingual setup. However, I can't provide a reliable timeline just yet.

What streaming latency are you targeting for German? Understanding your requirements will help us better understand the demand for the unified model and prioritize our development.

Hey @aandrusenko ,

Excited to hear that.

I'm targeting a latency of 320ms.

For my use case, prioritizing UI responsiveness (low perceived latency) is more critical than immediate transcription stability. I currently achieve this by using a 4000ms left context, a 320ms chunk size, and a 2000ms right context (look-ahead) with another model that supports German.

The logic is to treat the output generated from the 2000ms look-ahead window as speculative/volatile. I display these tokens immediately to provide instant UI feedback to the user. Once that audio data transitions from the look-ahead window into the stable 320ms processing chunk, I replace the volatile tokens with the finalized, more accurate version. This allows for a very fast user experience while still leveraging the accuracy of a large context window.
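The display logic described above can be sketched roughly as follows. This is a minimal illustration of the speculative/volatile idea, not a real model API: `SpeculativeDisplay` and its methods are hypothetical names, and the actual stable/look-ahead tokens would come from whatever streaming ASR call you use.

```python
class SpeculativeDisplay:
    """Sketch of the UI logic described above (hypothetical, not a model API):
    look-ahead tokens are shown immediately as speculative text, then replaced
    by the finalized tokens once their audio moves into the stable chunk."""

    def __init__(self):
        self.finalized: list[str] = []  # tokens from the stable chunk
        self.volatile: list[str] = []   # speculative tokens from the look-ahead

    def update(self, stable_tokens: list[str], lookahead_tokens: list[str]):
        # Tokens whose audio has left the look-ahead window become final and
        # overwrite the speculative tokens previously shown for that audio.
        self.finalized.extend(stable_tokens)
        self.volatile = lookahead_tokens

    def shown_text(self) -> str:
        # The UI always renders finalized text plus the current speculation.
        return " ".join(self.finalized + self.volatile)


d = SpeculativeDisplay()
d.update(["hello"], ["wold"])            # "wold" is speculative and may be wrong
print(d.shown_text())                    # hello wold
d.update(["world"], ["again"])           # finalized "world" replaces "wold"
print(d.shown_text())                    # hello world again
```

The user sees text appear with the look-ahead's speed, while accuracy converges to that of the larger context once each chunk is finalized.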

Hi @fosple , thank you for sharing!

Just to clarify -- by latency, do you mean only the 320ms chunk size, not including the right context (look-ahead)? If so, that sounds more like model responsiveness: you still need to wait for 320ms + 2000ms of audio from the user before inference can start.

In the case of the Unified model, we define latency as chunk size + right context. For a latency of 320ms, we use an 80ms chunk size and a 240ms right context (the left context is fixed at 5600ms, but can be reduced to 4000-3000ms with no big degradation). You only need to wait for 320ms of audio from the user to start inference. The model responsiveness will be 80ms, equal to the chunk size.

You can also use a chunk size of 320ms with a 720ms right context to improve accuracy and decoding speed. The responsiveness will still be 320ms.
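The latency accounting in the two replies above can be summarized in a few lines. This is just arithmetic over the numbers quoted in the thread; `latency_ms` is an illustrative helper, not part of any model API.

```python
def latency_ms(chunk_ms: int, right_ctx_ms: int) -> int:
    """Latency as defined above: how much audio you must wait for
    before inference on a frame can start. Responsiveness (how often
    the displayed text updates) equals the chunk size alone."""
    return chunk_ms + right_ctx_ms


# Unified model config from the reply: 80ms chunk + 240ms look-ahead.
print(latency_ms(80, 240))     # 320  (responsiveness: 80ms)

# Alternative config: 320ms chunk + 720ms look-ahead.
print(latency_ms(320, 720))    # 1040 (responsiveness: 320ms)

# The setup @fosple described: 320ms chunk + 2000ms look-ahead.
print(latency_ms(320, 2000))   # 2320 (responsiveness: 320ms)
```

So under this definition, the 320ms-chunk/2000ms-look-ahead setup has a 2320ms latency, even though the speculative display makes it feel much faster.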

BTW, do you need only German language support for the Unified model? If so, fine-tuning a multilingual Unified model on a single language improves streaming robustness for that language, according to our initial experiments.

In my own tests, Nemotron is still better for streaming... However, it’s only available in English, and it would be really great if the streaming models were available in other languages as well, because they are really fast and accurate... :-) At least the languages spoken in Europe, including Russian and Norwegian (by the way, the offline TDT model doesn’t support Norwegian; that would’ve been nice). Of course, I wouldn’t say no to languages like Chinese and Arabic, some of the most widely spoken languages in the world haha :-P But really nice models!! ty

NVIDIA org

Hi @altunenes , thank you for the feedback!

The latest nemotron-speech-streaming-en-0.6b checkpoint was trained on 530K hours of data, which is twice as much as the dataset we used for the Unified model (~250K hours). Nemotron's performance can be better in some data domains.

Did you compare the Unified model with the previous Nemotron streaming checkpoint (https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b/tree/nemotron-speech-streaming-jan2026)? The training data here should be the same as for the Unified one.

We can probably retrain the Unified model on the same 530K data, but the main focus now is on the multilingual model.

Thank you. tbh, I’ve been using the new Nemotron model for the past month and am no longer using the old one (March 12, 2026 version with ONNX: https://huggingface.co/altunenes/parakeet-rs/tree/main/nemotron-speech-streaming-en-0.6b).
So that probably explains the performance difference in my tests. Thank you for the clarification, can't wait for future models. I really like the parakeet models!
