Commit a37f219
Parent(s): c3cb28d

update README.md

Signed-off-by: aandrusenko <aandrusenko@nvidia.com>

README.md CHANGED
@@ -241,16 +241,23 @@ pipeline_tag: automatic-speech-recognition
 | [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
 |---|---|---|

-Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on transducer architecture (RNN-T) combining both offline and streaming inference (
+Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer (RNN-T) architecture, combining both offline and streaming inference (with a minimum latency of 160ms) in one model. It is trained mostly on the English portion of the Granary dataset [3], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech into the English alphabet, spaces, and apostrophes, with punctuation and capitalization support.
+
+<figure align="center">
+<img src="figures/wer_comparison.png" width="1250" />
+<figcaption>
+Average WER comparison on the HF ASR Leaderboard datasets, including offline and streaming inference at different latency values.
+</figcaption>
+</figure>

 Why Choose nvidia/parakeet-unified-en-0.6b?

-- **One model for both tasks:** You need to utilize only one unified model for both offline and streaming inference with
+- **One model for both tasks:** You need only one unified model for both offline and streaming inference, with a minimum latency of 160ms.
 - **Better accuracy:** The unified model achieves better accuracy on the HF ASR Leaderboard datasets than the previous offline-only and streaming-only transducer models.
 - **Streaming chunk size flexibility:** Lets you choose the optimal streaming latency (chunk + right context) from 2080ms down to 160ms in steps of 80ms.
 - **Punctuation & Capitalization:** Built-in support for punctuation and capitalization in the output text.

-This model consists of a 🦜 Parakeet (FastConformer) encoder (jointly trained in offline and streaming modes) with an RNN-T decoder. It is designed for offline and streaming speech-to-text applications where latency can be
+This model consists of a 🦜 Parakeet (FastConformer) encoder (jointly trained in offline and streaming modes) with an RNN-T decoder. It is designed for offline and streaming speech-to-text applications where latency can be as low as 160ms, such as voice assistants, live captioning, and conversational AI systems. The current inference pipeline supports only buffered streaming (the left context is recomputed for each chunk), which can take longer than cache-aware streaming.

This model is ready for commercial/non-commercial use.
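The latency settings described above (chunk + right context, 160ms to 2080ms in 80ms steps) can be enumerated with a short sketch. The helper name and the interpretation of the 80ms step as one encoder frame after FastConformer's 8x subsampling of 10ms frames are assumptions, not from this card:

```python
# Hypothetical helper: enumerate the streaming latency settings this card
# describes (chunk + right context from 160 ms to 2080 ms in 80 ms steps).
# Assumption: the 80 ms step corresponds to one encoder frame after
# FastConformer's 8x subsampling of 10 ms feature frames.

FRAME_MS = 80  # one encoder frame after subsampling (assumption)

def valid_latencies(min_ms: int = 160, max_ms: int = 2080, step_ms: int = FRAME_MS):
    """Return all streaming latency settings described by the card, in ms."""
    return list(range(min_ms, max_ms + 1, step_ms))

latencies = valid_latencies()
print(len(latencies), latencies[0], latencies[-1])  # 25 settings, 160 to 2080 ms
```

This gives 25 discrete latency choices, so the same checkpoint can be tuned per deployment rather than retrained for a different chunk size.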
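As a sketch of offline use (not part of this commit), loading a NeMo ASR checkpoint and transcribing a file typically looks like the following; the audio path is a placeholder, and the exact return type of `transcribe()` varies by NeMo version:

```python
# Sketch: offline transcription with NVIDIA NeMo (assumes nemo_toolkit[asr]
# is installed; the model name is taken from this card, the API is the
# generic NeMo ASRModel one, not a pipeline specific to this commit).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-unified-en-0.6b"
)

# "audio.wav" is a placeholder; 16 kHz mono WAV is the usual input format.
outputs = asr_model.transcribe(["audio.wav"])

# Depending on the NeMo version, transcribe() returns plain strings or
# Hypothesis objects with a .text attribute.
first = outputs[0]
print(first if isinstance(first, str) else first.text)
```

Streaming inference with a chosen chunk size goes through NeMo's buffered-streaming inference scripts rather than a single call; consult the NeMo documentation for the exact entry point.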