aandrusenko committed
Commit a37f219 · Parent: c3cb28d

update README.md

Signed-off-by: aandrusenko <aandrusenko@nvidia.com>

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
```diff
@@ -241,16 +241,23 @@ pipeline_tag: automatic-speech-recognition
 | [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
 |---|---|---|
 
-Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on transducer architecture (RNN-T) combining both offline and streaming inference (up to 160ms latency) in one model. It is trained mostly on the English part of the Granary dataset [3], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech to English alphabet, spaces, and apostrophes with punctuation and captalization support.
+Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer architecture (RNN-T) that combines offline and streaming inference (with a minimum latency of 160ms) in one model. It is trained mostly on the English portion of the Granary dataset [3], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech to the English alphabet, spaces, and apostrophes, with punctuation and capitalization support.
+
+<figure align="center">
+<img src="figures/wer_comparison.png" width="1250" />
+<figcaption>
+Average WER comparison on the HF ASR Leaderboard datasets, covering offline inference and streaming inference at different latency values.
+</figcaption>
+</figure>
 
 Why Choose nvidia/parakeet-unified-en-0.6b?
 
-- **One model for both tasks:** You need to utilize only one unified model for both offline and streaming inference with latency up to 160ms.
+- **One model for both tasks:** A single unified model handles both offline and streaming inference, with a minimum latency of 160ms.
 - **Better accuracy performance:** The unified model achieves better accuracy on the HF ASR Leaderboard datasets compared to the previous transducer-based offline-only and streaming-only models.
 - **Streaming chunk size flexibility:** Enables you to choose the optimal streaming latency (chunk + right context) from 160ms to 2080ms in 80ms steps.
 - **Punctuation & Capitalization:** Built-in support for punctuation and capitalization in the output text.
 
-This model consists of a 🦜 Parakeet (FastConformer) encoder (jointly trained in offline and streaming modes) with an RNN-T decoder. It is designed for offline and streaming speech-to-text applications where latency can be up to 160ms, such as voice assistants, live captioning, and conversational AI systems. The current inference pipeline supports only buffered streaming (left context is recomputed for each chunk) that can be longer than cache-aware streaming.
+This model consists of a 🦜 Parakeet (FastConformer) encoder (jointly trained in offline and streaming modes) with an RNN-T decoder. It is designed for offline and streaming speech-to-text applications where latency can be as low as 160ms, such as voice assistants, live captioning, and conversational AI systems. The current inference pipeline supports only buffered streaming (left context is recomputed for each chunk), which can be slower than cache-aware streaming.
 
 This model is ready for commercial/non-commercial use.
```
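The streaming chunk-size flexibility described above spans latencies between 160ms and 2080ms in 80ms steps (chunk + right context). A minimal sketch of that range as a quick sanity check; the helper name `selectable_latencies` is illustrative, not part of the model's or NeMo's API:

```python
# Illustrative only: enumerate the streaming latency values the model card
# describes -- chunk + right context from 160 ms up to 2080 ms in 80 ms steps.
MIN_LATENCY_MS = 160
MAX_LATENCY_MS = 2080
STEP_MS = 80

def selectable_latencies():
    """Return every selectable streaming latency, in milliseconds."""
    return list(range(MIN_LATENCY_MS, MAX_LATENCY_MS + 1, STEP_MS))

latencies = selectable_latencies()
print(latencies[0], latencies[-1], len(latencies))  # 160 2080 25
```

Any latency you configure for buffered streaming would be drawn from this grid; values outside it fall back to the nearest supported step.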