---
license: cc-by-4.0
language:
- en
---

# 🦜 Parakeet-unified-en-0.6b: Unified ASR model for offline and streaming inference

| [Model architecture](#model-architecture)
| [Model size](#model-architecture)
| [Language](#datasets)

Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer (RNN-T) architecture that combines offline and streaming inference (down to 160 ms latency) in a single model. It is trained on the ASRSet dataset, which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech into the English alphabet, spaces, and apostrophes, with punctuation and capitalization support.

**Why choose nvidia/parakeet-unified-en-0.6b?**

- **One model for both tasks:** A single unified model handles both offline and streaming inference, with latency down to 160 ms.
- **Better accuracy:** The unified model achieves better accuracy on the HF ASR Leaderboard datasets than the previous offline-only and streaming-only transducer models.
- **Streaming chunk size flexibility:** Lets you choose the optimal streaming latency (chunk + right context) from 2,080 ms down to 160 ms in 80 ms steps.
- **Punctuation & capitalization:** Built-in support for punctuation and capitalization in the output text.
This model consists of a 🦜 Parakeet (FastConformer) encoder, jointly trained in offline and streaming modes, with an RNN-T decoder. It is designed for offline and streaming speech-to-text applications where latency can be as low as 160 ms, such as voice assistants, live captioning, and conversational AI systems. The current inference pipeline supports only buffered streaming (the left context is recomputed for each chunk), which can take longer than cache-aware streaming. We plan to add cache-aware streaming support in the future.

This model is ready for commercial and non-commercial use.

## License/Terms of Use:

Governing Terms: Use of the model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

## Deployment Geography:

Global

## Use Case:

This model is for transcription of English audio in offline and streaming modes.

## Release Date:

- Hugging Face 04/07/2026 via [https://huggingface.co/nvidia/parakeet-unified-en-0.6b](https://huggingface.co/nvidia/parakeet-unified-en-0.6b)
## Model Architecture

**Architecture Type:** Unified-FastConformer-RNNT

The model is based on the FastConformer encoder architecture [1] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In the offline mode we used standard offline training with full-context self-attention and non-causal convolutions. In the streaming mode we applied chunked self-attention masks (including left, middle/chunk, and right context) together with Dynamic Chunk Convolutions inside each FastConformer layer [2] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All model parameters (encoder, predictor, and joint networks) are shared between the offline and streaming modes, including the initial 8x subsampling with non-causal convolutions.

A paper with the details of the model architecture and training will be released soon.

**Network Architecture:**

- Encoder: Unified FastConformer with 24 layers
- Decoder: RNNT (Recurrent Neural Network Transducer)
- Parameters: 600M
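The chunked self-attention pattern described above can be illustrated with a toy mask builder. This is a simplified sketch in plain Python, not the NeMo implementation; it ignores caching and the convolution masking:

```python
def chunked_attention_mask(n_frames, chunk, left, right):
    """Boolean mask: mask[q][k] is True if query frame q may attend to key frame k.

    Frames are grouped into fixed-size chunks; every frame in a chunk shares
    the same window: `left` frames before the chunk, the chunk itself, and
    `right` frames after it.
    """
    mask = [[False] * n_frames for _ in range(n_frames)]
    for q in range(n_frames):
        start = (q // chunk) * chunk  # first frame of q's chunk
        end = start + chunk           # one past the last frame of q's chunk
        for k in range(max(0, start - left), min(n_frames, end + right)):
            mask[q][k] = True
    return mask

# Frame 2 (start of the second chunk of size 2) sees left context 0-1,
# its own chunk 2-3, and one right-context frame 4 — but not frame 5.
mask = chunked_attention_mask(n_frames=6, chunk=2, left=2, right=1)
```

In offline mode the same model instead uses full-context self-attention (every entry True); unified training alternates between the two patterns.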
## NVIDIA NeMo

The model was developed with the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) [4].

## How to Use this Model

For now, we provide only inference support for the unified model. We will release the unified training pipeline soon.

### Loading the Model

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-unified-en-0.6b")
```

### Offline Inference

```python
output = asr_model.transcribe([wav_file_path])
print(output[0].text)
```
### Streaming Inference

For streaming inference, you can use the stateful chunked RNN-T decoding script from NeMo: [examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py)

```bash
cd NeMo
# left_context_secs: left context in seconds (5.6s by default)
# chunk_secs: chunk size in seconds (0.56s by default)
# right_context_secs: right context in seconds (0.56s by default)
# att_context_size_as_chunk=true enables chunked self-attention masks
python examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py \
    model_path=<model_path> \
    dataset_manifest=<dataset_manifest> \
    output_filename=<output_json_file> \
    left_context_secs=<left_context_secs> \
    chunk_secs=<chunk_secs> \
    right_context_secs=<right_context_secs> \
    att_context_size_as_chunk=true \
    batch_size=<batch_size>
```
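The `dataset_manifest` argument expects a NeMo-style manifest: a JSON-lines file where each line is an object with `audio_filepath`, `duration`, and `text` fields. A minimal sketch for writing one (the file paths and durations below are placeholders):

```python
import json
import os
import tempfile

def write_manifest(entries, path):
    """Write a NeMo-style manifest: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

# Placeholder paths and durations; `text` may be left empty when no
# reference transcript is available.
entries = [
    {"audio_filepath": "/path/to/audio1.wav", "duration": 12.3, "text": ""},
    {"audio_filepath": "/path/to/audio2.wav", "duration": 7.8, "text": ""},
]
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.json")
write_manifest(entries, manifest_path)
```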
You can also run streaming inference through the pipeline method, which uses the [NeMo/examples/asr/conf/asr_streaming_inference/buffered_rnnt.yaml](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/conf/asr_streaming_inference/buffered_rnnt.yaml) configuration file to build end-to-end workflows with punctuation and capitalization (PnC), inverse text normalization (ITN), and translation support.

```python
from nemo.collections.asr.inference.factory.pipeline_builder import PipelineBuilder
from omegaconf import OmegaConf

# Path to the buffered RNN-T config file downloaded from the link above
cfg_path = 'buffered_rnnt.yaml'
cfg = OmegaConf.load(cfg_path)

# Paths of all the audio files to run inference on
audios = ['/path/to/your/audio.wav']

# Create the pipeline object and run inference
pipeline = PipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audios)

# Print the output
for entry in output:
    print(entry['text'])
```
---

### Setting up Streaming Configuration

Latency is defined as the sum of the chunk size (middle part) and the right context.
For the left context we use 5.6 s by default (the value used during model training), but you can tune it for a better accuracy/speed trade-off.

We recommend the following context parameters for different latencies:

| Left, s | Chunk, s | Right, s | Latency (C+R), s |
| :---: | :---: | :---: | :---: |
| 5.6 | 1.04 | 1.04 | 2.08 |
| 5.6 | 0.56 | 0.56 | 1.12 |
| 5.6 | 0.16 | 0.40 | 0.56 |
| 5.6 | 0.08 | 0.24 | 0.32 |
| 5.6 | 0.08 | 0.16 | 0.24 |
| 5.6 | 0.08 | 0.08 | 0.16 |
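Since the chunk and right context must align to the model's 80 ms frame step (8x subsampling of 10 ms features), a small helper can validate a configuration and compute its latency. This is an illustrative sketch, not part of NeMo:

```python
STEP_MS = 80  # the model's encoder frame step: 8x subsampling of 10 ms features

def streaming_latency_ms(chunk_ms: int, right_context_ms: int) -> int:
    """Latency = chunk size + right context; both must be multiples of 80 ms."""
    for value in (chunk_ms, right_context_ms):
        if value % STEP_MS != 0:
            raise ValueError(f"{value} ms is not a multiple of {STEP_MS} ms")
    return chunk_ms + right_context_ms

# Rows of the table above, in milliseconds:
assert streaming_latency_ms(560, 560) == 1120
assert streaming_latency_ms(160, 400) == 560
```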
### Input

- Input Type(s): Audio
- Input Format(s): wav
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Maximum length in seconds depends on GPU memory; no pre-processing needed; mono channel is required. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
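Because mono input is required, stereo recordings need to be downmixed first. Below is a minimal stdlib-only sketch for 16-bit PCM WAV files (in practice you might use `ffmpeg` or `librosa` instead; sample-rate conversion is not covered here):

```python
import array
import wave

def to_mono(in_path: str, out_path: str) -> None:
    """Downmix a 16-bit PCM WAV file to mono by averaging the two channels."""
    with wave.open(in_path, "rb") as src:
        n_channels = src.getnchannels()
        assert src.getsampwidth() == 2, "this sketch handles 16-bit PCM only"
        framerate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if n_channels == 2:
        # Average interleaved left/right samples into one channel.
        samples = array.array(
            "h",
            ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
        )
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(framerate)
        dst.writeframes(samples.tobytes())
```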
### Output

- Output Type(s): Text String in English
- Output Format(s): String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: No maximum character length; punctuation and capitalization are transcribed. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
## Datasets

### Training Datasets

The majority of the training data comes from the NVIDIA Riva ASR training set (250k hours) and the English portion of the Granary dataset [3]:

- YouTube-Commons (YTC) (109.5k hours)
- YODAS2 (102k hours)
- Mosel (14k hours)
- LibriLight (49.5k hours)

In addition, the following datasets were used:

- Librispeech (960 hours)
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN)
- Mozilla Common Voice (v11.0)
- Mozilla Common Voice (v7.0)
- Mozilla Common Voice (v4.0)
- People's Speech
- AMI

**Data Modality:** Audio and text

**Audio Training Data Size:** 530k hours

**Data Collection Method:** Human (all audio is human-recorded)

**Labeling Method:** Hybrid (Human, Synthetic). Some transcripts are generated by ASR models, while others are manually labeled.
### Evaluation Datasets

The model was evaluated on the Hugging Face ASR Leaderboard datasets:

- AMI
- Earnings22
- Gigaspeech
- LibriSpeech test-clean
- LibriSpeech test-other
- SPGI Speech
- TEDLIUM
- VoxPopuli
## Performance

### ASR Performance (w/o PnC)

ASR performance is measured using the Word Error Rate (WER). Both ground-truth and predicted texts are processed with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/) version 0.1.12. The results obtained for other models can differ slightly from their official HF model cards because of differences in the evaluation machines.
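For reference, WER is the word-level Levenshtein (edit) distance divided by the number of reference words. A minimal pure-Python sketch is shown below; in the actual evaluation both texts are first run through whisper-normalizer, and in practice libraries such as `jiwer` are commonly used:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                           # deletion
                dp[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution or match
            )
            prev = cur
    return dp[len(hyp)] / max(len(ref), 1)

# One substitution out of three reference words -> WER = 1/3
assert abs(wer("the cat sat", "the dog sat") - 1 / 3) < 1e-9
```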
The following table shows the WER on the [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) datasets for offline inference and for streaming inference with different latency values:

| Model setup | Offline | 2.08s | 1.12s | 0.56s | 0.40s | 0.32s | 0.24s | 0.16s | 0.08s |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| nvidia/parakeet-tdt-0.6b-v2 | 6.04 | 7.99 | 22.83 | 69.55 | 95.12 | — | — | — | — |
| nvidia/nemotron-speech-streaming-en-0.6b | 6.92 | 7.46 | 6.92 | 7.09 | 9.52 | 7.64 | 8.01 | **7.84** | **8.70** |
| nvidia/parakeet-unified-en-0.6b | **5.91** | **6.14** | **6.29** | **6.52** | **6.70** | **6.92** | **7.35** | 8.44 | 15.63 |

The Parakeet-unified-en-0.6b model outperforms previous NVIDIA transducer-based models in offline mode and in streaming modes down to 240 ms latency. At 160 ms latency the unified model starts to degrade because of the lack of sufficient right context, falling slightly behind the strong streaming baseline. For 80 ms latency we recommend using the nemotron-speech-streaming-en-0.6b model instead.
## Software Integration

**Runtime Engine:** NeMo 25.11, Riva 2.25.0 or higher

**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Volta

**Test Hardware:**

- NVIDIA V100
- NVIDIA A100
- NVIDIA A6000
- DGX Spark

**Preferred/Supported Operating System(s):** Linux
## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR](https://arxiv.org/abs/2304.09325)

[3] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)

[4] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)