How to Build a Voice Translation Application Using NVIDIA NeMo

Introduction

Neural Modules (NeMo) is an open-source toolkit for building conversational AI applications. It's part of the NVIDIA GPU Cloud (NGC) catalog, a centralized repository of tools, frameworks, and pre-trained models that speed up the development, deployment, and management of Artificial Intelligence and high-performance computing workloads. NGC GPU-accelerated containers are also an essential part of the catalog; they come pre-configured with optimized software and libraries that take advantage of GPU resources for accelerated performance.

This article explains how to use the NeMo framework in a GPU-accelerated PyTorch container to perform Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks. You install and run the PyTorch container, then use NeMo pre-trained models to convert a French audio sample into English audio.

Prerequisites

Before you begin:

  • Deploy a fresh Ubuntu 22.04 GPU Stack server using the Rcs marketplace application with at least:

    • 1/3 GPU
    • 20 GB GPU RAM
    • 3 vCPUs
    • 30 GB Memory
  • Using SSH, access the server
  • Create a non-root user with sudo rights
  • Switch to the non-root user account. Replace sysadmin with your actual user

     # su sysadmin

Install PyTorch and Access Jupyter Notebook

To use the NeMo framework on a cloud GPU server, install and run the PyTorch GPU container with port binding using Docker. Then, access the Jupyter Notebook service pre-installed in the container as described in the steps below.

  1. Using Docker, install and run the PyTorch GPU container

     $ sudo docker run --gpus all -p 9000:8888 -it nvcr.io/nvidia/pytorch:23.09-py3

    The above command runs the PyTorch GPU-accelerated docker container with the following configurations:

    • --gpus all: Allocates all available server GPUs to the Docker container so that GPU-accelerated tasks can use all available GPU resources
    • -p 9000:8888: Maps the host port 9000 on your server to the container port 8888. This creates a separate Jupyter Notebook access port different from the Rcs GPU Stack Jupyter Lab service that runs on port 8888
    • -it: Starts the container in interactive mode with access to its shell

    When successful, verify that your server prompt changes to the root container shell

     root@4a09da260af2:/workspace#
  2. Start Jupyter Notebook as a background process

     # jupyter notebook --ip=0.0.0.0 &

    Your output should look like the one below:

         To access the notebook, open this file in a browser:
         file:///root/.local/share/jupyter/runtime/nbserver-369-open.html
     Or copy and paste this URL:
         http://hostname:8888/?token=c5b30aac114cd01d225975a9c57eafb630a5659dde4c65a8

    As displayed in the above output, copy the generated token value (the string after ?token=) to securely access Jupyter Notebook in your web browser.

  3. Using a web browser such as Chrome, access Jupyter Notebook on your public server IP on port 9000 using the generated access token

     http://SERVER-IP:9000/?token=YOUR_TOKEN

Run the Pre-Trained Models

To use pre-trained models and necessary NeMo functions, import the NeMo modules. Then, initialize the pre-trained models, and perform tasks like audio transcription and text-to-speech synthesis in a Jupyter Notebook session as described below.

  1. Access the Jupyter Notebook web interface

  2. On the right side of the file list, click the New dropdown to reveal a list of options

  3. Select Python 3 (ipykernel) under the Notebook: category to open a new file

  4. Within the new Jupyter Notebook file, add the following code in a new cell to install the necessary dependency packages

     !pip install Cython nemo_toolkit[all] hydra-core transformers sentencepiece webdataset youtokentome pyannote.metrics jiwer ijson sacremoses sacrebleu rouge_score einops unidic-lite mecab-python3 opencc pangu ipadic wandb nemo_text_processing pytorch-lightning

    Below is what each package represents:

    • Cython: A Python module that allows you to write C extensions for Python. It's often used for performance optimization
    • nemo_toolkit[all]: A framework for building conversational AI models. The [all] flag installs all available components and NeMo dependencies
    • hydra-core: A framework for configuring complex applications. It's used to manage configuration settings in a clean and organized way
    • transformers: Works with pre-trained models in Natural Language Processing (NLP), including models like BERT, GPT-2, among others
    • sentencepiece: A library that performs text tokenization and segmentation, often used in NLP tasks
    • webdataset: Performs efficient data loading and augmentation; it's particularly useful in deep learning workflows
    • youtokentome: A library that performs subword tokenization, useful for language modeling tasks
    • pyannote.metrics: A toolkit for speaker diarization and audio analysis tasks that contains evaluation metrics for these tasks
    • jiwer: A library for computing the Word Error Rate (WER), a common metric used in Automatic Speech Recognition (ASR) and other speech-processing tasks
    • ijson: A library for parsing large JSON documents incrementally. It's useful for working efficiently with large data files
    • sacremoses: A Python library that performs tokenization, de-tokenization, and various text-processing tasks
    • sacrebleu: Evaluates machine translation quality using the BLEU metric
    • rouge_score: A library for computing the ROUGE evaluation metric often used in text summarization and machine translation
    • einops: A library for tensor operations and reshaping useful when developing deep learning models
    • unidic-lite: A lightweight morphological analysis dictionary for Japanese
    • mecab-python3: A Python binding for MeCab, a Japanese tokenizer and part-of-speech tagger
    • opencc: A library for simplified and traditional Chinese text conversion
    • pangu: A text spacing library that adds spaces between CJK and Latin characters
    • ipadic: A morphological analysis dictionary for Japanese, used with MeCab
    • wandb: Tracks and visualizes machine learning experiments
    • nemo_text_processing: Contains text-processing utilities specific to the NVIDIA NeMo toolkit
    • pytorch-lightning: A lightweight wrapper for PyTorch that simplifies training Deep Learning models
  5. Press Run on the main menu bar or press Ctrl + Enter to install the packages
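    After installation completes, you can optionally confirm that the toolkit imports correctly before you continue. The cell below is a minimal check; the exact version string depends on the NeMo release that pip resolved:

     import nemo

     # Print the installed NeMo version to confirm the toolkit is importable
     print(nemo.__version__)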

  6. In a new code cell, import the necessary modules

     import nemo
     import nemo.collections.asr as nemo_asr
     import nemo.collections.nlp as nemo_nlp
     import nemo.collections.tts as nemo_tts
     import IPython

    The above commands import the modules required to run the NeMo pre-trained models. Below is what each module represents:

    • nemo: Allows you to access NeMo's functionalities and classes
    • nemo.collections.asr: Allows you to access NeMo's ASR-related functionalities and models
    • nemo_nlp: Allows you to use NeMo's NLP-related tools, models, and utilities
    • nemo_tts: Allows you to use NeMo's TTS-related functionalities and models
    • IPython: Allows you to interactively run and experiment with NeMo code
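
    Because the models are loaded onto the GPU with .cuda() in a later step, you can optionally confirm first that PyTorch can see the GPU. A minimal sketch:

     import torch

     # Confirm that a CUDA-capable GPU is visible to PyTorch
     print(torch.cuda.is_available())
     print(torch.cuda.get_device_name(0))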
  7. List the available pre-trained models in the NGC NeMo catalog

     nemo_asr.models.EncDecCTCModel.list_available_models()
     nemo_nlp.models.MTEncDecModel.list_available_models()
     nemo_tts.models.HifiGanModel.list_available_models()
     nemo_tts.models.FastPitchModel.list_available_models()

    The above commands list the available models in the following categories:

    • Automatic speech recognition
    • Encoder-decoder: A Natural Language Processing (NLP) machine translation collection, part of the MTEncDec category
    • Text-to-speech: HifiGan and FastPitch

    Based on the catalog, use the following models:

    • stt_fr_quartznet15x5: For speech recognition, specific to the French language
    • nmt_fr_en_transformer12x2: Translates text from French to English
    • tts_en_fastpitch: Generates a spectrogram from input text for text-to-speech synthesis
    • tts_en_lj_hifigan_ft_mixertts: Converts spectrograms into speech audio for TTS
  8. Initialize the models

     asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_fr_quartznet15x5').cuda()
     nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_fr_en_transformer12x2').cuda()
     spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name='tts_en_fastpitch').cuda()
     vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name='tts_en_lj_hifigan_ft_mixertts').cuda()

    Model downloads and initialization can take 15 minutes or more; wait for all four models to finish loading before you continue
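
    Models restored with from_pretrained() are generally ready for inference, but you can optionally switch them to evaluation mode to make sure layers such as dropout are disabled. A minimal precautionary sketch:

     # Put all four models into evaluation (inference) mode
     for model in (asr_model, nmt_model, spectrogram_generator, vocoder):
         model.eval()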

Perform Audio Transcription and Synthesis

  1. Download a French audio sample. Replace the link with your desired audio source URL

     !wget 'https://lightbulblanguages.co.uk/resources/audio/bonjour.mp3'
     audio_sample = 'bonjour.mp3'
     IPython.display.Audio(audio_sample)

    The above commands download the public French MP3 sample audio file bonjour.mp3 and save it in your Jupyter Notebook working directory. In addition, IPython's Audio widget displays and plays the audio file in your Jupyter Notebook session

  2. Transcribe the audio sample to text

     transcribed_text = asr_model.transcribe([audio_sample])
     print(transcribed_text)

    The above commands pass the audio sample to the speech recognition model and display the transcribed text

    Output:

     ['bonjour']
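
    The transcribe() call accepts a list of file paths, so you can transcribe several audio files in one batch. A minimal sketch, assuming an extra file such as merci.mp3 exists in the working directory (that file name is hypothetical):

     # Transcribe multiple French audio files in a single batch call
     transcriptions = asr_model.transcribe(['bonjour.mp3', 'merci.mp3'])
     print(transcriptions)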
  3. Translate the text to English

     english_text = nmt_model.translate(transcribed_text)
     print(english_text)

    The above command uses the pre-trained model to convert the French text to English and displays the converted text

    Output:

     ['hello']
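
    The translate() method also accepts multiple sentences at once, and the MTEncDecModel API exposes optional source_lang and target_lang arguments. The sketch below is an assumption-laden example; this single-pair French-to-English model may not require the language arguments:

     # Translate several French sentences in one call, stating the language pair
     translations = nmt_model.translate(['bonjour', 'merci beaucoup'], source_lang='fr', target_lang='en')
     print(translations)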
  4. Generate a Spectrogram

     parseText = spectrogram_generator.parse(english_text[0])
     spectrogram = spectrogram_generator.generate_spectrogram(tokens=parseText)

    The above commands convert the English text into a spectrogram. This is a preprocessing step in text-to-speech synthesis; the spectrogram represents the spectral characteristics of the audio to generate
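
    Inference does not need gradient tracking, so you can optionally wrap the generation calls in torch.no_grad() to reduce GPU memory use. A minimal sketch of the same step under that approach:

     import torch

     # Generate the spectrogram without building a gradient graph
     with torch.no_grad():
         parseText = spectrogram_generator.parse(english_text[0])
         spectrogram = spectrogram_generator.generate_spectrogram(tokens=parseText)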

  5. Convert the spectrogram to audio

     audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
     audioOutput = audio.to('cpu').detach().numpy()

    The above commands run the vocoder on the spectrogram to generate the audio waveform, then move the output to the CPU and convert it to a NumPy array

  6. Play the generated audio

     IPython.display.Audio(audioOutput, rate=22050)

    Verify that the generated audio matches your English text. The audio plays at a sample rate of 22050 Hz
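
    To keep the result, you can optionally write the waveform to a WAV file. The sketch below assumes the soundfile package is available in the container (install it with pip install soundfile if the import fails); the output file name is an example:

     import soundfile as sf

     # audioOutput has shape (1, num_samples); write the first (only) item as a 22050 Hz WAV file
     sf.write('hello_en.wav', audioOutput[0], 22050)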

Conclusion

You have built an AI translator using NeMo pre-trained models and an NGC GPU-accelerated container on a Rcs Cloud GPU server. You transcribed a French audio sample to French text, translated the text to English, and synthesized the English text into an English audio sample. Using NeMo modules and pre-trained models from the NGC catalog makes the speech translation pipeline efficient and convenient to build.
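
As a recap, the individual steps above can be combined into a single helper function. This is a minimal sketch that assumes the four models initialized earlier and the IPython import are still in scope in your notebook session:

     import torch

     def translate_audio(audio_path):
         """Transcribe French audio, translate it to English, and synthesize English speech."""
         with torch.no_grad():
             # French speech -> French text
             french_text = asr_model.transcribe([audio_path])
             # French text -> English text
             english_text = nmt_model.translate(french_text)
             # English text -> spectrogram -> waveform
             tokens = spectrogram_generator.parse(english_text[0])
             spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens)
             audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
         return english_text[0], audio.to('cpu').detach().numpy()

     # Run the full pipeline on the sample file and play the result
     text, waveform = translate_audio('bonjour.mp3')
     IPython.display.Audio(waveform, rate=22050)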

More Information

For more information, visit the following documentation resources:

  • NGC Catalog
  • PyTorch GPU Container Image

