Introduction
Automatic Speech Recognition (ASR) is a pivotal technology that has transformed how we interact with the digital world. At its core, ASR lets you create applications that understand human speech and transcribe it into text. Applications built on ASR use voice input for transcription and translation tasks; most notably, voice assistants rely on speech recognition to process commands and generate results.
Whisper is an open-source large neural network model that approaches human-level robustness and accuracy in speech recognition across multiple languages. By running Whisper on a Rcs Cloud GPU instance, you can build a high-performance automatic speech recognition system.
This article explains how to build an automatic speech recognition system on a Rcs Cloud GPU server.
Prerequisites
Before you begin:
Deploy a Debian server with at least
- 1/7 GPU
- 10GB GPU RAM
- 2 vCPU
- 15GB memory
Using SSH, access the server
Create a non-root user with sudo privileges
Switch to the user account
$ su example_user
Set Up the Server
To perform speech recognition tasks, install the necessary dependencies required by the Whisper model. In addition, set up a development environment such as Jupyter Notebook to run Python code as described in the steps below.
Install the FFmpeg media processing package
$ sudo apt install ffmpeg
Install the Python virtual environment package
$ sudo apt install python3-venv
Create a new Python virtual environment
$ python3 -m venv audio-env
Activate the virtual environment
$ source audio-env/bin/activate
Update the Pip package manager
$ pip install --upgrade pip
Using pip, install the PyTorch, transformers, and datasets packages
$ pip install torch transformers datasets
* torch: Installs the latest PyTorch version (you can verify the GPU-enabled installation with the check below)
* transformers: Provides thousands of pre-trained models to perform various multimodal tasks on text, vision, and audio
* datasets: Provides efficient data pre-processing for audio data
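Optionally, confirm that PyTorch detects the GPU before continuing. The short check below is not part of the original setup; run it in a Python shell inside the activated virtual environment, or later in a notebook cell.
import torch

# Print the installed PyTorch version and whether a CUDA-capable GPU is visible
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))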
Install Jupyter Notebook
$ pip install notebook
Allow the Jupyter Notebook port 8888 through the firewall
$ sudo ufw allow 8888/tcp
Start Jupyter Notebook
$ jupyter notebook --ip=0.0.0.0
The above command starts a Jupyter Notebook session that listens for incoming connections on all network interfaces. If the above command fails to run, stop your SSH session, and re-establish a connection to the server.
When successful, an access token displays in your output like the one below:
[I 2023-09-06 02:43:28.807 ServerApp] jupyterlab | extension was successfully loaded.
[I 2023-09-06 02:43:28.809 ServerApp] notebook | extension was successfully loaded.
[I 2023-09-06 02:43:28.809 ServerApp] Serving notebooks from local directory: /root
[I 2023-09-06 02:43:28.809 ServerApp] Jupyter Server 2.7.3 is running at:
[I 2023-09-06 02:43:28.809 ServerApp] http://HOSTNAME:8888/tree?token=639d5e7a34b146eb1b61aa44c419334cc0ede8e8b02e15e6
[I 2023-09-06 02:43:28.809 ServerApp] http://127.0.0.1:8888/tree?token=639d5e7a34b146eb1b61aa44c419334cc0ede8e8b02e15e6
[I 2023-09-06 02:43:28.809 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 2023-09-06 02:43:28.812 ServerApp] No web browser found: Error('could not locate runnable browser').
[C 2023-09-06 02:43:28.812 ServerApp]
To access the server, open this file in a browser:
file:///example_user/.local/share/jupyter/runtime/jpserver-10747-open.html
Or copy and paste one of these URLs:
http://HOSTNAME:8888/tree?token=639d5e7a34b146eb1b61aa44c419334cc0ede8e8b02e15e6
http://127.0.0.1:8888/tree?token=639d5e7a34b146eb1b61aa44c419334cc0ede8e8b02e15e6
Using a web browser such as Firefox, access the Jupyter Notebook using your access token
http://SERVER_IP_HERE:8888/tree?token=TOKEN_HERE
Transcribe Speech in English
Within the Jupyter interface, click New and select Notebook from the dropdown list
When prompted, click Select to create a new Python3 Kernel file
In the new code cell, update Jupyter and ipywidgets
!pip install --upgrade jupyter ipywidgets
Import the required libraries
import requests
import json
from transformers import pipeline
from datasets.arrow_dataset import Dataset
from IPython.display import Audio
Define a function to load the sample audio file from a URL
def load_wav(url):
    # Download the remote audio file and save it locally as test.wav
    local_path = "test.wav"
    with open(local_path, "wb") as audio_file:
        resp = requests.get(url)
        audio_file.write(resp.content)
    # Wrap the local file path in a datasets Dataset and return the audio entry
    ds = Dataset.from_dict({"audio": [local_path]})
    return ds[0]["audio"]
Load a sample audio file with the speech in English
url_en = "https://www.signalogic.com/melp/EngSamples/Orig/female.wav"
sample = load_wav(url_en)
The above code downloads a public speech audio sample. Replace the URL with a link to your own audio file or stream to use it for speech recognition
Verify and play the loaded Audio in your session
Audio(sample)
Create the automatic speech recognition pipeline
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v2",
chunk_length_s=30,
device="cuda",
)
In the above code:
* model: Determines the specific Whisper model to use. The code uses openai/whisper-large-v2 for the best possible recognition accuracy and robustness (a lighter alternative is sketched after this list)
* chunk_length_s: Enables the audio chunking algorithm that splits long audio into smaller pieces for processing, because the Whisper model works on audio samples with a duration of up to 30 seconds
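If your GPU instance has limited memory, a smaller Whisper checkpoint trades some accuracy for a lower memory footprint. The sketch below is an optional variation rather than part of the original walkthrough, and assumes the openai/whisper-small checkpoint is sufficient for your use case.
# Optional sketch: a lighter pipeline using a smaller Whisper checkpoint
pipe_small = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
    device="cuda",
)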
Run the audio recognition task
prediction = pipe(sample, batch_size=8, return_timestamps=True)["chunks"]
print(json.dumps(prediction, sort_keys=True, indent=4))
For the example audio file used in this article, your output should look like the one below:
[
{
"text": " Perhaps this is what gives the Aborigine his odd air of dignity.",
"timestamp": [
0.0,
3.48
]
},
{
"text": " Turbulent tides rose as much as fifty feet.",
"timestamp": [
3.48,
6.04
]
},
…
]
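Because each chunk in the output above carries a [start, end] timestamp, you can post-process the prediction into other formats. The following helper is a hypothetical example (chunks_to_srt is not part of the article's code) that writes the recognized chunks to a simple SubRip (.srt) subtitle file.
def chunks_to_srt(chunks, path="transcript.srt"):
    # Convert a time in seconds to an SRT timestamp string (HH:MM:SS,mmm)
    def fmt(seconds):
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((seconds - int(seconds)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    with open(path, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks, start=1):
            start, end = chunk["timestamp"]
            # Fall back to the start time if the end timestamp is missing
            end = end if end is not None else start
            f.write(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text'].strip()}\n\n")

chunks_to_srt(prediction)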
Transcribe Speech in a Different Language
Load a new sample audio file containing French (fr) speech
url_fr = "https://www.signalogic.com/melp/FrenchSamples/Orig/f_m.wav"
sample = load_wav(url_fr)
Audio(sample)
Create the French transcription pipeline
pipe = None
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v2",
chunk_length_s=30,
device="cuda",
generate_kwargs={"language":"french","task": "transcribe"},
)
Verify that your target language french is set in the generate_kwargs parameter (a language auto-detection variation is sketched after the output below)
Run the audio transcription for the French speech
prediction = pipe(sample, batch_size=8, return_timestamps=True)["chunks"]
print(json.dumps(prediction, sort_keys=True, indent=4))
Your output should look like the one below:
[
{
"text": " La bise et le soleil se disputaient, chacun assurait qu'il \u00e9tait le plus fort,",
"timestamp": [
0.0,
5.0
]
},
{
"text": " quand ils virent un voyageur s'avancer envelopp\u00e9 dans son manteau.",
"timestamp": [
5.0,
9.0
]
},
…
]
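If you are not sure which language an audio file contains, you can omit the language entry from generate_kwargs. As an assumption about the Hugging Face Whisper integration, the model then attempts to detect the spoken language automatically before transcribing; the sketch below illustrates this variation.
# Sketch: omit "language" so Whisper attempts automatic language detection (assumption)
pipe_auto = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device="cuda",
    generate_kwargs={"task": "transcribe"},
)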
Translate Speech from a Different Language to English Text
To perform translation, recreate the pipeline with the generate_kwargs task changed from transcribe to translate (a variation that reuses the existing pipeline is sketched after the output below)
pipe = None
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v2",
chunk_length_s=30,
device="cuda",
generate_kwargs={"language":"french","task": "translate"},
)
Run the audio translation
prediction = pipe(sample, batch_size=8, return_timestamps=True)["chunks"]
print(json.dumps(prediction, sort_keys=True, indent=4))
Your translation output should look like the one below:
[
{
"text": " The abyss and the sun were at war.",
"timestamp": [
0.0,
2.0
]
},
{
"text": " Each one assured that he was the strongest",
"timestamp": [
2.0,
5.0
]
},
…
]
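Rather than recreating the pipeline each time you switch between transcription and translation, recent transformers releases also accept generate_kwargs per call. Treat the following as an assumption-based sketch, not the article's method, and verify it against your installed transformers version.
# Sketch: reuse the existing pipeline and override the task per call (assumption)
prediction = pipe(
    sample,
    batch_size=8,
    return_timestamps=True,
    generate_kwargs={"language": "french", "task": "translate"},
)["chunks"]
print(json.dumps(prediction, sort_keys=True, indent=4))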
Conclusion
In this article, you built an automatic speech recognition system on a Rcs Cloud GPU server. You used the Whisper model to transcribe both English and French speech and to translate French speech into English text. The accuracy of speech recognition and translation allows you to achieve high-quality results without any additional fine-tuning. For more information about Whisper, visit the official research page.
Next Steps
To implement more solutions on your Rcs Cloud GPU Server, visit the following resources:
Stylish Logo Creation with Stable Diffusion on Rcs Cloud GPU
How to Use Vector Embeddings on Rcs Cloud GPU
How to use Hugging Face Transformer Models on a Rcs Cloud GPU server
AI Face Restoration using GFPGAN on Rcs Cloud GPU
How to Use Meta Llama 2 Large Language Model on Rcs Cloud GPU