Introduction
Whisper is a foundation model from OpenAI. You can use this model to convert speech to text (transcription) or to translate speech in another language into English text (translation).
Foundation models are trained on massive amounts of data and form the basis of more advanced or specialized models. For instance, OpenAI trained the Whisper model on an audio dataset containing more than 680,000 hours of speech and 1.4 trillion words. This scale allows Whisper to learn patterns and relationships when performing natural language processing (NLP) tasks such as transcription and translation.
In this article, you'll deploy an Rcs Cloud GPU server, install the required libraries, and use OpenAI's Whisper model with Python to transcribe audio and translate speech into English text.
Prerequisites
Before you begin:
Deploy a new Ubuntu 22.04 A100 Rcs Cloud GPU Server with at least:
- 80 GB GPU RAM
- 12 vCPUs
- 120 GB Memory
Create a non-root user with sudo rights and switch to the account.
Install the FFmpeg Package
The Whisper model requires the FFmpeg package. FFmpeg provides tools and libraries for processing multimedia content, such as audio and video, and Whisper calls the ffmpeg command-line tool to decode audio files. Follow the steps below to install FFmpeg and the Whisper package:
Run the command that matches your Linux distribution to install FFmpeg.
Ubuntu or Debian:
    $ sudo apt update
    $ sudo apt install ffmpeg
Arch Linux:
    $ sudo pacman -S ffmpeg
Use pip to install the openai-whisper package.
    $ pip install openai-whisper
Confirm that the openai-whisper package is installed by checking its version.
    $ pip show openai-whisper
Output:
    Name: openai-whisper
    Version: 20231117
    Summary: Robust Speech Recognition via Large-Scale Weak Supervision
    Home-page: https://github.com/openai/whisper
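The installation checks above can also be run from Python before loading a model. The following is a minimal sketch that uses only the standard library; the check_environment helper and its return keys are hypothetical names chosen for illustration and are not part of Whisper.

```python
import shutil
from importlib import metadata


def check_environment(package="openai-whisper", binary="ffmpeg"):
    """Report whether the given binary and Python package are present.

    Hypothetical helper for illustration; not part of the Whisper API.
    """
    # shutil.which returns the full path of the executable, or None
    # when the executable is not found on PATH.
    binary_path = shutil.which(binary)
    try:
        # metadata.version returns the installed distribution's version string.
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = None
    return {"binary_path": binary_path, "package_version": package_version}


status = check_environment()
print("ffmpeg:", status["binary_path"] or "not found on PATH")
print("openai-whisper:", status["package_version"] or "not installed")
```

If either line reports a missing component, repeat the matching installation step before continuing.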
Transcribe an Audio File With Python
In this section, you'll download a sample audio file of a speech by Steve Jobs, the co-founder of Apple. Then, you'll use the Whisper model with Python to transcribe the audio file to text. Follow the steps below:
Download the sample steve-jobs.mp3 file using the Linux wget command.
    $ wget https://rcs.is/public/doc-assets/new/implementing-audio-transcription-with-translation-genai-series/steve-jobs.mp3
Create a new transcribe.py file using a text editor like nano.
    $ nano transcribe.py
Enter the following code into the transcribe.py file. The script loads the Whisper model, transcribes the sample audio file, and prints the resulting text.
    import whisper

    # Path to the sample audio file downloaded in the first step.
    audio_file = "steve-jobs.mp3"
    # Load the medium-sized Whisper model; the weights are downloaded on first use.
    model = whisper.load_model("medium")
    # Transcribe the audio and print the text without surrounding whitespace.
    result = model.transcribe(audio_file)
    print(result["text"].strip())
Save and close the file.
Run the transcribe.py file.
    $ python3 transcribe.py
Verify the following output.
I'm honored to be with you today for your commencement from one of the finest universities in the world. Truth be told, I never graduated from college, and this is the closest I've ever gotten to a college graduation. Today, I want to tell you three stories from ... down the road will give you the confidence to follow your heart even when it leads you off the well-worn path, and that will make all the difference.
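Besides the "text" key, the dictionary returned by model.transcribe() carries a "segments" list whose entries hold "start" and "end" times in seconds along with the segment text. The sketch below shows one way to print a timestamped transcript from that structure. The format_timestamp and format_segments helpers are hypothetical names, and sample_result is a hand-written stand-in with placeholder values, not real model output.

```python
def format_timestamp(seconds):
    """Convert a number of seconds to an MM:SS.mmm string."""
    total_ms = int(round(seconds * 1000))
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result):
    """Build one '[start -> end] text' line per segment in a transcribe() result."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


# Stand-in for the dictionary returned by model.transcribe().
# The times and sentences are placeholders, not real output.
sample_result = {
    "text": " First placeholder sentence. Second placeholder sentence.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " First placeholder sentence."},
        {"start": 2.5, "end": 65.25, "text": " Second placeholder sentence."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

In transcribe.py, passing the real result to format_segments(result) in place of the stand-in would print the speech one segment per line.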
Translate Audio File from Spanish to English
In addition to transcribing speech, you can use OpenAI's Whisper model to translate speech in other languages into English text. Follow the steps below:
Download a sample spanish.mp3 file.
    $ wget https://rcs.is/public/doc-assets/new/implementing-audio-transcription-with-translation-genai-series/spanish.mp3
Create a new translate.py file.
    $ nano translate.py
Enter the following code into the translate.py file. The script loads the Whisper model, then passes the Spanish audio sample to the model with the task set to translate, which produces English text.
    import whisper

    # Path to the Spanish audio sample downloaded in the first step.
    audio_file = "spanish.mp3"
    # Load the medium-sized Whisper model.
    model = whisper.load_model("medium")
    # task="translate" makes Whisper output English text instead of a Spanish transcript.
    result = model.transcribe(audio_file, task="translate")
    print(result["text"].strip())
Save and close the file.
Run the translate.py file.
    $ python3 translate.py
Verify the following output.
What do you think artificial intelligence is? What do I think it is? I don't know how to describe it. Something that is not natural, obviously. Artificial intelligence is, through data, introducing an algorithm
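The transcription and translation scripts differ only in the task argument passed to model.transcribe(). As a rough sketch, both can be folded into one helper. The run_whisper function, its parameter names, and the VALID_TASKS tuple are hypothetical and introduced here for illustration; only whisper.load_model() and model.transcribe() come from the Whisper package.

```python
VALID_TASKS = ("transcribe", "translate")


def run_whisper(audio_file, task="transcribe", model_name="medium"):
    """Transcribe or translate an audio file and return the stripped text.

    Hypothetical convenience wrapper; not part of the Whisper API.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")

    # Import lazily so the argument check above works even on a machine
    # where the openai-whisper package is not installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_file, task=task)
    return result["text"].strip()
```

For example, print(run_whisper("spanish.mp3", task="translate")) would reproduce the translation step, and print(run_whisper("steve-jobs.mp3")) the transcription step. Each call reloads the model, so a longer-running program would likely load it once and reuse it.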
Conclusion
In this article, you explored how to use OpenAI's Whisper foundation model to transcribe and translate sample audio files. You started by transcribing an English audio file to English text. Then, you translated a Spanish audio sample into English text.