Exploring Rcs GPU Stack | Generative AI Series
Introduction
Llama 2 is an open-source large language model from Meta, distributed through Hugging Face. You can use the model in your application to perform natural language processing (NLP) tasks.
The Rcs GPU Stack is a preconfigured compute instance with all the essential components for developing and deploying AI and ML applications. In this tutorial, you'll explore the Rcs GPU Stack environment and run a Llama 2 model in a Docker container.
Prerequisites
Before you begin:
Deploy a new Ubuntu 22.04 A100 Rcs Cloud GPU Server with at least:
80 GB GPU RAM
12 vCPUs
120 GB Memory
Establish an SSH connection to the server.
Create a non-root user with sudo rights and switch to the account.
Create a Hugging Face account.
Create a Hugging Face user access token.
Explore the Rcs GPU Stack Environment
The Rcs GPU Stack includes many packages that simplify AI model development. Follow the steps below to verify that your environment is up and running:
Check the configuration of the NVIDIA GPU server by running the nvidia-smi command.
CONSOLE
$ nvidia-smi
Confirm the following output.
OUTPUT
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-4C On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The above output confirms that the NVIDIA driver is up and running and the GPU is accessible from the OS. The output also shows that the instance has an A100 GPU attached.
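GPU stack images typically bundle Python ML frameworks. If PyTorch is installed on your instance (an assumption; adjust to whatever framework your image ships), you can also confirm GPU visibility from Python:
PYTHON
# Quick GPU visibility check (assumes PyTorch is installed on the stack).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))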
Inspect the Docker runtime environment by running the following commands:
Check the Docker version.
CONSOLE
$ sudo docker version
Output.
Client: Docker Engine - Community
Version: 24.0.7
API version: 1.43
Go version: go1.20.10
Git commit: afdd53b
Built: Thu Oct 26 09:07:41 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.7
API version: 1.43 (minimum version 1.12)
Go version: go1.20.10
Git commit: 311b9ff
Built: Thu Oct 26 09:07:41 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.24
GitCommit: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
runc:
Version: 1.1.9
GitCommit: v1.1.9-0-gccaecfc
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Display the Docker system information.
CONSOLE
$ sudo docker info
Output.
Client: Docker Engine - Community
Version: 24.0.7
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.21.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
...
Runtimes: runc io.containerd.runc.v2 nvidia
Default Runtime: runc
...
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
The above output omits some settings for brevity. However, the critical aspect is the availability of the NVIDIA container runtime, which enables Docker containers to access the underlying GPU.
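If you'd rather script this check than scan the output by eye, the sketch below shells out to the Docker CLI; it assumes Python 3 is available and that your user has sudo rights:
PYTHON
# Verify that the 'nvidia' runtime is registered with Docker.
# Assumes Python 3 and sudo rights; adjust if your user is in the docker group.
import subprocess

result = subprocess.run(
    ["sudo", "docker", "info", "--format", "{{.Runtimes}}"],
    capture_output=True, text=True, check=True,
)
print("nvidia runtime available:", "nvidia" in result.stdout)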
Run an Ubuntu image and execute the nvidia-smi command within the container.
CONSOLE
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Output.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-4C On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The above output confirms that Docker containers have access to the GPU. In the next step, you'll run Llama 2 in a container.
Run the Llama 2 Model on the Rcs GPU Stack
In this step, you'll launch the Hugging Face Text Generation Inference (TGI) container to expose the meta-llama/Llama-2-7b-chat-hf model, a 7-billion-parameter chat model, through an HTTP API. Follow the steps below:
Fill out the Llama 2 model request form.
Use the same email address to sign up for a Hugging Face account and create an access token.
Request access to the Llama-2-7b-chat-hf repository.
Run the following commands in your SSH session to initialize the environment variables. Replace YOUR_HF_TOKEN with your Hugging Face access token.
CONSOLE
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data
token=YOUR_HF_TOKEN
Create a data directory in your home directory to store the model artifacts.
CONSOLE
$ mkdir data
Run the command below to launch the ghcr.io/huggingface/text-generation-inference:1.1.0 Docker container and initialize the Llama 2 model.
CONSOLE
$ sudo docker run -d \
--name hf-tgi \
--runtime=nvidia \
--gpus all \
-e HUGGING_FACE_HUB_TOKEN=$token \
-p 8080:80 \
-v $volume:/data \
ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
Output.
Digest: sha256:55...45871608f903f7f71d7d
Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:1.1.0
78a39...f3e1dca928e00f859
Wait for the container to start and check the logs.
CONSOLE
$ sudo docker logs -f hf-tgi
The last few lines below indicate that the server is now listening for incoming HTTP connections and the API is ready.
...
...Connected
...Invalid hostname, defaulting to 0.0.0.0
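If you prefer to script the wait instead of tailing logs, a minimal readiness probe might look like the sketch below; it assumes the requests package is installed and relies on TGI's /health endpoint, which returns HTTP 200 once the model is loaded:
PYTHON
# Poll the TGI health endpoint until the model is loaded and serving.
# Assumes 'requests' is installed: pip install requests
import time

import requests

HEALTH_URL = "http://127.0.0.1:8080/health"

for _ in range(60):
    try:
        if requests.get(HEALTH_URL, timeout=5).status_code == 200:
            print("API is ready")
            break
    except requests.exceptions.ConnectionError:
        pass  # Container is still starting; retry.
    time.sleep(10)
else:
    print("Timed out waiting for the API")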
Run the following curl command to query the API.
CONSOLE
$ curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}' \
-H 'Content-Type: application/json'
Output.
JSON
{"generated_text":"\n\nDeep learning (also known as deep structured learning) is part of a broader family of machine learning techniques based on artificial neural networks—specifically, on the representation and processing of data using multiple layers of neural networks. Learning can be supervised, semi-supervised, or unsupervised.\n\nDeep-learning architectures such as Deep Neural Networks, Deep Belief Networks, and Deep Reinforcement Learning have been applied to fields including visual recognition, natural language processing, speech recognition, and expert system.\n\nDeep learning has been described as a \"paradigm shift\""}
The output confirms that the LLM is running.
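You can issue the same request from Python, which is convenient when embedding the model in an application. The sketch below uses the requests package (an assumption; any HTTP client works):
PYTHON
# POST the same payload as the curl example to the /generate endpoint.
import requests

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 128},
}
response = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["generated_text"])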
Query the Llama 2 Model Using a Jupyter Notebook
Use the Python client to invoke the model from a Jupyter Notebook by following the steps below:
Install the Hugging Face text-generation Python client by running the command below.
CONSOLE
$ pip install text-generation
Start Jupyter Lab and retrieve the access token.
CONSOLE
$ jupyter lab --ip 0.0.0.0 --port 8890
Output.
http://YOUR_SERVER_HOST_NAME:8890/lab?token=b7ab2bdscb366edsddssfsff0faeb5fa68b6b0cf
Allow port 8890 through the firewall.
CONSOLE
$ sudo ufw allow 8890
$ sudo ufw reload
Access Jupyter Lab in a browser. Replace YOUR_SERVER_IP with the public IP address of the GPU instance and YOUR_JUPYTER_LAB_TOKEN with the token from the previous output.
http://YOUR_SERVER_IP:8890/lab?token=YOUR_JUPYTER_LAB_TOKEN
Under Notebook, click Python 3 (ipykernel) and paste the following Python code.
PYTHON
from text_generation import Client

# Point the client at the TGI container's published port.
URI = 'http://localhost:8080'
tgi_client = Client(URI)

# Send a prompt and print the generated text.
prompt = 'What is the most important tourist attraction in Paris?'
print(tgi_client.generate(prompt, max_new_tokens=100).generated_text.strip())
Run the above code. The LLM responds to your query and displays the following response.
Paris, the City of Light, is known for its iconic landmarks, cultural institutions, and historical significance. As one of the most popular tourist destinations in the world, Paris has a plethora of attractions that draw visitors from all over the globe. While opinions may vary, some of the most important tourist attractions in Paris include:
1. The Eiffel Tower: The most iconic symbol of Paris, the Eiffel Tower
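For chat-style interfaces where output should appear incrementally, the same client exposes a generate_stream method. A short sketch:
PYTHON
# Stream tokens as they are generated instead of waiting for the full reply.
from text_generation import Client

tgi_client = Client('http://localhost:8080')
prompt = 'What is the most important tourist attraction in Paris?'

for response in tgi_client.generate_stream(prompt, max_new_tokens=100):
    # Skip special tokens such as end-of-sequence markers.
    if not response.token.special:
        print(response.token.text, end='', flush=True)
print()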
Conclusion
This tutorial walked you through running the Llama 2 model in a container on the Rcs GPU Stack. In the next section, you'll explore other advanced LLMs.