Introduction
Transformer machine learning models are versatile and can be adapted to perform a broad range of tasks. In particular, LLMs with their language-related abilities are valuable for many use cases. To get started with these models, you can use a pre-trained model, such as GPT-J or Falcon, or train (fine-tune) a pre-trained model for a specific task.
Inference is the process of applying a model to input data to produce a specific output. To serve a model to users over the internet, build an inference API, and put the model into production. To build an inference API, you need:
An AI model, such as an LLM (pre-trained or fine-tuned)
A web framework to build and serve APIs
A system that's configured to accept and serve requests over the internet
This article explains how to implement each of these building blocks and get a functional inference API running on a Rcs Cloud Server.
Scope
In this article, you will build inference APIs for Hugging Face Transformer models, and examples are based on text generation using the pre-trained GPT-J model with 6 billion parameters. You will also use the smaller GPT Neo 125M model which can be run with 1 GB GPU RAM. However, the output quality is noticeably worse when using smaller pre-trained models.
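As a rough sanity check on the GPU memory figures above, the memory needed just to hold a model's weights can be approximated as the parameter count multiplied by the bytes per parameter. The sketch below assumes half-precision (float16) weights at 2 bytes per parameter. The function name is illustrative, and the estimate ignores activations and other runtime overhead, so real usage is higher.

```python
def estimate_weight_memory_gib(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Approximate GPU memory (in GiB) needed to hold the model weights only.

    Assumption: every parameter is stored with the same width
    (2 bytes for float16). Runtime overhead is not included.
    """
    return num_parameters * bytes_per_parameter / (1024 ** 3)


# Parameter counts taken from the model names: 6 billion and 125 million.
for name, params in (("GPT-J 6B", 6_000_000_000), ("GPT Neo 125M", 125_000_000)):
    print(f"{name}: ~{estimate_weight_memory_gib(params):.2f} GiB in float16")
```

Under these assumptions, the GPT-J weights come to roughly 11 GiB and the GPT Neo 125M weights to well under 1 GiB, which is in line with the GPU sizes mentioned in this article.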
FastAPI is used to build the API interface, and Gunicorn with Uvicorn workers serves the API. Python-based frameworks such as Django are useful when building full-fledged web applications. When building an API-only application using a Python-based framework, FastAPI is a well-suited choice, and it is the one used in this article.
Prerequisites
Before you begin:
Deploy a Debian A100 Rcs Cloud GPU Server with at least:
- 16 GB GPU RAM
Using SSH, access the server as a non-root sudo user.
Have basic skills about:
Set up the Server
In this section, set up the Debian server with the necessary packages required to run an inference API using Hugging Face transformer models. You will install tools to run the models and serve the API as described in the steps below.
Install htop and Tmux:
$ sudo apt install -y htop tmux
Download the Conda installer.
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Run the installer.
$ bash Miniconda3-latest-Linux-x86_64.sh
Reply to the Installation prompts as below:
Do you accept the license terms? [yes|no]
[no] >>> yes

Miniconda3 will now be installed into this location:
/home/example-user/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/example-user/miniconda3] >>>
PREFIX=/home/example-user/miniconda3
Unpacking payload ...

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
When the installation is successful, disconnect your SSH session.
$ exit
Start a new SSH session to activate Conda.
$ ssh example-user@SERVER-IP
When logged in, your prompt should look like the one below:
(base) example-user@Test:~$
Upgrade Conda.
$ conda upgrade -y conda
Create a new Conda environment env1 with the latest Python3 version 3.11.
$ conda create -y --name env1 python=3.11
Verify the latest Python3 version before installing 3.11.
Activate the environment env1.
$ conda activate env1
Upgrade pip.
$ pip install --upgrade pip
Using Conda, install the CUDA GPU packages.
$ conda install -y -c conda-forge cudatoolkit=11.8 cudnn=8.2
Install Pytorch and related GPU dependencies.
$ conda install -y -c pytorch -c nvidia pytorch=2.0.1 pytorch-cuda=11.8
Add the Conda environment's library directory to the shared library search path.
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
Create the activation directory.
$ mkdir -p $CONDA_PREFIX/etc/conda/activate.d
Append paths to the Nvidia tools to the activation shell script.
$ echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
Load the activation script to apply the environment variables in the current session.
$ source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
Install tensorflow, transformers, huggingface-hub, Nvidia tools, and dependencies like accelerate.
$ pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.* transformers==4.30.* huggingface-hub accelerate==0.20.3 xformers==0.0.20
To test GPU integration, run the following Python command.
$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
If your output looks like the one below, Python and Conda environments have access to the machine's GPU.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Install FastAPI and its Pydantic, Uvicorn, and Gunicorn dependencies.
$ pip install fastapi==0.100.0 pydantic==1.10.4 "uvicorn[standard]"==0.22.0 gunicorn==20.1.0
The above command installs:
FastAPI, the Python framework used to build the API application.
Pydantic, a Python-based data validation library used to implement custom data types.
Uvicorn, a Python-based low-level web server for asynchronous applications based on the Asynchronous Server Gateway Interface (ASGI) standard.
Gunicorn, a Python-based HTTP server based on the Web Server Gateway Interface (WSGI) standard to serve the API.
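To make the difference between the two gateway interfaces more concrete, the sketch below shows a minimal application written against each of them using only the Python standard library. The function names and the `call_asgi_once` helper are illustrative and not part of FastAPI, Uvicorn, or Gunicorn. The helper only imitates what a server would do for a single request.

```python
import asyncio


def wsgi_app(environ, start_response):
    """WSGI: a synchronous callable that the server invokes once per request."""
    body = b"hello from WSGI"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


async def asgi_app(scope, receive, send):
    """ASGI: an async callable that exchanges event messages with the server."""
    if scope["type"] != "http":
        return  # this sketch ignores non-HTTP scopes such as lifespan
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"hello from ASGI"})


def call_asgi_once(app):
    """Drive an ASGI app with a fake server and return the messages it sent."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    asyncio.run(app({"type": "http", "method": "GET", "path": "/"}, receive, send))
    return sent
```

A FastAPI application is an ASGI application, which is why Gunicorn is started with the Uvicorn worker class later in this article.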
Inference API for Text Generation
In this section, set up a basic API to serve a text generation model, and configure it for production use as described below.
Using a text editor such as Nano, create a new Python file app.py.
$ nano app.py
Add the following code to the file.
# import this transformer to run GPT J 6B
from transformers import GPTJForCausalLM
# import this transformer to run GPT Neo 125M
# from transformers import GPTNeoForCausalLM

from transformers import pipeline, AutoTokenizer
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from uvicorn.workers import UvicornWorker

# use this tokenizer to run GPT J 6B
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
# use this tokenizer to run GPT Neo 125M
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")

# use this model for GPT J 6B
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()
# use this model for GPT Neo 125M
# model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()

generate_pipeline = pipeline(task="text-generation", model=model, device=0, max_length=500, do_sample=True, num_return_sequences=1, tokenizer=tokenizer)

app = FastAPI()

class InputPrompt(BaseModel):
    text: str

class GeneratedText(BaseModel):
    text: str

@app.post("/generate", response_model=GeneratedText)
async def generate_func(prompt: InputPrompt):
    output = generate_pipeline(prompt.text)
    return {"text": output[0]["generated_text"]}
Save and close the file.
The above code imports all necessary packages, defines the tokenizer and model, and uses Hugging Face pipelines to declare a text-generation pipeline using the GPT-J model. app = FastAPI() creates the application instance that exposes the pipeline through an API endpoint. Two classes, InputPrompt and GeneratedText, are declared to specify the input and output data types. The POST endpoint generate accepts user input as a JSON object in the body of the HTTP request, and generate_func passes the text entered by the user to the pipeline before returning the generated text to the user.
Using Gunicorn, run the App.
$ gunicorn app:app --timeout 1000 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080
The above command starts a server listening on port 8080 on all network interfaces (0.0.0.0), and serves app, the FastAPI app declared in the app.py file.
Using curl, test the application with POST data:
$ curl -X 'POST' 'http://localhost:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"text": "my name is "}'
In the above command, curl sends data as a JSON object. -d specifies the request body data, accept specifies the data type the client can accept and understand, and Content-Type specifies the request data type.
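The same request can also be sent from a Python script. The sketch below uses only the standard library and mirrors the headers and JSON body of the curl command. The function names and the timeout value are illustrative choices, and the URL assumes the server from this article is listening on port 8080.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for the /generate endpoint without sending it."""
    payload = json.dumps({"text": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=payload,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def generate_text(base_url: str, prompt: str, timeout: float = 1000.0) -> str:
    """Send the prompt to the API and return the generated text."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["text"]


# Example call (requires the API server to be running):
# print(generate_text("http://localhost:8080", "my name is "))
```

Building the request separately from sending it keeps the payload easy to inspect before any network traffic occurs.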
Expose the API Server to the Internet
To allow external connections to the API server, and serve Internet user requests, open the API server ports through the firewall as described in this section.
By default, UFW is active on Rcs Debian servers. Verify the firewall status.
$ sudo ufw status
Allow connections to the API Server port 8080.
$ sudo ufw allow 8080/tcp
Reload Firewall rules to apply changes.
$ sudo ufw reload
In your local terminal session, connect to the inference API over the Internet.
$ curl -X 'POST' 'http://remote.server.ip.address:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"text": "my name is "}'
To stop the API server when it runs as a background job, list the shell jobs to find its job ID.
$ jobs
Output:
[1]+ Running
Stop the job using its job ID.
$ kill %1
In this section you implemented a basic inference API with type safety on the text generation pipeline. To run inference on other types of pipelines and models, modify the pipeline and type definitions as desired.
Serve Fine-tuned Models
The process for serving fine-tuned models is similar to serving pre-trained models. Before serving a fine-tuned model, train and save it. Alternatively, you can download and save a fine-tuned model for implementation.
For example, instead of:
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()
Use the path to the saved fine-tuned model, together with the matching model class (GPTNeoForCausalLM in this example, which must also be imported in app.py), as below:
model = GPTNeoForCausalLM.from_pretrained("Rcs/fine_tuned_gpt_neo_125", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()
The model path Rcs/fine_tuned_gpt_neo_125 in the above code is based on the examples for Fine Tuning a Hugging Face Transformer Model on Rcs Cloud GPU.
Multithreading
To run the server with multiple workers, run Gunicorn with the --workers option:
$ gunicorn app:app --workers 2 --timeout 1000 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080
The above command forks the main process into N worker processes. When unspecified, N defaults to 1. For each fork, an instance of the model is replicated in the GPU. Hence, if the model needs X GB GPU RAM to run, having N workers needs around X*N GB of GPU RAM.
When Gunicorn is started with multiple workers, use ps to check the system processes to view the different forks.
$ ps -ax | grep python
Your output should look like the one below:
21895 pts/1 Dl+ 1:08 /root/miniconda3/envs/env1/bin/python /root/miniconda3/envs/env1/bin/gunicorn app:app --workers 2 --timeout 1000 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080
Each line displayed in the output is a fork of the main process. Run watch nvidia-smi at the Linux terminal to monitor the GPU usage in real-time. Verify that each of the forked processes occupies its own GPU memory space.
To decide the approximate number of workers, the common rule below is used.
N = number of threads + 1
On regular cloud servers, 1 vCPU is equal to 1 thread, while dedicated servers with hyperthreading provide 2 threads per CPU core. When running multiple workers, verify that the amount of GPU memory is enough to run N copies of the model. Otherwise, the system outputs an out-of-memory error.
In some cases, even with enough GPU memory, loading a large model with multiple workers can display worker termination warnings as below:
[20153] [WARNING] Worker with pid 20154 was terminated due to signal 9
Run the dmesg utility to view a more detailed output.
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-3.scope,task=gunicorn,pid=20154,uid=0
[ 4412.006670] Out of memory: Killed process 20154 (gunicorn) total-vm:38942992kB, anon-rss:30142812kB, file-rss:2304kB, shmem-rss:0kB, UID:0 pgtables:68700kB oom_score_adj:0
As per the output, loading the worker objects (such as its copy of the model) exhausted the system memory, and the kernel terminated the worker. In general, the system automatically spawns another process and recovers from this error. If the warnings persist, try increasing the timeout value.
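The worker-count rule and the GPU memory constraint described in this section can be combined into one small calculation. The sketch below is only an illustration. The function name and parameters are hypothetical, the memory figures must be measured for your own model, and system RAM pressure while the workers load (the cause of the warning above) is not modelled.

```python
def suggested_worker_count(cpu_threads: int, gpu_memory_gb: float, model_memory_gb: float) -> int:
    """Return a worker count limited by both CPU threads and GPU memory.

    Applies N = threads + 1, then caps N so that N copies of the model
    (about model_memory_gb each) still fit in gpu_memory_gb.
    """
    if model_memory_gb <= 0:
        raise ValueError("model_memory_gb must be positive")
    workers_by_cpu = cpu_threads + 1
    workers_by_gpu = int(gpu_memory_gb // model_memory_gb)
    if workers_by_gpu < 1:
        raise ValueError("the GPU cannot hold even one copy of the model")
    return min(workers_by_cpu, workers_by_gpu)


# Hypothetical figures: 4 threads, 80 GB of GPU memory, 12 GB per model copy.
print(suggested_worker_count(cpu_threads=4, gpu_memory_gb=80, model_memory_gb=12))
```

The result is a starting point for the --workers option. Confirm the actual memory usage with nvidia-smi before relying on it.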
Python functions can be defined using either def or async def. Using async def together with N Uvicorn workers creates N forks of the main process. Incoming requests are distributed among these N processes. Because the model call inside the function blocks the worker's event loop, requests that arrive while the slow task (generating ML output) from a previous request is still processing wait their turn. Each worker sequentially processes the requests assigned to it.
Using def runs each incoming request in a separate thread taken from a thread pool. The threads run in parallel and each processes its own request. Generating output from a machine learning model is a resource-intensive operation. Therefore, having many concurrent threads leads to resource contention and slows down the system.
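The difference between the two styles can be illustrated without a model or a web framework. In the sketch below, time.sleep stands in for the blocking model call, and asyncio.to_thread stands in for the way a def endpoint is handed to a worker thread. Both are assumptions made for illustration, and all names are hypothetical.

```python
import asyncio
import time

TASK_SECONDS = 0.2  # stand-in for the time one model call takes


def slow_inference() -> None:
    """Blocking work, like generating text with the pipeline."""
    time.sleep(TASK_SECONDS)


async def handler_blocking() -> None:
    """Mirrors an async def endpoint that calls blocking code directly."""
    slow_inference()


async def serve_on_event_loop(requests: int) -> float:
    """Run all requests on the event loop; blocking calls execute one by one."""
    start = time.perf_counter()
    await asyncio.gather(*(handler_blocking() for _ in range(requests)))
    return time.perf_counter() - start


async def serve_in_threads(requests: int) -> float:
    """Run each request in its own worker thread, like a def endpoint."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.to_thread(slow_inference) for _ in range(requests)))
    return time.perf_counter() - start


def compare(requests: int = 3) -> tuple[float, float]:
    """Return (event-loop seconds, threaded seconds) for the same workload."""
    return asyncio.run(serve_on_event_loop(requests)), asyncio.run(serve_in_threads(requests))
```

Calling compare() shows the event-loop variant taking about as long as all tasks added together, while the threaded variant overlaps them. Because sleeping threads do not compete for the GPU, this overstates the benefit for real inference, where concurrent threads contend for the same resources.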
Performance Testing
To get a better understanding, test the system performance using both async def and def, and different numbers of workers. To study performance differences, use a smaller model, such as GPT-Neo-125m. In this section, test how fast the API server responds to requests.
In your remote session, use curl, and verify how long the server takes to respond to a single HTTP POST request.
$ curl -o /dev/null -X 'POST' -w 'Total: %{time_total}s\n' 'http://localhost:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"text": "my name is "}'
The above command reports the total time taken to get a response from the server. Because the request is sent from the server itself, the result reflects the server's response time without the effect of network latency.
To test server responsiveness to concurrent requests, establish two more SSH sessions, and issue the above curl command from each session in quick succession. While the system is processing the requests, monitor the htop utility output to view the number of threads and CPU in use, and take note of the time taken for each request to complete.
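As an alternative to opening several SSH sessions, concurrent requests can be issued and timed from a single Python script. The sketch below uses only the standard library. The function names, the default URL, and the timeout are illustrative assumptions, and post_prompt expects the API server from this article to be running.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def post_prompt(url: str = "http://localhost:8080/generate", prompt: str = "my name is ") -> None:
    """Send one prompt to the API and discard the response body."""
    body = json.dumps({"text": prompt}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=1000) as response:
        response.read()


def timed_call(call: Callable[[], object]) -> float:
    """Run call() once and return how long it took, in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def time_concurrent_calls(call: Callable[[], object], concurrency: int) -> List[float]:
    """Start `concurrency` calls at the same time and return each duration."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, call) for _ in range(concurrency)]
        return [future.result() for future in futures]


# Example (requires the API server to be running):
# print(time_concurrent_calls(post_prompt, concurrency=3))
```

Comparing the returned durations across different worker counts, and between the async def and def variants of the endpoint, gives the same information as the manual curl test.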
Conclusion
In this article, you set up an API server to run inference on Hugging Face Transformer models, and built a text generation API from scratch using a Rcs Cloud GPU Server. You also made code changes to run any pre-trained or fine-tuned models on the server. When serving an API to a wide user base, make sure to adequately address performance, security, and load balancing concerns.
For more implementations, visit the following resources: