In this post we will review possible changes to LLM inference code that make it run faster and use less GPU memory.
LLM inference is the use of a trained model on new data to produce a classification or a prediction. This is usually the production-time usage of the model we've selected, and possibly fine-tuned, for the actual data stream.
The inference runs the following steps:
- Load the model into memory once, at process startup
- Receive an input, either a single item or preferably a batch of inputs
- Run the forward pass through the neural network and produce a result
The term LLM performance is actually used for two different subjects:
- The accuracy of the LLM, measured by metrics such as the false-positive and false-negative rates
- The GPU memory usage and runtime of the inference process
We will see later that while these are two different goals, they are actually intertwined.
Below is sample code for model inference, showing both the model loading and the inference call.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and model once, at process startup
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")

# Select the target device (GPU when available)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Build a text-classification pipeline around the model
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device(torch_device),
)

# Run inference on the input text
results = classifier(text)
# analyze the results
...
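Since the steps above recommend batching, it is worth noting that the pipeline call also accepts a list of texts together with an optional batch_size argument. The example texts and the batch size of 8 below are placeholders for illustration, not values from the original code:

# `texts` stands in for a batch of inputs; the pipeline splits it into
# batches of 8 and runs one forward pass per batch
texts = ["first input", "second input", "third input"]
results = classifier(texts, batch_size=8)
for item, result in zip(texts, results):
    print(item, result["label"], result["score"])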
Let's examine how we can improve the performance of the inference.
Compile The Model
The first and simplest change is to compile the model:
model = torch.compile(model)
This simple instruction can make inference up to 2 times faster!
The compile command converts the model's Python code into compiled code and optimizes the model's operations. Notice that the compile command should run only once, right after the model is loaded, and it has a small impact on the process startup time.
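As a minimal sketch, reusing the tokenizer, model, and torch_device from the snippet above (the warm-up string is an invented placeholder): the compile call sits right after loading, and a warm-up pass absorbs the one-time compilation cost, since torch.compile is lazy and does the heavy work on the first forward pass.

# Move the model to the device and compile it once, right after loading
model = model.to(torch_device)
model = torch.compile(model)

# Warm-up call: the first inference triggers the actual compilation and is
# slow; subsequent calls run at the optimized speed
inputs = tokenizer("warm-up text", return_tensors="pt",
                   truncation=True, max_length=512).to(torch_device)
with torch.no_grad():
    _ = model(**inputs)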
For the compile() instruction to work we need to make sure that both python-dev and g++ are installed. An example of this in a Dockerfile is:
RUN apt-get update && \
apt-get install -y software-properties-common curl wget build-essential g++ && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y python3.12 python3.12-dev && \
/usr/bin/python3.12 --version
Change The Model Precision
By default the model uses float32 precision, which means full accuracy in the neural network calculations. In most cases float16 precision does the job just as well while consuming roughly half the time and half the memory. The conversion from float32 to float16 is done by calling half():
model = model.to('cuda').half()
(Notice that model.half() should be called BEFORE torch.compile())
Using half() might cause a small increase in the false-positive and false-negative rates, but in most cases it is negligible.
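To make the ordering concrete, here is a minimal sketch that reuses the model loaded earlier and applies the two calls in the required order:

# Order matters: move to the GPU and convert to float16 first...
model = model.to('cuda').half()
# ...and only then compile the float16 model
model = torch.compile(model)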
As a side note, we should mention that the precision can also be reduced to int8 or int4 (quantization), but this is a less common practice.
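For illustration only, here is a hedged sketch of loading the same model with int8 weights through the transformers quantization support. It assumes the optional bitsandbytes and accelerate packages are installed, which the setup above does not include, and it has not been verified for this specific model:

from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Load the weights in int8 (requires bitsandbytes); device_map="auto"
# lets accelerate place the quantized layers on the available GPU
model = AutoModelForSequenceClassification.from_pretrained(
    "ProtectAI/deberta-v3-base-prompt-injection-v2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)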
Final Note
We have reviewed methods for improving the memory footprint and runtime of LLM inference. While there are some small implications for accuracy, these methods should be common practice in any LLM implementation.