
Monday, December 1, 2025

Subscribing To Microsoft Copilot Events


 

In this post we will review how to subscribe to Microsoft Copilot events.

Notice that Microsoft Copilot uses a totally different mechanism than the Microsoft Copilot Studio agents; for the latter, see the Microsoft External Threat Detection post.


Create Encryption Key

We start by creating an asymmetric encryption key:


# Create a private key and a self-signed certificate valid for one year
openssl genrsa -out private.key 2048
openssl req -new -x509 -key private.key -out publicCert.cer -days 365

# Base64-encode the certificate and join the lines into a single line
base64 publicCert.cer > publicCertBase64.txt
awk 'NF {printf "%s", $0}' publicCertBase64.txt > cert_clean.txt


Create App Registration

Use Microsoft Entra to create a new App Registration with the permission AiEnterpriseInteraction.Read.All. Notice this permission is under the "Microsoft Graph" section.

After adding the permission to the App Registration, click the Grant Admin Consent button.

We also add a client secret, allowing us to use the App Registration from a script. As far as I could see, there is no GUI available for this, and we must use a script.


Create a Service

To supply a subscription endpoint that Microsoft will send the messages to, create a publicly available service using a valid TLS certificate. For example, the endpoint can be:

https://my-site.com/interactions

Notice this endpoint should accept both GET and POST requests.

A very simple example of such an endpoint is below.

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("interactions starting")
    validation := p.QueryParam("validationToken")
    log.Info("token: %v", validation)

    data, err := p.GetBodyAsBytes()
    if err != nil {
        kiterr.RaiseIfError(err)
    }
    log.Info("body: %v", string(data))

    p.SetHeader("Content-Type", "text/plain")
    p.WriteStreamingResponse([]byte(validation))

    return nil
}
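If you are not using the Go web framework above, the same handshake can be sketched with only the Python standard library. Microsoft first calls the endpoint with a validationToken query parameter and expects it echoed back as text/plain; afterwards, notifications arrive as POST bodies. The port and path below are arbitrary placeholders.

```python
# Minimal sketch of the Microsoft Graph subscription validation handshake:
# echo the validationToken query parameter back as text/plain.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


class InteractionsHandler(BaseHTTPRequestHandler):
    def _echo_token(self):
        query = parse_qs(urlparse(self.path).query)
        token = query.get("validationToken", [""])[0]
        body = token.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._echo_token()

    def do_POST(self):
        # Real notifications arrive here as a JSON body; log it, then
        # still echo the (possibly empty) validation token.
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        print("body:", data.decode("utf-8", errors="replace"))
        self._echo_token()


# To serve for real (Graph requires a public HTTPS endpoint):
#     HTTPServer(("0.0.0.0", 8080), InteractionsHandler).serve_forever()
```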

Call the Subscribe API

Use the following to subscribe to events:

#!/bin/bash

TENANT_ID="12345678-1234-1234-1234-123456789012"
CLIENT_ID="12345678-1234-1234-1234-123456789012"
CLIENT_SECRET="abcdefghijklmnopqrstuvwxyz1234567890abcd"

request_token(){
    SCOPE=graph.microsoft.com
    curl -s -X POST "https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token" \
        -H "Content-Type: application/x-www-form-urlencoded" \
        -d "client_id=$CLIENT_ID&scope=https%3A%2F%2F${SCOPE}%2F.default&client_secret=$CLIENT_SECRET&grant_type=client_credentials" \
        | jq -r '.access_token'
}

request_subscription(){
    curl -H "Authorization: Bearer $ACCESS_TOKEN" \
        -H "Content-Type: application/json" \
        -d '
{
"changeType": "created,deleted,updated",
"notificationUrl": "https://my-site.com/interactions",
"resource": "/copilot/interactionHistory/getAllEnterpriseInteractions",
"includeResourceData": true,
"encryptionCertificate": "LS0tLS1CRUdJTiBDRjhgjkhgkjhgkjhJKJHKJLHLKJHLKJHKJLlkjhlkjhlkjhlkjhlkjhkojghlkjhlkjhlkjhlkjhlkjKJLHLKJHLKJHLKJH8769869876IUGHKLJHKJLYH876Y87H78YH87BN87HKJBJKHGKJLGKJLHlkjhlkjhkljhkjhlkjhlhjkEdVeElUQWZCZ05WQkFvTQpHRWx1ZEdWeWJtVjBJRmRwWkdkcGRITWdVSFI1SUV4MFpEQWVGdzB5TlRFeE1qY3hNakkzTURoYUZ3MHlOakV4Ck1qY3hNakkzTURoYU1FVXhDekFKQmdOVkJBWVRBa2xNTVJNd0VRWURWUVFJREFwVGIyMWxMVk4wWVhSbE1TRXcKSHdZRFZRUUtEQmhKYm5SbGNtNWxkQ0JYYVdSbmFYUnpJRkIwZVNCTWRHUXdnZ0VpTUEwR0NTcUdTSWIzRFFFQgpBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRRFB4Ny8wVzc4N0NLUUh0dHMyVDBoL25LZ0o1ejArb1ZHeFFzcFhSWnlnCnBuanpETkdqUjBtWGFVU2RTZ2JWNW05MDMrNnhqbS9LbHpuTlltOTdoUjJNcnBFSXd1OVVYaWhxU1FTS1ZVcTkKbDk0OVEzME5PK29lT0Z4K3huOC9ycGFMVmpxUzIzR3VUV09Ka3p2aktPeXVnV1BRN3FBazgrdjQ3NjdVUkVvYQpJV2l3aXBIVW4rajBMOTVDTEtFOUZQUXdLMkUzNnZrdWNzd1krSGh5bm45N1piSGszVUM3NXd1QlYwTWVyT0o2CjVQdTFYQUVPZ2JnSFFVUEhuVkViT05MdkNwSUl1MHZlZDZFZmRQbVlzTk1IK2xHSlBOZnFOemRYSEZYSXE4VWMKbHdjbDlPRllUb0dMSEdHWTJiRWpzNWxFUjN1OWtLNFlvc1llUFc2ZmJ3NHhBZ01CQUFHalV6QlJNQjBHQTFVZApEZ1FXQkJUbW45UTBBcmFtVFNTK0phbWtIbzR3eVVVSDd6QWZCZ05WSFNNRUdEQVdnQlRtbjlRMEFyYW1UU1MrCkphbWtIbzR3eVVVSDd6QVBCZ05WSFJNQkFmOEVCVEFEQVFIL01BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRREwKRVh2cnUxb0NKNXlERVc2Njc3RlRuQWt5bitheWJqQXBaVmRiRi9vMXZyZWZKWHVBVzdnZ09WZjBrT2xCN2U0WgoyQW0rUnU1bmNiRXdBN0o0L2N0WWlLdVByLzA4U0NjTnp6ZGp6RG9qem5wL1ZadnRiYXo5NGlVOE52YmRyWXBkCkVnb1o1RVk3YzZpQW9JNDlGK2ZNOGZLR3FrL09oVDA0dUNuWk1SUFpFR0lob1dBR1J0ODg1R1VXcVNEdzJDYVAKT3F6eU5WeS8vMFpWQm40dTBER3VjQjVLVkp0Smh0MUNrRTlzeXJGV3IrSTFxTkltMkZoN3pyR1diSWRPL2gvMgpIOEFKY0xEM3QvdzNuZGUrdWl3dnFMbTVhUTcwS0k4Q2ZoZk5Mam9WcmUxTFMwK1ZxRjNlOEl6cXFtSEFQLytJCjk0aDFsOEMreVU5MHFxa3E4OFE5Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K",
"encryptionCertificateId": "my-id",
"expirationDateTime": "2025-11-27T14:00:00.0000000Z",
"clientState": "my-state"
}
' \
        "https://graph.microsoft.com/v1.0/subscriptions"
}

ACCESS_TOKEN=$(request_token)
request_subscription

Notice the subscription expiration is limited to at most 2 days in the future, and the subscription should be renewed to continue receiving events.
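Since the expiration is capped, a renewal job has to PATCH the subscription periodically with a new expirationDateTime. A minimal Python sketch, assuming the subscription id was saved from the create response and the access token is obtained as in the script above (urllib is used to stay dependency-free):

```python
# Renew a Microsoft Graph subscription by pushing its expiration forward.
# subscription_id and access_token are placeholders from the create step.
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def renewal_payload(hours: int = 24) -> dict:
    """Build the PATCH body with a new expiration inside the allowed window."""
    expiration = datetime.now(timezone.utc) + timedelta(hours=hours)
    return {"expirationDateTime": expiration.strftime("%Y-%m-%dT%H:%M:%S.0000000Z")}


def renew(subscription_id: str, access_token: str) -> None:
    request = urllib.request.Request(
        f"https://graph.microsoft.com/v1.0/subscriptions/{subscription_id}",
        data=json.dumps(renewal_payload()).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
    urllib.request.urlopen(request)
```

Run this (for example from cron) well before the current expiration, so a transient failure still leaves time for a retry.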


The encryptionCertificate value is the content of the cert_clean.txt file that we've created earlier.


Decryption of the Messages

A simple bash script to handle the encrypted parts of the messages is below.


dataKey_base64=$(jq -r '.value[].encryptedContent.dataKey' event/event.json)
encrypted_data_base64=$(jq -r '.value[].encryptedContent.data' event/event.json)
dataSignature_base64=$(jq -r '.value[].encryptedContent.dataSignature' event/event.json)

# Decode the base64-encoded symmetric key
echo "$dataKey_base64" | base64 --decode > encrypted_key.bin

# Decrypt the symmetric key using your RSA private key with OAEP padding
openssl pkeyutl -decrypt -inkey key/private.key -pkeyopt rsa_padding_mode:oaep -in encrypted_key.bin -out symmetric_key.bin

# Extract first 16 bytes of symmetric key as IV (hex)
iv=$(xxd -p -l 16 symmetric_key.bin)

# Decode encrypted data
echo "$encrypted_data_base64" | base64 --decode > encrypted_data.bin

# Decrypt using AES-CBC with PKCS7 padding
openssl enc -aes-256-cbc -d -in encrypted_data.bin -out decrypted_data.json \
-K $(xxd -p -c 256 symmetric_key.bin) \
-iv "$iv"
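The dataSignature field extracted above is left unverified by the script. According to the Microsoft Graph change-notification documentation, it is the base64-encoded HMAC-SHA256 of the encrypted data bytes, keyed with the decrypted symmetric key; a Python sketch of the check:

```python
# Verify a Graph change-notification signature: HMAC-SHA256 of the
# encrypted payload, keyed with the decrypted symmetric key.
import base64
import hashlib
import hmac


def signature_is_valid(symmetric_key: bytes,
                       encrypted_data_b64: str,
                       data_signature_b64: str) -> bool:
    encrypted = base64.b64decode(encrypted_data_b64)
    digest = hmac.new(symmetric_key, encrypted, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode("ascii")
    # Constant-time comparison against the signature from the notification
    return hmac.compare_digest(expected, data_signature_b64)
```

Performing this check before decrypting protects the endpoint from spending CPU on forged payloads.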


Final Note

As expected for a Microsoft API this is a complicated method to get data. 


Why is double encryption of the messages required? We are already using TLS.

Why can't we subscribe forever?


Anyway, eventually it works and can be used to store the agent interactions.
Have fun.

Monday, November 24, 2025

CI/CD in a Shell


 

Recently I had to create a CI/CD for a new project whose source repository was in Bitbucket. There are standard methods to handle this, using triggers from Bitbucket, AWS CodeBuild, and AWS CodePipeline. However, I had only read permissions on the Bitbucket repository, and hence was limited in my ability to use the standard tools. I decided to create the CI/CD in bash, and surprisingly I found it extremely simple, as well as cheaper and faster than the standard tools. I am aware of the downsides of using scripts for such processes, such as lack of visibility, redundancy, and standards, but still the result was so good that I think startup projects should definitely consider it.

Listed below are the shell-based CI/CD components.


The Poll Script

The poll script runs on a t3a.nano EC2 instance whose price is ~$3/month.

It polls the Bitbucket repository every 5 minutes, and once a change on the deployment-related branch is detected, it starts the builder EC2 VM and runs the build and deploy script.

#!/bin/bash

set -eE

instanceId=""
publicIp=""
intervalSeconds=300

cleanup() {
    if [ -n "${instanceId}" ]; then
        echo "Stopping instance: ${instanceId}"
        if ! aws ec2 stop-instances --instance-ids "${instanceId}"; then
            echo "Warning: Failed to stop instance ${instanceId}. Will retry on next run."
        else
            echo "Instance stopped successfully."
        fi
        instanceId=""
    fi
}

restart_script() {
    echo "Command '$BASH_COMMAND' failed with exit code $?"
    cleanup
    echo "Restarting soon..."
    sleep ${intervalSeconds}
    exec "$0" "$@"
}

trap 'restart_script "$@"' ERR


runBuild(){
    trap cleanup RETURN

    instanceId=$(aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=my-builder-vm" \
        --query "Reservations[*].Instances[*].InstanceId" \
        --output text)

    echo "Starting instance: ${instanceId}"
    aws ec2 start-instances --instance-ids ${instanceId}

    echo "Waiting for instance to be in 'running' state..."
    aws ec2 wait instance-running --instance-ids ${instanceId}

    publicIp=$(aws ec2 describe-instances \
        --instance-ids ${instanceId} \
        --query "Reservations[0].Instances[0].PublicIpAddress" \
        --output text)

    echo "Running build remote"
    ssh -o StrictHostKeyChecking=no ec2-user@${publicIp} /home/ec2-user/build/my-repo/deploy/aws/production/deploy.sh

    cleanup
    echo "Build done"
}

checkOnce(){
    echo "Check run time: $(date)"
    commitFilePath=/home/ec2-user/build/last_commit.txt
    latestCommit=$(git ls-remote git@bitbucket.org:my-project/my-repo.git my-deploy-branch | awk '{print $1}')
    echo "Latest commit: ${latestCommit}"

    lastCommit=$(cat ${commitFilePath} 2>/dev/null || echo "")
    echo "Last deployed: ${lastCommit}"

    if [ "${latestCommit}" != "${lastCommit}" ]; then
        echo "New commit detected, starting build"
        runBuild
        echo "${latestCommit}" > ${commitFilePath}
        echo "last commit updated"
    else
        echo "No new commits"
    fi
}

while true; do
    checkOnce
    sleep ${intervalSeconds}
done


To make this script part of the poller VM instance startup, use the following:


sudo tee /etc/systemd/system/poll.service > /dev/null <<EOF
[Unit]
Description=Poll Script Startup
After=network.target

[Service]
Type=simple
ExecStart=/home/ec2-user/build/poll.sh
Restart=on-failure
User=ec2-user
WorkingDirectory=/home/ec2-user/build
StandardOutput=append:/home/ec2-user/build/output.txt
StandardError=append:/home/ec2-user/build/output.txt

[Install]
WantedBy=multi-user.target
EOF


sudo systemctl daemon-reload
sudo systemctl enable poll.service # auto-start on boot
sudo systemctl start poll.service # start immediately


The Build Script - Step 1

The build script runs on a c6i.4xlarge EC2 instance whose price is ~$500/month, but this is not a concern, since this EC2 instance runs only during the deployment itself, so the actual cost is very low here as well.


The script runs on the repository itself, which I've manually cloned once after the EC2 creation. It only pulls the latest version and runs another "step 2" script to handle the build. The goal is to be able to pick up changes to the "step 2" script as part of the git pull.


#!/bin/bash
set -e

cd /home/ec2-user/build/my-repo
git checkout my-deploy-branch
git pull

./deploy_step2.sh


The Build Script - Step 2

The "step 2" script does the actual work:

  1. Increments the build number
  2. Builds the docker images
  3. Logs in to the ECR
  4. Pushes the images to the ECR
  5. Pushes a new tag to Git
  6. Uses `helm upgrade` to upgrade the production deployment


Notice that the EC2 instance uses a role that enables it to access the ECR and the EKS without a user and password, for example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:CompleteLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:InitiateLayerUpload",
                "ecr:PutImage"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "eks:DescribeCluster"
            ],
            "Resource": "*"
        }
    ]
}


The script is:

#!/bin/bash
set -e

export AWS_ACCOUNT=123456789012
export AWS_REGION=us-east-1
export AWS_DEFAULT_REGION=${AWS_REGION}
export EKS_CLUSTER_NAME=my-eks

rootFolder=/home/ec2-user/build
buildVersionFile=${rootFolder}/build_number.txt

if [[ -f "${buildVersionFile}" ]]; then
    lastBuildNumber=$(cat "${buildVersionFile}")
else
    lastBuildNumber=1000
fi
newBuildNumber=$((lastBuildNumber + 1))
echo "${newBuildNumber}" > ${buildVersionFile}
echo "Build number updated to: ${newBuildNumber}"

./build_my_images.sh

aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com
RemoteTag=deploy-${newBuildNumber} ./push_images_to_ec2.sh

newTag=deploy-${newBuildNumber}
git tag ${newTag}
git push origin ${newTag}

DEPLOY_VERSION=":${newTag}" ./helm_deploy.sh

echo "I did it again!"


Final Note

This build system is super fast. Why? Because it uses the local cache for the docker images. This means we do not require a docker proxy to cache the images, which also makes it cheap.

To sum up: don't use this for a big project, but you can definitely use it for a startup.




Monday, November 10, 2025

Microsoft External Threat Detection


 


In this post we review the steps to create an external security provider to protect Microsoft Copilot Studio based agents.

Most of this post is based on this article.

Before starting, prepare yourself. Following Microsoft best practice, they've made it a super complex process, but in the end it is working, so that's good.


Provide a service

We start by implementing a service following this guide.

In general this service should provide 2 endpoints: /validate and /analyze-tool-execution.

The /validate endpoint is used only to check the service health and the integration with Microsoft authentication. For this post we will not implement the Microsoft authentication validation. Hence a simple implementation of /validate is:



type ResponseSuccess struct {
    IsSuccessful bool   `json:"isSuccessful"`
    Status       string `json:"status"`
}

type Executor struct {
}

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("validate starting")
    auth := p.GetHeader("Authorization")
    log.Info("auth: %v", auth)
    log.Info("validate done")
    return &ResponseSuccess{
        IsSuccessful: true,
        Status:       "OK",
    }
}



The /analyze-tool-execution endpoint is activated at each step before the copilot agent invokes any action, and should approve or reject the action within 1 second (good luck with that). A simple example implementation is:



type ResponseAllow struct {
    BlockAction bool `json:"blockAction"`
}

type Executor struct {
}

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("analyze tool execution starting")

    inputBytes, err := p.GetBodyAsBytes()
    kiterr.RaiseIfError(err)

    auth := p.GetHeader("Authorization")
    log.Info("auth: %v", auth)
    tenantId := kitjwt.GetJwtValue(auth, "tid")
    applicationRegistrationId := kitjwt.GetJwtValue(auth, "appid")

    log.Info("tenantId: %v", tenantId)
    log.Info("applicationRegistrationId: %v", applicationRegistrationId)
    log.Info("action description: %v", string(inputBytes))

    log.Info("analyze tool execution done")
    return &ResponseAllow{
        BlockAction: false,
    }
}

Once the service is implemented, deploy it and provide it with a valid TLS certificate. For the rest of this post we assume it is available in https://external.provider.com.


Register the Domain

Once the service is ready we need to register the domain in entra.microsoft.com.




Notice that as part of the process Microsoft requires you to prove that you own the domain, so you need to add a TXT record to the DNS server with a value specified by Microsoft.


App Registration

Create a new App Registration in entra.microsoft.com.
Then edit the App Registration and under "Expose an API" add the URL https://external.provider.com.

Next, edit the App Registration and under Certificates & secrets, select the Federated credentials tab, and add a new credential.
Scenario: Other
Issuer: https://login.microsoftonline.com/55fb1683-57de-46d1-8896-f9f3b07b549f/v2.0
Type: Explicit

To get the "Value" you need to run the following script:

# YOUR TENANT ID HERE
$guid = [Guid]::Parse("55fb1683-57de-46d1-8896-xxxxxxxx")
$base64Url = [Convert]::ToBase64String($guid.ToByteArray()).Replace('+','-').Replace('/','_').TrimEnd('=')
Write-Output $base64Url

# YOUR ENDPOINT ID HERE
$endpoint = "https://external.provider.com/analyze-tool-execution"
$base64Url = [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($endpoint)).Replace('+','-').Replace('/','_').TrimEnd('=')
Write-Output $base64Url


This script outputs 2 values; use them to create the following:

/eid1/c/pub/t/FIRST_LINE_OUTPUT/a/m1WPnYRZpEaQKq1Cceg--g/SECOND_LINE_OUTPUT
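If PowerShell is not at hand, the same two values can be computed in Python. Note that .NET's Guid.ToByteArray() produces the mixed-endian byte layout, which corresponds to uuid.bytes_le in Python; the tenant GUID and endpoint below are the examples used in this post.

```python
# Compute the two base64url values used in the federated credential "Value".
import base64
import uuid


def b64url(raw: bytes) -> str:
    """Base64url-encode and strip the padding, as the PowerShell script does."""
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


tenant = uuid.UUID("55fb1683-57de-46d1-8896-f9f3b07b549f")
endpoint = "https://external.provider.com/analyze-tool-execution"

# bytes_le matches .NET Guid.ToByteArray() (mixed-endian layout)
print(b64url(tenant.bytes_le))           # first value
print(b64url(endpoint.encode("utf-8")))  # second value
```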
 

Enable The Threat Detection

In https://admin.powerplatform.microsoft.com enable the threat detection.




Final Note

As promised, the super complex process is now done, and agent-related events start streaming into the service, which can approve or block them.


Monday, November 3, 2025

Microsoft AI Agent

 

In this post we will create an agent using Microsoft Copilot Studio.


Disclaimer About Microsoft

I've known Microsoft for more than 30 years. It used to be a monopoly company with good products, but over time they lost their path and vision. What's left of it is a monopoly company with bad products. Still, as a monopoly, Microsoft can force the market to use their new products even if they're bad and expensive. A good example of this is the creation of an AI agent using Microsoft Copilot Studio.


License

Unlike other providers, to work with Microsoft products you need a license, regardless of the usage amount. This license is extremely expensive compared with other providers. In case you only want to check the abilities, and are not willing to pay yet, check if you're eligible for the Microsoft E5 developer program.


Copilot Studio

Open the Microsoft Copilot Studio site, select Agents, and create a new agent.



Now we can use an AI to create our agent by simply describing the agent, or we can configure it manually.






Once we click on Create agent, the agent is ready, and we can test it.



We can also configure an MCP server to be used under the tools section; for example, I've used the public Docusign MCP server.






Once we're done, we can publish the agent, and since Microsoft is a monopoly company where Office is the standard application for almost all companies, you can place the agent in an easily accessed location such as Microsoft Teams.


Purview Audit

A must-have feature of an agent service is the ability to track the chats, and tune the agent configuration to make it more useful. To track the chats we can use Microsoft Purview.

In the Microsoft Purview site, select Solutions, Audit, and then run a new search filtered by time, and optionally filtered by RecordType=CopilotInteraction.





After some time, between 5 minutes and 5 hours (it is Microsoft, so have no high expectations), you will get a search results record.




Final Note

We've seen how to use Microsoft products to create an agent and to track the agent's actions. There are other related products that should be used to get a complete solution for agent creation, such as Microsoft PowerApps, Microsoft PowerApps Admin, and Microsoft Dataverse. And yes, you will need to pay for each one of these regardless of the usage amount. Have fun.



Tuesday, October 21, 2025

AWS Bedrock Agent



In this post we show the creation of an agent in AWS Bedrock and a simple text chat application using the Python boto3 library.


Create Agent in AWS Console


First open the AWS console, navigate to the AWS Bedrock service, and click on Agents.



Click on Create Agent, enter a name for it, and click Create.

For our demo, we switch to the cheapest available model:



Fill in instructions for the agent:

Now click on Save and Exit.

Click on Prepare to create a draft of the agent:



We can now test that the agent is working as expected, or else update the agent instructions.



Now we create an alias for the agent, which is a published version of the agent.




Chat using python boto3

Use the following code to run a simple chat with the agent.

import uuid

import boto3

REGION = "us-east-1"
AGENT_ID = "AB5BQ7PVAL"
AGENT_ALIAS_ID = "9BII1XBJP9"

client = boto3.client('bedrock-agent-runtime', region_name=REGION)

session_id = str(uuid.uuid4())

print("Chat Starting")

while True:
    user_input = input("Enter prompt:")
    if user_input.lower() in ['exit', 'quit']:
        print("Bye")
        break

    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=user_input,
    )

    response_body = response['completion']

    print("Answer:", end=" ", flush=True)
    for chunk in response_body:
        if 'chunk' in chunk:
            print(chunk['chunk']['bytes'].decode("utf-8"), end="", flush=True)
    print()


An example of this chat is below.



Final Note

This is a very simple example of AWS Bedrock agent activation. Other agent features include guardrails, agent memory across sessions, and multi-agent configuration.

Best practices for agents creation can be found here.


Sunday, October 12, 2025

GO Embed

 



In this post we review the GO embed annotation and its implications.


Sample Code


package main

import (
    "embed"
    _ "embed"
    "fmt"
)

//go:embed hello.txt
var textFile string

//go:embed hello.txt
var binaryFile []byte

//go:embed data1 data2
var files embed.FS

func main() {
    fmt.Printf("%v bytes:\n%v\n", len(binaryFile), textFile)

    entries, err := files.ReadDir(".")
    if err != nil {
        panic(err)
    }
    for _, entry := range entries {
        fmt.Printf("%v dir: %v\n", entry.Name(), entry.IsDir())
    }
}


Implications

The GO embed is a simple way of adding files as part of the GO compiled output binary. It serves as an alternative to making these files available to the application in another manner, such as supplying the files as part of a docker image, or mounting the files using a kubernetes ConfigMap.

Notice the files are added as part of the binary, so embedding large files means a larger output binary.


Embed Methods

There are 3 methods to embed a file.

First, we can embed a file as a string. In such a case we should add the explicit embed import:

_ "embed"


Second, we can embed the file as a byte array; this is very similar to the first method.


Third, we can include a set of folders as a virtual file system. The annotation includes the list of folders to be included. There is special handling for files starting with a dot; see more about this here.


Final Note

While embed is a simple way to add files, it should be used only if we're sure we will not want to change the files in an active running deployment.






Sunday, October 5, 2025

How To Improve LLM Inference Performance


 

In this post we will review possible changes to LLM inference code to make it run faster and use less GPU memory.


LLM inference is the usage of a trained model on new data to produce a classification or a prediction. This is usually the production-time usage of the model we've selected and possibly fine-tuned for the actual data stream.


The inference runs the following steps:

  1. Load the model to the memory once in the process startup
  2. Get an input of a single or preferably a batch of inputs 
  3. Forward calculation on the neural network and produce a result

The term LLM performance is actually used for two different subjects:

  1. The precision of the LLM, such as the false-positive and false-negative rates
  2. The GPU memory usage and the time of the inference process

We will see later that while these are two different goals, they are actually intertwined.

Below is a sample code of model inference where we can see the model loading and the model inference.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device(torch_device),
)

results = classifier(text)

# analyze the results
...


Let's examine how we can improve the performance of the inference.

Compile The Model

The first and simplest change is to compile the model:

model = torch.compile(model)

This simple instruction speeds up the inference by up to a factor of 2!

The compile command converts the model's Python code to compiled code and optimizes the model operations. Notice that the compile command should run only once, right after the model is loaded, and it has a small impact on the process startup time.

For the compile() instruction we will need to make sure we have both python-dev and g++ installed. An example of this in a Dockerfile is:

RUN apt-get update && \
    apt-get install -y software-properties-common curl wget build-essential g++ && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.12 python3.12-dev && \
    /usr/bin/python3.12 --version



Change The Model Precision

By default the model uses float32 precision, which means it has full accuracy in the neural network calculations. In most cases using float16 precision will do the work just as well, and will consume half of the time and half of the memory. The change from float32 to float16 is called half().


model = model.to('cuda').half()

(Notice that model.half() should be called BEFORE torch.compile())


Using half() might cause a small increase in the false-positive and false-negative rates, but in most cases it is negligible.


As a side note, we should mention that we can set the precision to int8 or int4, but this is a less common practice. For the record, here is an analysis of the alternatives from GPT:
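To see why lower precision halves the memory, note that the weight footprint is simply the parameter count times the bytes per parameter. A quick estimate, assuming roughly 184M parameters for deberta-v3-base (weights only; activations, optimizer state and framework overhead are not included):

```python
# Rough weight-memory estimate per precision: params * bytes_per_param.
# 184M parameters is an approximation for deberta-v3-base.
PARAMS = 184_000_000

BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_mb(precision: str, params: int = PARAMS) -> float:
    return params * BYTES_PER_PARAM[precision] / (1024 ** 2)


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_mb(precision):.0f} MB")
```

So moving from float32 to float16 drops the weights from roughly 700 MB to roughly 350 MB, and int8/int4 continue halving from there.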




Final Note

We have reviewed methods of improving the memory footprint and runtime of an LLM. While there are some small implications on the accuracy, these methods should be used as a common practice for any LLM implementation.