
Monday, December 1, 2025

Subscribing To Microsoft Copilot Events


 

In this post we will review how to subscribe to Microsoft Copilot events. 

Notice that Microsoft Copilot uses a completely different mechanism than the Microsoft Copilot Studio agents; see the Microsoft External Threat Detection post.


Create Encryption Key

We start by creating an asymmetric encryption key and a self-signed certificate:


openssl genrsa -out private.key 2048
openssl req -new -x509 -key private.key -out publicCert.cer -days 365
base64 publicCert.cer > publicCertBase64.txt
awk 'NF {printf "%s", $0}' publicCertBase64.txt > cert_clean.txt
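
Optionally, we can sanity-check the generated certificate before using it:

openssl x509 -in publicCert.cer -noout -subject -dates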


Create App Registration

Use Microsoft Entra to create a new App Registration with the permission AiEnterpriseInteraction.Read.All. Notice this permission is under the "Microsoft Graph" section.

After adding the permission to the App Registration, click the Grant admin consent button.

We also add a client secret so the App Registration can be used from a script. As far as I could see, there is no GUI available for creating this subscription, so we must use a script.


Create a Service

To supply a subscription endpoint that Microsoft will send the notifications to, create a publicly available service with a valid TLS certificate. For example, the endpoint can be:

https://my-site.com/interactions

Notice this endpoint should accept both GET and POST requests.

A very simple example of such an endpoint is below.

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("interactions starting")

    // the subscription validation handshake sends a validationToken query parameter
    validation := p.QueryParam("validationToken")
    log.Info("token: %v", validation)

    // actual notifications arrive in the POST body
    data, err := p.GetBodyAsBytes()
    kiterr.RaiseIfError(err)
    log.Info("body: %v", string(data))

    // echo the validation token back as plain text to complete the handshake
    p.SetHeader("Content-Type", "text/plain")
    p.WriteStreamingResponse([]byte(validation))

    return nil
}
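
Assuming the handler above is deployed, the validation handshake can be simulated with a simple curl call (my-site.com is the placeholder used later in the post):

# Graph sends a validation request with a validationToken query parameter;
# the endpoint must echo the token back as text/plain
curl "https://my-site.com/interactions?validationToken=test123"
# expected response body: test123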

Call the Subscribe API

Use the following to subscribe to events:

#!/bin/bash

TENANT_ID="12345678-1234-1234-1234-123456789012"
CLIENT_ID="12345678-1234-1234-1234-123456789012"
CLIENT_SECRET="abcdefghijklmnopqrstuvwxyz1234567890abcd"

request_token(){
SCOPE=graph.microsoft.com
curl -s -X POST "https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "client_id=$CLIENT_ID&scope=https%3A%2F%2F${SCOPE}%2F.default&client_secret=$CLIENT_SECRET&grant_type=client_credentials" \
| jq -r '.access_token'
}

request_subscription(){
curl -H "Authorization: Bearer $ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '
{
"changeType": "created,deleted,updated",
"notificationUrl": "https://my-site.com/interactions",
"resource": "/copilot/interactionHistory/getAllEnterpriseInteractions",
"includeResourceData": true,
"encryptionCertificate": "LS0tLS1CRUdJTiBDRjhgjkhgkjhgkjhJKJHKJLHLKJHLKJHKJLlkjhlkjhlkjhlkjhlkjhkojghlkjhlkjhlkjhlkjhlkjKJLHLKJHLKJHLKJH8769869876IUGHKLJHKJLYH876Y87H78YH87BN87HKJBJKHGKJLGKJLHlkjhlkjhkljhkjhlkjhlhjkEdVeElUQWZCZ05WQkFvTQpHRWx1ZEdWeWJtVjBJRmRwWkdkcGRITWdVSFI1SUV4MFpEQWVGdzB5TlRFeE1qY3hNakkzTURoYUZ3MHlOakV4Ck1qY3hNakkzTURoYU1FVXhDekFKQmdOVkJBWVRBa2xNTVJNd0VRWURWUVFJREFwVGIyMWxMVk4wWVhSbE1TRXcKSHdZRFZRUUtEQmhKYm5SbGNtNWxkQ0JYYVdSbmFYUnpJRkIwZVNCTWRHUXdnZ0VpTUEwR0NTcUdTSWIzRFFFQgpBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRRFB4Ny8wVzc4N0NLUUh0dHMyVDBoL25LZ0o1ejArb1ZHeFFzcFhSWnlnCnBuanpETkdqUjBtWGFVU2RTZ2JWNW05MDMrNnhqbS9LbHpuTlltOTdoUjJNcnBFSXd1OVVYaWhxU1FTS1ZVcTkKbDk0OVEzME5PK29lT0Z4K3huOC9ycGFMVmpxUzIzR3VUV09Ka3p2aktPeXVnV1BRN3FBazgrdjQ3NjdVUkVvYQpJV2l3aXBIVW4rajBMOTVDTEtFOUZQUXdLMkUzNnZrdWNzd1krSGh5bm45N1piSGszVUM3NXd1QlYwTWVyT0o2CjVQdTFYQUVPZ2JnSFFVUEhuVkViT05MdkNwSUl1MHZlZDZFZmRQbVlzTk1IK2xHSlBOZnFOemRYSEZYSXE4VWMKbHdjbDlPRllUb0dMSEdHWTJiRWpzNWxFUjN1OWtLNFlvc1llUFc2ZmJ3NHhBZ01CQUFHalV6QlJNQjBHQTFVZApEZ1FXQkJUbW45UTBBcmFtVFNTK0phbWtIbzR3eVVVSDd6QWZCZ05WSFNNRUdEQVdnQlRtbjlRMEFyYW1UU1MrCkphbWtIbzR3eVVVSDd6QVBCZ05WSFJNQkFmOEVCVEFEQVFIL01BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRREwKRVh2cnUxb0NKNXlERVc2Njc3RlRuQWt5bitheWJqQXBaVmRiRi9vMXZyZWZKWHVBVzdnZ09WZjBrT2xCN2U0WgoyQW0rUnU1bmNiRXdBN0o0L2N0WWlLdVByLzA4U0NjTnp6ZGp6RG9qem5wL1ZadnRiYXo5NGlVOE52YmRyWXBkCkVnb1o1RVk3YzZpQW9JNDlGK2ZNOGZLR3FrL09oVDA0dUNuWk1SUFpFR0lob1dBR1J0ODg1R1VXcVNEdzJDYVAKT3F6eU5WeS8vMFpWQm40dTBER3VjQjVLVkp0Smh0MUNrRTlzeXJGV3IrSTFxTkltMkZoN3pyR1diSWRPL2gvMgpIOEFKY0xEM3QvdzNuZGUrdWl3dnFMbTVhUTcwS0k4Q2ZoZk5Mam9WcmUxTFMwK1ZxRjNlOEl6cXFtSEFQLytJCjk0aDFsOEMreVU5MHFxa3E4OFE5Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K",
"encryptionCertificateId": "my-id",
"expirationDateTime": "2025-11-27T14:00:00.0000000Z",
"clientState": "my-state"
}
' \
"https://graph.microsoft.com/v1.0/subscriptions"
}

ACCESS_TOKEN=$(request_token)
request_subscription

Notice that the subscription expiration is limited to at most 2 days in the future, and the subscription must be renewed to keep receiving events.
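
Renewal is done with a PATCH request on the subscription; a minimal sketch, assuming the same access token and using a placeholder subscription id (the real id is returned by the create call):

SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"

curl -X PATCH \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"expirationDateTime": "2025-11-29T14:00:00.0000000Z"}' \
  "https://graph.microsoft.com/v1.0/subscriptions/$SUBSCRIPTION_ID"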


The encryptionCertificate value is the content of the cert_clean.txt file that we created earlier.


Decryption of the Messages

A simple bash script to handle the encrypted parts of the messages is below.


dataKey_base64=$(jq -r '.value[].encryptedContent.dataKey' event/event.json)
encrypted_data_base64=$(jq -r '.value[].encryptedContent.data' event/event.json)
dataSignature_base64=$(jq -r '.value[].encryptedContent.dataSignature' event/event.json)

# Decode the base64-encoded symmetric key
echo "$dataKey_base64" | base64 --decode > encrypted_key.bin

# Decrypt the symmetric key using your RSA private key with OAEP padding
openssl pkeyutl -decrypt -inkey key/private.key -pkeyopt rsa_padding_mode:oaep -in encrypted_key.bin -out symmetric_key.bin

# Extract first 16 bytes of symmetric key as IV (hex)
iv=$(xxd -p -l 16 symmetric_key.bin)

# Decode encrypted data
echo "$encrypted_data_base64" | base64 --decode > encrypted_data.bin

# Decrypt using AES-CBC with PKCS7 padding
openssl enc -aes-256-cbc -d -in encrypted_data.bin -out decrypted_data.json \
-K $(xxd -p -c 256 symmetric_key.bin) \
-iv "$iv"


Final Note

As expected for a Microsoft API this is a complicated method to get data. 


Why is double encryption of the messages required? We are already using TLS.

Why can't we subscribe forever?


Anyway, eventually it works and can be used to store the agents' interactions.
Have fun.

Monday, November 24, 2025

CI/CD in a Shell


 

Recently I had to create a CI/CD pipeline for a new project whose source repository was in Bitbucket. There are standard methods to handle this, using triggers from Bitbucket, AWS CodeBuild, and AWS CodePipeline. However, I had only read permissions on the Bitbucket repository and hence was limited in my ability to use the standard tools. I decided to create the CI/CD in bash, and surprisingly I found it extremely simple, as well as cheaper and faster than the standard tools. I am aware of the downsides of using scripts for such processes, such as lack of visibility, redundancy, and standards, but still the result was so good that I think startup projects should definitely consider it.

Listed below are the shell-based CI/CD components.


The Poll Script

The poll script runs on a t3a.nano EC2 instance whose price is ~$3/month.

It polls the Bitbucket repository every 5 minutes, and once a change on the deployment-related branch is detected, it starts the builder EC2 VM and runs the build and deploy script.

#!/bin/bash

set -eE

instanceId=""
publicIp=""
intervalSeconds=300

cleanup() {
  if [ -n "${instanceId}" ]; then
    echo "Stopping instance: ${instanceId}"
    if ! aws ec2 stop-instances --instance-ids "${instanceId}"; then
      echo "Warning: Failed to stop instance ${instanceId}. Will retry on next run."
    else
      echo "Instance stopped successfully."
    fi
    instanceId=""
  fi
}

restart_script() {
  echo "Command '$BASH_COMMAND' failed with exit code $?"
  cleanup
  echo "Restarting soon..."
  sleep ${intervalSeconds}
  exec "$0" "$@"
}

trap 'restart_script "$@"' ERR


runBuild(){
  trap cleanup RETURN

  instanceId=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=my-builder-vm" \
    --query "Reservations[*].Instances[*].InstanceId" \
    --output text)

  echo "Starting instance: ${instanceId}"
  aws ec2 start-instances --instance-ids ${instanceId}

  echo "Waiting for instance to be in 'running' state..."
  aws ec2 wait instance-running --instance-ids ${instanceId}

  publicIp=$(aws ec2 describe-instances \
    --instance-ids ${instanceId} \
    --query "Reservations[0].Instances[0].PublicIpAddress" \
    --output text)

  echo "Running build remote"
  ssh -o StrictHostKeyChecking=no ec2-user@${publicIp} /home/ec2-user/build/my-repo/deploy/aws/production/deploy.sh

  cleanup
  echo "Build done"
}

checkOnce(){
  echo "Check run time: $(date)"
  commitFilePath=/home/ec2-user/build/last_commit.txt
  latestCommit=$(git ls-remote git@bitbucket.org:my-project/my-repo.git my-deploy-branch | awk '{print $1}')
  echo "Latest commit: ${latestCommit}"

  lastCommit=$(cat ${commitFilePath} 2>/dev/null || echo "")
  echo "Last deployed: ${lastCommit}"

  if [ "${latestCommit}" != "${lastCommit}" ]; then
    echo "New commit detected, starting build"
    runBuild
    echo "${latestCommit}" > ${commitFilePath}
    echo "last commit updated"
  else
    echo "No new commits"
  fi
}

while true; do
  checkOnce
  sleep ${intervalSeconds}
done


To make this script part of the poller VM instance startup, use the following:


sudo tee /etc/systemd/system/poll.service > /dev/null <<EOF
[Unit]
Description=Poll Script Startup
After=network.target

[Service]
Type=simple
ExecStart=/home/ec2-user/build/poll.sh
Restart=on-failure
User=ec2-user
WorkingDirectory=/home/ec2-user/build
StandardOutput=append:/home/ec2-user/build/output.txt
StandardError=append:/home/ec2-user/build/output.txt

[Install]
WantedBy=multi-user.target
EOF


sudo systemctl daemon-reload
sudo systemctl enable poll.service # auto-start on boot
sudo systemctl start poll.service # start immediately
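
To verify that the service is running and to watch its output (the log path comes from the unit file above):

sudo systemctl status poll.service
tail -f /home/ec2-user/build/output.txt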


The Build Script - Step 1

The build script runs on a c6i.4xlarge EC2 instance whose price is ~$500/month, but I don't care, since this instance runs only during the deployment itself, so the actual cost is very low here as well.


The script runs in the repository itself, which I manually cloned once after creating the EC2 instance. It only pulls the latest version and runs another "step 2" script to handle the build. The goal is to be able to pick up changes to the "step 2" script as part of the git pull.


#!/bin/bash
set -e

cd /home/ec2-user/build/my-repo
git checkout my-deploy-branch
git pull

./deploy_step2.sh


The Build Script - Step 2

The "step 2" script does the actual work: 

  1. Increments the build number
  2. Builds the docker images
  3. Login to the ECR
  4. Push the images to ECR
  5. Push a new tag to the GIT
  6. uses `helm upgrade` to upgrade the production deployment.


Notice that the EC2 instance uses an IAM role that enables it to access the ECR and the EKS without a username and password, for example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:CompleteLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:InitiateLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster"
      ],
      "Resource": "*"
    }
  ]
}
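
For helm to reach the EKS cluster, the builder instance also needs a kubeconfig. In my setup I assume this is generated once on the builder (and that the role is also mapped to a Kubernetes identity on the cluster side):

aws eks update-kubeconfig --region us-east-1 --name my-eks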


The script is:

#!/bin/bash
set -e

export AWS_ACCOUNT=123456789012
export AWS_REGION=us-east-1
export AWS_DEFAULT_REGION=${AWS_REGION}
export EKS_CLUSTER_NAME=my-eks

rootFolder=/home/ec2-user/build
buildVersionFile=${rootFolder}/build_number.txt

if [[ -f "${buildVersionFile}" ]]; then
lastBuildNumber=$(cat "${buildVersionFile}")
else
lastBuildNumber=1000
fi
newBuildNumber=$((lastBuildNumber + 1))
echo "${newBuildNumber}" > ${buildVersionFile}
echo "Build number updated to: ${newBuildNumber}"

./build_my_images.sh

aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com
RemoteTag=deploy-${newBuildNumber} ./push_images_to_ec2.sh

newTag=deploy-${newBuildNumber}
git tag ${newTag}
git push origin ${newTag}

DEPLOY_VERSION=":${newTag}" ./helm_deploy.sh

echo "I did it again!"


Final Note

This build system is super fast. Why? Because it uses a local cache for the Docker images. This means we do not need a Docker proxy to cache the images, which also makes it cheap.

To sum up: don't use this for a big project, but you can definitely use it for startups.




Monday, November 10, 2025

Microsoft External Threat Detection


 


In this post we review the steps to create an external security provider to protect Microsoft Copilot Studio based agents.

Most of this post is based on this article.

Before starting, prepare yourself: following Microsoft best practice, they've made it a super complex process, but in the end it is working, so that's good.


Provide a Service

We start by implementing a service following this guide.

In general this service should provide two endpoints: /validate and /analyze-tool-execution.

The /validate endpoint is used only to check the service health and the integration with Microsoft authentication. For this post we will not implement the Microsoft authentication validation. Hence a simple implementation of /validate is:



type ResponseSuccess struct {
    IsSuccessful bool   `json:"isSuccessful"`
    Status       string `json:"status"`
}

type Executor struct {
}

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("validate starting")
    auth := p.GetHeader("Authorization")
    log.Info("auth: %v", auth)
    log.Info("validate done")
    return &ResponseSuccess{
        IsSuccessful: true,
        Status:       "OK",
    }
}



The /analyze-tool-execution endpoint is called at each step, before the Copilot agent invokes any action, and should approve or reject the action within 1 second (good luck with that). A simple example implementation is:



type ResponseAllow struct {
    BlockAction bool `json:"blockAction"`
}

type Executor struct {
}

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("analyze tool execution starting")

    inputBytes, err := p.GetBodyAsBytes()
    kiterr.RaiseIfError(err)

    auth := p.GetHeader("Authorization")
    log.Info("auth: %v", auth)
    tenantId := kitjwt.GetJwtValue(auth, "tid")
    applicationRegistrationId := kitjwt.GetJwtValue(auth, "appid")

    log.Info("tenantId: %v", tenantId)
    log.Info("applicationRegistrationId: %v", applicationRegistrationId)
    log.Info("action description: %v", string(inputBytes))

    log.Info("analyze tool execution done")
    return &ResponseAllow{
        BlockAction: false,
    }
}

Once the service is implemented, deploy it and provide it with a valid TLS certificate. For the rest of this post we assume it is available at https://external.provider.com.
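
Assuming the handler above also responds to a plain GET, a quick manual check of the deployed endpoint could be:

curl -s -H "Authorization: Bearer dummy-token" https://external.provider.com/validate
# expected: {"isSuccessful":true,"status":"OK"}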


Register the Domain

Once the service is ready we need to register the domain in entra.microsoft.com.




Notice that as part of the process Microsoft requires you to prove that you own the domain, so you need to add a TXT record to your DNS zone with a value specified by Microsoft.


App Registration

Create a new AppRegistration in entra.microsoft.com.
Then edit the AppRegistration and under "Expose an API" add the URL https://external.provider.com.

Next, edit the AppRegistration and under Certificates & secrets, select the Federated credentials tab, and add a new credential.
Scenario: Other
Issuer: https://login.microsoftonline.com/55fb1683-57de-46d1-8896-f9f3b07b549f/v2.0
Type: Explicit

The get the "Value" you need to run the following script:

# YOUR TENANT ID HERE
$guid = [Guid]::Parse("55fb1683-57de-46d1-8896-xxxxxxxx")
$base64Url = [Convert]::ToBase64String($guid.ToByteArray()).Replace('+','-').Replace('/','_').TrimEnd('=')
Write-Output $base64Url

# YOUR ENDPOINT ID HERE
$endpoint = "https://external.provider.com/analyze-tool-execution"
$base64Url = [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($endpoint)).Replace('+','-').Replace('/','_').TrimEnd('=')
Write-Output $base64Url


This script outputs two values; use them to build the following value:

/eid1/c/pub/t/FIRST_LINE_OUTPUT/a/m1WPnYRZpEaQKq1Cceg--g/SECOND_LINE_OUTPUT
 

Enable The Threat Detection

In https://admin.powerplatform.microsoft.com enable the threat detection.




Final Note

As promised, the super complex process is now done, and agent-related events start streaming into the service, which can approve or block them.


Monday, November 3, 2025

Microsoft AI Agent

 

In this post we will create an agent using Microsoft Copilot Studio.


Disclaimer About Microsoft

I've known Microsoft for more than 30 years. It used to be a monopoly with good products, but over time they lost their path and vision. What is left is a monopoly with bad products. Still, as a monopoly, Microsoft can force the market to use its new products even if they're bad and expensive. A good example of this is the creation of an AI agent using Microsoft Copilot Studio.


License

Unlike other providers, to work with Microsoft products you need a license, regardless of the usage amount. This license is extremely expensive compared with other providers. In case you only want to check the capabilities and are not willing to pay yet, check if you're eligible for the Microsoft E5 developer program.


Copilot Studio

Open the Microsoft Copilot Studio site, select Agents, and create a new agent.



Now we can use an AI to create our agent by simply describing the agent, or we can configure it manually.






Once we click on Create agent, the agent is ready, and we can test it.



We can also configure an MCP server to be used under the Tools section; for example, I've used the public Docusign MCP server.






Once we're done, we can publish the agent, and since Microsoft is a monopoly whose Office is the standard application suite for almost all companies, we can place the agent in an easily accessible location such as Microsoft Teams.


Purview Audit

A must-have capability of an agent service is tracking the chats and tuning the agent configuration to make it more useful. To track the chats we can use Microsoft Purview.

In the Microsoft Purview site, select Solutions, then Audit, and run a new search filtered by time, and optionally filtered by RecordType=CopilotInteraction.





After some time, anywhere between 5 minutes and 5 hours (it is Microsoft, so don't have high expectations), you will get the search result records.




Final Note

We've seen how to use Microsoft products to create an agent and to track the agent's actions. There are other related products that should be used to get a complete solution for agent creation, such as Microsoft Power Apps, the Power Apps admin center, and Microsoft Dataverse. And yes, you will need to pay for each one of these regardless of the usage amount. Have fun.



Tuesday, October 21, 2025

AWS Bedrock Agent



In this post we show the creation of an agent in AWS Bedrock and a simple text chat application using the Python boto3 library.


Create Agent in AWS Console


First open the AWS console, navigate to the AWS Bedrock service, and click on Agents.



Click on Create Agent, enter a name for it, and click Create.

For our demo, we switch to the cheapest available model:



Fill in instructions for the agent:

Now click on Save and Exit.

Click on Prepare to create a draft of the agent:



We can now test that the agent works as expected, or else update the agent instructions.



Now we create an alias for the agent, which is a published version of the agent.
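
Creating the alias can also be scripted; a hedged sketch using the AWS CLI, reusing the agent id shown later in the post:

aws bedrock-agent create-agent-alias \
  --agent-id AB5BQ7PVAL \
  --agent-alias-name production \
  --region us-east-1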




Chat using python boto3

Use the following code to run a simple chat with the agent.

import uuid

import boto3

REGION = "us-east-1"
AGENT_ID = "AB5BQ7PVAL"
AGENT_ALIAS_ID = "9BII1XBJP9"

client = boto3.client('bedrock-agent-runtime', region_name=REGION)

session_id = str(uuid.uuid4())

print("Chat Starting")

while True:
    user_input = input("Enter prompt:")
    if user_input.lower() in ['exit', 'quit']:
        print("Bye")
        break

    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=user_input,
    )

    response_body = response['completion']

    print("Answer:", end=" ", flush=True)
    for chunk in response_body:
        if 'chunk' in chunk:
            print(chunk['chunk']['bytes'].decode("utf-8"), end="", flush=True)
    print()


An example of this chat is below.



Final Note

This is a very simple example of AWS Bedrock agent invocation. Other agent capabilities include guardrails, agent memory across sessions, and multi-agent configuration.

Best practices for agent creation can be found here.


Sunday, October 12, 2025

GO Embed

 



In this post we review the Go //go:embed directive and its implications.


Sample Code


package main

import (
    "embed"
    _ "embed" // the blank import is only needed when embedding into string/[]byte without embed.FS
    "fmt"
)

//go:embed hello.txt
var textFile string

//go:embed hello.txt
var binaryFile []byte

//go:embed data1 data2
var files embed.FS

func main() {
    fmt.Printf("%v bytes:\n%v\n", len(binaryFile), textFile)

    entries, err := files.ReadDir(".")
    if err != nil {
        panic(err)
    }
    for _, entry := range entries {
        fmt.Printf("%v dir: %v\n", entry.Name(), entry.IsDir())
    }
}


Implications

Go embed is a simple way of adding files to the compiled Go binary. It serves as an alternative to making these files available to the application by other means, such as supplying the files as part of a Docker image or mounting the files using a Kubernetes ConfigMap.

Notice the files are added as part of the binary, so embedding large files means a larger output binary.


Embed Methods

There are three methods to embed a file.

First we can add a file as a string. In such a case we should add the explicit embed import:

_ "embed"


Second, we can add the file as a byte array; this is very similar to the first method.


Third, we can include a set of folders as a virtual file system. The directive lists the folders to be included. There is special handling for files whose names start with a dot or an underscore (they are skipped unless the all: prefix is used); see more about this here.


Final Note

While embed is a simple way to add files, it should be used only if we're sure we will not want to change the files in an active running deployment.






Sunday, October 5, 2025

How To Improve LLM Inference Performance


 

In this post we will review possible changes to LLM inference code to make it run faster and use less GPU memory.


LLM inference is the usage of a trained model on new data to produce a classification or a prediction. This is usually the production-time usage of the model that we've selected and possibly fine-tuned for the actual data stream.


The inference runs the following steps:

  1. Load the model into memory once at process startup
  2. Get a single input, or preferably a batch of inputs
  3. Run the forward calculation on the neural network and produce a result

The term LLM performance is actually used for two different subjects:
  1. The precision of the LLM, such as false-positive and false-negative rates
  2. The GPU memory usage and the time of the inference process
We will see later that while these are two different goals, they are actually intertwined.

Below is sample inference code where we can see the model loading and the model invocation.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")

# use the GPU when available
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device(torch_device),
)

text = "some user prompt to classify"  # placeholder input
results = classifier(text)

# analyze the results
...


Let's examine how we can improve the performance of the inference.

Compile The Model

The first and simplest change is to compile the model:

model = torch.compile(model)

This simple instruction can speed up the inference by up to 2x!

The compile call converts the model's Python code into compiled code and optimizes the model operations. Notice that the compile call should run only once, right after the model is loaded, and it has a small impact on the process startup time.

For torch.compile() we need to make sure we have both python-dev and g++ installed. An example of this in a Dockerfile is:

RUN apt-get update && \
apt-get install -y software-properties-common curl wget build-essential g++ && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y python3.12 python3.12-dev && \
/usr/bin/python3.12 --version



Change The Model Precision

By default the model uses float32 precision, which means it has full accuracy in the neural network calculations. In most cases float16 precision does the work just as well while consuming roughly half the time and half the memory. The conversion from float32 to float16 is done by calling half().


model = model.to('cuda').half()

(Notice that model.half() should be called BEFORE torch.compile())


Using half() might cause a small increase in the false-positive and false-negative rates, but in most cases it is negligible.


As a side note, we should mention that we can set the precision to int8 or int4, but this is a less common practice. For the record, here is an analysis of the alternatives from GPT:




Final Note

We have reviewed methods of improving the memory footprint and runtime of an LLM. While there are some small implications for the accuracy, these methods should be used as a common practice for any LLM implementation.



Monday, September 22, 2025

NPX


 

In this post we will review NPX, the Node Package Execute tool.

NPX is a command-line utility that is installed as part of the Node installation. Notice that this means the npx version is coupled with the Node version.


NPX temporarily installs packages that are used for a "script-like" execution of a package. Instead of using npm to install a package globally and then running it, npx handles both.


The first step of NPX is to download the required package and its dependencies. The download target is:

~/.npm/_npx/<HASH>

The hash is based on the name and version of the package that NPX runs. Notice that the folder is never automatically removed, so once a package is downloaded it will not be re-downloaded, unless we manually remove the cache folder or run a different version.

NPX, however, will not download to the cache folder if the package already exists in the current project's node_modules folder, or if the package is globally installed.

To force NPX to download the latest version of a package and ignore any locally or globally installed version, we should specify the NPX flag --ignore-existing.


Common usages of NPX are:

npx create-react-app my-app
This sets up the skeleton of a new React-based application.

npx serve
This runs a file server in the current folder, enabling a quick review of the HTML files using a browser.


By default NPX downloads the package and then checks for the "bin" entry in package.json, which specifies the JavaScript file to run. However, we can manually determine the command to run using the syntax:

npx --package my-package my-command

In such a case NPX downloads the package, then looks for the command under the bin element in package.json and runs it. Notice that we cannot run an arbitrary JavaScript file using NPX, only the predefined entries in the bin element. We can, however, download the package using NPX and then run any JavaScript file using node from the local NPX cache folder, as sketched below.
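
A hedged sketch of that flow (package name, command, hash, and file path are placeholders):

# download the package into the npx cache without installing it globally
npx --package some-package some-command

# locate the cache folder and run an arbitrary file with node
ls ~/.npm/_npx/
node ~/.npm/_npx/<HASH>/node_modules/some-package/some-file.js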


Monday, September 15, 2025

Create GO MCP Server and a Valid SSL Certificate

 



In this post we create a Go-based MCP server.

The server supports both SSE and streamable HTTP. Gemini CLI supports only SSE, but the standard seems to be moving toward the streamable HTTP protocol.

We support both an HTTP server on port 80 and an HTTPS server on port 443. When using the HTTPS server on port 443, we also start a certificate management listener on port 80, which obtains a valid certificate as long as we have a valid DNS record pointing to our server. Notice that for the certificate to be issued we need to allow all source IPs to reach the port 80 listener, since the validation requests arrive from the Let's Encrypt servers.


package main

import (
    "context"
    "encoding/json"
    "fmt"
    "net/http"

    "github.com/modelcontextprotocol/go-sdk/jsonschema"
    "github.com/modelcontextprotocol/go-sdk/mcp"
    "golang.org/x/crypto/acme/autocert"
)

func main() {
    implementation := mcp.Implementation{
        Name: "Demo MCP Server",
    }
    mcpServer := mcp.NewServer(&implementation, nil)

    mcpServer.AddTool(toolSchema(), toolExecute)

    useSse := false
    var mcpHandler http.Handler
    if useSse {
        mcpHandler = mcp.NewSSEHandler(func(*http.Request) *mcp.Server {
            return mcpServer
        })
    } else {
        mcpHandler = mcp.NewStreamableHTTPHandler(func(request *http.Request) *mcp.Server {
            return mcpServer
        }, nil)
    }

    useTls := true
    if useTls {
        certManager := autocert.Manager{
            Prompt:     autocert.AcceptTOS,
            HostPolicy: autocert.HostWhitelist("my.domain.com"),
            Cache:      autocert.DirCache("certs"),
        }

        httpServer := &http.Server{
            Addr:      "0.0.0.0:443",
            Handler:   mcpHandler,
            TLSConfig: certManager.TLSConfig(),
        }

        go func() {
            err := http.ListenAndServe("0.0.0.0:80", certManager.HTTPHandler(nil))
            if err != nil {
                panic(err)
            }
        }()

        err := httpServer.ListenAndServeTLS("", "")
        if err != nil {
            panic(err)
        }
    } else {
        err := http.ListenAndServe("0.0.0.0:80", mcpHandler)
        if err != nil {
            panic(err)
        }
    }
}


Now we implement the tool, which can both read input parameters and return a string result for the AI agent to send to the LLM.


type inputParameters struct {
    Name string `json:"name"`
}

func toolSchema() *mcp.Tool {
    return &mcp.Tool{
        Name:        "greet",
        Description: "Say hi from me",
        InputSchema: &jsonschema.Schema{
            Type:     "object",
            Required: []string{"name"},
            Properties: map[string]*jsonschema.Schema{
                "name": {
                    Type: "string",
                },
            },
        },
    }
}

func toolExecute(
    _ context.Context,
    _ *mcp.ServerSession,
    params *mcp.CallToolParamsFor[map[string]any],
) (
    *mcp.CallToolResultFor[any],
    error,
) {
    bytes, err := json.Marshal(params.Arguments)
    if err != nil {
        panic(err)
    }

    var input inputParameters
    err = json.Unmarshal(bytes, &input)
    if err != nil {
        panic(err)
    }

    content := mcp.TextContent{
        Text: fmt.Sprintf("Hi %v, Demo MCP is at your service", input.Name),
    }

    result := mcp.CallToolResultFor[any]{
        Content: []mcp.Content{
            &content,
        },
    }

    return &result, nil
}



Monday, September 8, 2025

Create a Python MCP Server

 




MCP is the standard protocol for exposing tools for AI agents to use. The MCP server exposes APIs and the documentation for each API to the LLM. In this post we create a simple Python-based MCP server and use it in Gemini CLI.


Create The MCP Server


To prepare the MCP server project use:

# install UV in case it is not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

uv init magic_server
cd magic_server
uv venv
uv add "mcp[cli]"


Next we add our main.py and expose our tool:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Demo")

@mcp.tool(name="do_magic_trick")
def do_magic_trick(a: int, b: int) -> int:
    return a + b + 6

mcp.run(transport="stdio")
# mcp.run(transport="streamable-http")
# mcp.run(transport="sse")


We can run the MCP server as a local tool using STDIO for the lower-level communication, or use SSE or streamable-http to connect to the server over the network.


Run Gemini CLI and MCP Server STDIO


To use the Gemini CLI, make sure you have a recent version of Node, and run:

npx https://github.com/google-gemini/gemini-cli

We run the MCP server as a local tool using STDIO. To configure this, update the file ~/.gemini/settings.json:


{
  "selectedAuthType": "oauth-personal",
  "mcpServers": {
    "pythonTools": {
      "command": "/home/my-user/.local/bin/uv",
      "args": [
        "run",
        "main.py"
      ],
      "cwd": "/home/my-user/magic_server",
      "env": {
      },
      "timeout": 15000
    }
  }
}

Once we restart the Gemini CLI, it will run the MCP server to get the metadata of the exposed tools, and we can use the tools simply by using the prompt: "do the magic_trick for 5 and 6".


Run Gemini CLI and MCP Server Service

To expose the MCP server as a public tool, we need to run it over the network. For this we can use SSE or streamable HTTP, so update main.py to use the relevant transport.


Notice:

1. At this time the Gemini CLI supports only SSE, but it seems that streamable HTTP is the winning standard.

2. When using TLS (HTTPS access), the server should use a valid TLS certificate.


To configure the gemini to use the tool over the network, update the file ~/.gemini/settings.json:

{
  "selectedAuthType": "oauth-personal",
  "mcpServers": {
    "discoveredServer": {
      "url": "http://localhost:8000/sse"
    }
  }
}


Unlike running the MCP server over STDIO, in this case we need to start the server ourselves:

uv run main.py



Sunday, August 31, 2025

UV

 


In this post we will review uv, a new Python project manager.


I am not a heavy Python user. I generally avoid using Python for long-lived projects, as its maintainability is more complex due to its limited variable typing and its single-core usage. I usually use Python for very short-lived projects or for LLMs, which are supported almost only in Python. Over the years I got used to all the Python pains:


  • Installing and using the correct version of Python and pip
  • Creating the Python venv
  • Managing dependencies using requirements.txt, which somehow never works


And then, about a year ago, a new tool emerged: uv.

uv provides a complete solution for the entire Python project management. It includes:


  • Python version installation and management
  • Adding and locking dependencies
  • New project creation
  • venv management
  • Running helper tools


The funny thing about uv is that it is written in Rust, which is, in my opinion, kind of an insult to Python.


Anyway, listed below are some basic uv actions.


Create A New Project

To create a new project run the following commands.


mkdir demo
cd demo
uv init
uv python list
uv python pin 3.12
uv venv


If you use PyCharm, configure it to use uv (make sure you have the latest version of PyCharm):


PyCharm Settings --> Python --> Interpreter --> select the existing interpreter from the .venv


Dependencies

Adding dependencies is simple; for example, add Flask:

uv add flask
uv lock

Notice that uv.lock is automatically updated whenever another dependency is added.
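
To recreate the environment from the lock file on another machine (or in CI), the following should do:

uv sync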

Unit Tests and Coverage

To run unit tests and coverage, add the dependencies as dev dependencies, and then run the related commands.

uv add --dev pytest coverage
uv run -m coverage run -m pytest
uv run -m coverage report



An example of such tests is below.



main.py

def add(a, b):
    return a + b

test_main.py

from main import add

def test_add():
    assert add(2, 3) == 5