In this post we will review possible changes to LLM inference code that make it run faster and use less GPU memory.
LLM inference is the use of a trained model on new data to produce a classification or a prediction. This is usually the production-time usage of the model we've selected, and possibly fine-tuned, for the actual data stream.
The inference runs the following steps:
- Load the model into memory once, at process startup
- Receive an input, either a single item or preferably a batch of inputs
- Run the forward pass through the neural network and produce a result
The term LLM performance is actually used for two different subjects:
- The accuracy of the LLM, measured by metrics such as the false-positive and false-negative rates
- The GPU memory usage and runtime of the inference process
We will see later that while these are two different goals, they are actually intertwined.
Below is sample code for model inference, showing both the model loading and the inference call.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and model once, at process startup
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/deberta-v3-base-prompt-injection-v2")

# Select the target device (GPU when available)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Build a text-classification pipeline around the model
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device(torch_device),
)

# Run inference on the input text
results = classifier(text)
# analyze the results
...
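Since the steps above recommend batching, it is worth noting that the pipeline call also accepts a list of texts together with an optional batch_size argument. The example texts and the batch size of 8 below are placeholders for illustration, not values from the original code:

# `texts` stands in for a batch of inputs; the pipeline splits it into
# batches of 8 and runs one forward pass per batch
texts = ["first input", "second input", "third input"]
results = classifier(texts, batch_size=8)
for item, result in zip(texts, results):
    print(item, result["label"], result["score"])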
Let's examine how we can improve the performance of the inference.
Compile The Model
The first and simplest change is to compile the model:
model = torch.compile(model)
This simple instruction can make inference up to 2 times faster!
The compile command converts the model's Python code into compiled code and optimizes the model's operations. Notice that the compile command should run only once, right after the model is loaded, and it has a small impact on the process startup time.
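As a minimal sketch, reusing the tokenizer, model, and torch_device from the snippet above (the warm-up string is an invented placeholder): the compile call sits right after loading, and a warm-up pass absorbs the one-time compilation cost, since torch.compile is lazy and does the heavy work on the first forward pass.

# Move the model to the device and compile it once, right after loading
model = model.to(torch_device)
model = torch.compile(model)

# Warm-up call: the first inference triggers the actual compilation and is
# slow; subsequent calls run at the optimized speed
inputs = tokenizer("warm-up text", return_tensors="pt",
                   truncation=True, max_length=512).to(torch_device)
with torch.no_grad():
    _ = model(**inputs)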
For the compile() instruction to work we need to make sure that both python-dev and g++ are installed. An example of this in a Dockerfile is:
RUN apt-get update && \
apt-get install -y software-properties-common curl wget build-essential g++ && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y python3.12 python3.12-dev && \
/usr/bin/python3.12 --version
Change The Model Precision
By default the model uses float32 precision, which means full accuracy in the neural network calculations. In most cases float16 precision does the job just as well while consuming roughly half the time and half the memory. The conversion from float32 to float16 is done by calling half():
model = model.to('cuda').half()
(Notice that model.half() should be called BEFORE torch.compile())
Using half() might cause a small increase in the false-positive and false-negative rates, but in most cases it is negligible.
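To make the ordering concrete, here is a minimal sketch that reuses the model loaded earlier and applies the two calls in the required order:

# Order matters: move to the GPU and convert to float16 first...
model = model.to('cuda').half()
# ...and only then compile the float16 model
model = torch.compile(model)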
As a side note, we should mention that the precision can also be reduced to int8 or int4 (quantization), but this is a less common practice.
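For illustration only, here is a hedged sketch of loading the same model with int8 weights through the transformers quantization support. It assumes the optional bitsandbytes and accelerate packages are installed, which the setup above does not include, and it has not been verified for this specific model:

from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Load the weights in int8 (requires bitsandbytes); device_map="auto"
# lets accelerate place the quantized layers on the available GPU
model = AutoModelForSequenceClassification.from_pretrained(
    "ProtectAI/deberta-v3-base-prompt-injection-v2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)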
Final Note
We have reviewed methods for improving the memory footprint and runtime of LLM inference. While there are some small implications for accuracy, these methods should be common practice in any LLM implementation.