
Monday, January 26, 2026

NATS Performance Tuning


 


In this post we will see how to install a NATS cluster with configuration tuned for high throughput.


The following is an example script to install a tuned NATS cluster:


#!/usr/bin/env bash
set -e
cd "$(dirname "$0")"

helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update

rm -f nats.yaml

cat <<EOF > nats.yaml
config:
  jetstream:
    enabled: false

  cluster:
    enabled: true
    replicas: 4

  merge:
    # Bigger messages & buffers
    max_payload: 8388608 # 8 * 1024 * 1024 bytes
    write_deadline: "2s"
    max_pending: 536870912 # 512 * 1024 * 1024 bytes

    # Connection & subscription scaling
    max_connections: 100000
    max_subscriptions: 1000000

    # Disable anything not needed
    debug: false
    trace: false
    logtime: false

container:
  merge:
    resources:
      limits:
        cpu: "4"
        memory: "2Gi"
EOF

helm delete nats --ignore-not-found

helm upgrade --install nats \
nats/nats \
--namespace default \
--create-namespace \
-f nats.yaml

rm -f nats.yaml


We increase the NATS buffering using max_pending. This configures NATS to buffer outgoing messages, allowing producers to publish faster, with the consumers eventually catching up.


Producers and consumers connect to a random available NATS pod, so with a small number of producers, consumers, and NATS pods, the load on the NATS pods will probably be unbalanced.

To avoid this, use a connection pool in both producers and consumers, spreading the load among the NATS pods.


With a message size of 1 KB, a single NATS pod can handle ~1M messages/second. Make sure the NATS pods are not saturated on CPU or memory.
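The connection pool idea can be sketched in GO as follows. `Pool` is a hypothetical type for illustration; in real code the slice would hold `*nats.Conn` values created by multiple `nats.Connect` calls:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Pool spreads publish/subscribe work across several connections in
// round-robin order, so each NATS pod receives a similar load.
type Pool struct {
	conns []string // stand-in for []*nats.Conn, to keep the sketch self-contained
	next  uint64
}

// Get returns the next connection in round-robin order; safe for
// concurrent use thanks to the atomic counter.
func (p *Pool) Get() string {
	i := atomic.AddUint64(&p.next, 1) - 1
	return p.conns[i%uint64(len(p.conns))]
}

func main() {
	p := &Pool{conns: []string{"nats-0", "nats-1", "nats-2"}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.Get())
	}
}
```

Each publish grabs the next connection, so the traffic is spread evenly across the NATS pods.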


Tuesday, January 20, 2026

Kind Load vs. Local Repository


 


In this post we reviewed using kind for local development. One of the requirements for such a deployment is loading the image into kind using:

kind load docker-image ${DockerTag}

The kind load CLI sends the entire image to the kind cluster, without using the docker image layers concept. This means that any image update takes a much longer time, especially for large images.


To improve this we can use a local docker repository, and configure kind to load the images from it. To deploy a local repository on our development laptop we use:


docker run -d \
--restart=always \
-p 5000:5000 \
--name kind-registry \
-v $HOME/.kind/registry:/var/lib/registry \
registry:2


Then we re-create the kind cluster with configuration to use the local registry:


cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
- |
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
    endpoint = ["http://kind-registry:5000"]
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
EOF


To enable kind to communicate with the docker repository container, we connect the networks:

docker network connect kind kind-registry


And it is ready!


We can now push the built images into the local repository as part of our build script:

LocalRegistryTag=localhost:5000/my-image:latest
docker tag ${DockerTag} ${LocalRegistryTag}
docker push ${LocalRegistryTag}


We should also use an updated helm configuration for the images location, as well as instruct kubernetes to always pull the images from the registry. This is required since we usually do not change the image version on a local build, and simply use `latest`:


image:
  registry: localhost:5000/
  version: /dev:latest
  pullPolicy: Always


The result is a faster build and deployment process on the local laptop.


Tuesday, January 13, 2026

GO Coding Guidelines

 



In this post I specify examples of GO coding guidelines.

Many of these guidelines are language agnostic, so this can be used as a base for any other language.


Use full words

Names are important for clear text.

Always use full words.

For camel case, make sure to use it even for abbreviations:

  • myInstance

  • ssnFullText

  • clientIpAddress

No Comments

Our code should be clear and readable.

This means we should have comments only in case we do something unusual.

Do not comment on the obvious.

Look around for standards

Always look around for standards before making a change.

Someone had invested some thought about this code, and decided on a standard,

so we want to continue using the same standard and not create a new standard to confuse future development.

Each class in its own package

Each class should be in its own package.

Do not include more than a single class in a package.

Clear TODOs

Do not leave TODOs in the code

Switch case default

Do not assume the default in a switch case is something that should not be handled.

In case we should not get to the default, raise an error.
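For example (a sketch; `State` and `describeState` are illustrative names):

```go
package main

import "fmt"

type State int

const (
	StateActive State = iota
	StateClosed
)

// describeState handles every known state; an unknown value is a bug,
// so the default raises an error instead of silently returning something.
func describeState(s State) (string, error) {
	switch s {
	case StateActive:
		return "active", nil
	case StateClosed:
		return "closed", nil
	default:
		return "", fmt.Errorf("unexpected state: %v", s)
	}
}

func main() {
	text, err := describeState(State(42))
	fmt.Println(text, err)
}
```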

Each Parameter in its own line

Each function parameter should be on its own line, even when it is the only parameter.

Boolean condition

A boolean variable is a gift for usage in a condition.

Never compare a boolean to true or false.

Unused code

All unused code must be removed. This includes: variables, functions, consts, comments, commented code and anything…

One letter self

Always use one letter for the “self” (the method receiver).

Explicit struct

Always create explicit structs.

Each struct should have a name and a definition.

Avoid using structs on the fly (anonymous structs).

Limit indentation

We are human, and we cannot understand complicated code with more than 3 levels of indentation.

Use functions, and build a better clear design.

Empty array

When you need an empty array, always use nil

Max 15 code lines per function

Keep the functions short and clean

In case you have a function with more than 15 code lines,

split it wisely while keeping the code clear and reducing coupling.

Remove unnecessary code

Avoid adding code that is already handled by the GO language, for example:

  • Looping over a nil array is ok; no need to add a condition checking whether the array is nil.
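A minimal illustration of this, combined with the "Empty array" guideline above (`sum` is an illustrative name):

```go
package main

import "fmt"

// sum needs no nil check: ranging over a nil slice simply performs
// zero iterations.
func sum(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}

func main() {
	var empty []int // nil, per the "Empty array" guideline
	fmt.Println(sum(empty), sum([]int{1, 2, 3}))
}
```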

Two words for public items

To bypass the IDE bug, any public struct field and public method must have at least 2 words.

Struct is always a pointer in function

Whenever sending a struct to a function or receiving a struct from a function,

use a pointer to the struct to avoid memory duplication.

Notice that time.Time is also a struct.

Function for nothing

In case you have a function that does only a simple statement, rethink whether it is really required.

You might still want to keep it to avoid code duplication,

but otherwise remove the function and just call the statement directly.

Reduce Coupling in function parameters

Design the functions to receive as few parameters as possible.

This reduces the coupling between the calling code and the function.

Reduce coupling in object fields

The idea of an object is that we have fields that are shared among all methods.

Hence, do not send the fields as parameters.

Keep your private stuff private

Use public only if you must expose.

Otherwise, avoid upper-case (exported) functions, variables, and structures.

Don’t return errors

When you have errors, raise them; do not return an error as the function return value.

The only case when you need to handle errors is when you have a new top level code, such as a new go routine.

No hard coded configuration

Avoid using hard coded configuration.

Use environment variables, or better - use some kind of mechanism that can be updated on the fly.

JSON tags

Usage of JSON tags is required only when communicating with an external entity, or when saving to redis.

Don’t add JSON tags just because you can.

Use OOD

In a component that receives some common input parameters that are used across its functions,

avoid sending the parameters over and over again.

Instead, create an object, set the input in the object’s fields, and avoid resending the parameters.
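A sketch of this guideline (`Uploader` and its fields are illustrative):

```go
package main

import "fmt"

// Uploader holds the common inputs once, instead of passing bucket and
// region to every function.
type Uploader struct {
	bucket string
	region string
}

func NewUploader(bucket string, region string) *Uploader {
	return &Uploader{bucket: bucket, region: region}
}

// ObjectUrl takes only the per-call input; bucket and region come from
// the object's fields, not from parameters.
func (u *Uploader) ObjectUrl(key string) string {
	return fmt.Sprintf("https://%s.s3.%s.amazonaws.com/%s", u.bucket, u.region, key)
}

func main() {
	u := NewUploader("my-bucket", "us-east-1")
	fmt.Println(u.ObjectUrl("a.txt"))
}
```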

Constants usage

Constants should be used if and only if a value is used more than once

Tests

Any code should have a project test.

Make sure to check the code also manually in the local KIND cluster.

Memory usage 

When using a persistent in-memory state, we must make sure the state memory is limited.

We must make sure the pod will not crash due to memory usage.

Passwords to frontend

Never return passwords from the backend to the GUI; this is a security issue.

Logging to the relevant logger

Always use a common logger library for logging.

Make sure to include the ability to dynamically update the log level for a running pod.


Avoid complex types

We are simple humans, and we understand simple types.

If you have a complex type, create structures to simplify it.

Array of structs

An array of structs should always use pointers to the structs.

Do not confuse this with a pointer to array.

Provide error details

When we have an error, we can find the exact location using the error stack trace.

However, we cannot know the values that caused the error.

Provide more details in such cases.

Don’t hide errors

When we have errors we should not hide them.

The default should be to raise the error.

If this is something where we expect errors and we are 100% sure we cannot overcome them by retries or similar,

at least log them with WARN.

Don’t bypass bugs

If we have bugs, we want to raise errors, and not to avoid them.

Do not create code that ignores bugs.

Avoid code duplication

Duplicate code makes the understanding of the code harder,

and leaves room for bugs in case someone updates only one copy of the code.

Safe Locking

Always prefer clean locking that guarantees the lock is released in case of error.

So we should have a function starting with the following:

a.mutex.Lock()
defer a.mutex.Unlock()
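A complete sketch of the pattern (`Counter` is illustrative); the `defer` guarantees the mutex is released even if the code below raises an error (panics):

```go
package main

import (
	"fmt"
	"sync"
)

type Counter struct {
	mutex sync.Mutex
	value int
}

// Increment locks on entry and defers the unlock, so the mutex is
// released on every exit path.
func (c *Counter) Increment() int {
	c.mutex.Lock()
	defer c.mutex.Unlock()

	c.value++
	return c.value
}

func main() {
	c := &Counter{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Increment()
		}()
	}
	wg.Wait()
	fmt.Println(c.value)
}
```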



Monday, December 22, 2025

Create Module Federation Using NX

 



In this post we will review the steps to create a module federation using NX


What is NX?

NX is a monorepo tool for managing multiple projects in a single repository. In this scenario it is used both as a wizard to create the projects, and as the project runner.

What is Module Federation?

We can see a module federation as a "live" remote import of a library. The module federation includes:

Host: the main web server the browser is connecting to.

Remote: the module web server that the host is importing a library from.

The general idea is to have a microservice approach to the front-end architecture, allowing independent update of each microservice.

Creation Steps

Well, it works, but you will get many warnings.

Don't panic.


Step 1 - Create a new Nx workspace

npx create-nx-workspace@latest my-nx-workspace

  • Select Minimal starter 
  • Try the full Nx platform? … NO! This will use NX Cloud

 

cd my-nx-workspace
npm install --save-dev @nx/react
nx g @nx/react:app shop --bundler=webpack --style=css

  • Would you like to add routing to this application? No
  • Which linter would you like to use? … eslint
  • E2E test runner? None
  • Which port? 4200


nx serve shop

Open browser on http://localhost:4200, and verify it is working.

Press Ctrl+C on the NX command to stop and continue to the next step.

Step 2 - Convert shop into a Module Federation host 

nx g @nx/react:host shop --bundler=webpack

  • Which stylesheet format would you like to use? … CSS
  • Which test runner? None
  • Which bundler? webpack


Recheck everything is still running

nx serve shop

Open browser on http://localhost:4200, and verify it is working.

Press Ctrl+C on the NX command to stop and continue to the next step.

Step 3 - Create Remote App

nx g @nx/react:remote cart --host=shop --style=css --bundler=webpack


In the first terminal window run the remote:

nx serve cart

In the second terminal window run the host:

nx serve shop


Open browser on http://localhost:4200, and verify you can see both the host (Home) and the remote (Cart).

Final Note

We have reviewed the steps to create a module federation.

This is just a simple tutorial, but in real life it is much more complex. You need to handle cases where the remote is down, proxy AJAX requests from the remote, handle CORS, mismatched dependencies and versions, and more.

Bottom line, avoid this architecture whenever possible, and use it only if you have to for a very complex infrastructure that has high maintenance costs.


Monday, December 15, 2025

Collecting Kubernetes Logs using LOKI


 


In this post we review the steps to deploy a kubernetes logs collection system using LOKI, Promtail, and grafana.


  • Grafana is a front-end GUI enabling view of the logs.
    I've already covered grafana deployment on kubernetes in this post.

LOKI is a log storage and query component that works like a charm in a kubernetes environment

  • Promtail is responsible for sending the pods logs to LOKI.


Let's review the deployment steps.

Part of this is based on the LOKI deployment guide.


LOKI AWS entities

Create LOKI logs storage buckets:

aws s3api create-bucket --bucket agentic-loki-chunks --region us-east-1
aws s3api create-bucket --bucket agentic-loki-ruler --region us-east-1

Create a permission policy for LOKI

rm -f loki-s3-policy.json

cat <<EOF > loki-s3-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LokiStorage",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::agentic-loki-chunks",
        "arn:aws:s3:::agentic-loki-chunks/*",
        "arn:aws:s3:::agentic-loki-ruler",
        "arn:aws:s3:::agentic-loki-ruler/*"
      ]
    }
  ]
}
EOF


aws iam create-policy --policy-name LokiS3AccessPolicy --policy-document file://loki-s3-policy.json
rm loki-s3-policy.json

Create LOKI trust policy to enable it to use the role in the kubernetes cluster

rm -f trust-policy.json

cat << EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::662909476770:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/873FB195FF4FAEC482E18822F7D4CBF9"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/873FB195FF4FAEC482E18822F7D4CBF9:sub": "system:serviceaccount:loki:loki",
          "oidc.eks.us-east-1.amazonaws.com/id/873FB195FF4FAEC482E18822F7D4CBF9:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF




Create the role and attach the policy


aws iam create-role --role-name LokiServiceAccountRole --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name LokiServiceAccountRole --policy-arn arn:aws:iam::662909476770:policy/LokiS3AccessPolicy

LOKI Deployment

We use helm to deploy LOKI on the kubernetes cluster.


kubectl create namespace loki

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

rm -f values.yaml

cat << EOF > values.yaml
loki:
  schemaConfig:
    configs:
      - from: "2024-04-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    aws:
      region: us-east-1
      bucketnames: agentic-loki-chunks
      s3forcepathstyle: false
  ingester:
    chunk_encoding: snappy
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 672h # 28 days retention
  compactor:
    retention_enabled: true
    delete_request_store: s3
  ruler:
    enable_api: true
    storage:
      type: s3
      s3:
        region: us-east-1
        bucketnames: agentic-loki-ruler
        s3forcepathstyle: false
    alertmanager_url: http://prom:9093 # The URL of the Alertmanager to send alerts (Prometheus, Mimir, etc.)

  querier:
    max_concurrent: 4

  storage:
    type: s3
    bucketNames:
      chunks: "agentic-loki-chunks"
      ruler: "agentic-loki-ruler"
    s3:
      region: us-east-1

serviceAccount:
  create: true
  annotations:
    "eks.amazonaws.com/role-arn": "arn:aws:iam::662909476770:role/LokiServiceAccountRole"

deploymentMode: Distributed

ingester:
  replicas: 2
  zoneAwareReplication:
    enabled: false

querier:
  replicas: 2
  maxUnavailable: 1

queryFrontend:
  replicas: 2
  maxUnavailable: 1

queryScheduler:
  replicas: 2

distributor:
  replicas: 2
  maxUnavailable: 1

compactor:
  replicas: 1

indexGateway:
  replicas: 2
  maxUnavailable: 1

ruler:
  replicas: 1
  maxUnavailable: 1

gateway:
  service:
    type: ClusterIP
  basicAuth:
    enabled: false

lokiCanary:
  extraArgs: []
  extraEnv: []

minio:
  enabled: false

backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 0
EOF

helm upgrade -i --values values.yaml loki grafana/loki -n loki
rm values.yaml


Promtail

We use helm to deploy Promtail

rm -f values.yaml

cat << EOF > values.yaml
extraVolumes:
  - name: positions
    emptyDir: {}

extraVolumeMounts:
  - name: positions
    mountPath: /promtail/positions

config:
  clients:
    - url: http://loki-distributor.loki.svc.cluster.local:3100/loki/api/v1/push
      tenant_id: test

  positions:
    filename: /promtail/positions/positions.yaml

  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: pod
EOF

helm upgrade --install promtail grafana/promtail \
--namespace promtail --create-namespace \
-f values.yaml

rm values.yaml


Grafana


To view logs in grafana, add LOKI as a datasource under Connections, Data sources:
http://loki-gateway.loki.svc.cluster.local

and add a custom header:
X-Scope-OrgID = test


Then add a visualization:

  • New dashboard
  • Settings, variables, add
    • Show on dashboard: "Label and Value"
    • Query type: "Label Values"
    • Label: "container"
    • Save
  • Add visualization
    • select on the top right "Logs"
    • select LOKI as data source
    • add filter container=${container}

Final Note


In this post we have reviewed the steps to collect kubernetes pods logs using Promtail, LOKI, and grafana. We have used the LOKI Distributed (microservices) mode, which is the preferred mode for a large kubernetes cluster; for a small cluster the Simple Scalable mode is also an option.


Monday, December 1, 2025

Subscribing To Microsoft Copilot Events


 

In this post we will review how to subscribe to Microsoft Copilot events. 

Notice that the Microsoft Copilot uses a totally different mechanism than the Microsoft Copilot Studio agents, see the Microsoft External Threat Detection post.


Create Encryption Key

We start by creating an asymmetric encryption key:


openssl genrsa -out private.key 2048
openssl req -new -x509 -key private.key -out publicCert.cer -days 365
base64 publicCert.cer > publicCertBase64.txt
awk 'NF {printf "%s", $0}' publicCertBase64.txt > cert_clean.txt


Create App Registration

Use Microsoft Entra to create a new App Registration with the permission AiEnterpriseInteraction.Read.All. Notice this permission is under the "Microsoft Graph" section.

After adding the permissions to the AppRegistration, click the Grant Admin Consent button for the permissions.

We also add a client secret, allowing the AppRegistration to be used in a script. As far as I could see, there is no GUI available for this, and we must use a script.


Create a Service

To supply a subscription endpoint that Microsoft will send the messages to, create a publicly available service using a valid TLS certificate. For example, the endpoint can be:

https://my-site.com/interactions

Notice this endpoint should accept both GET and POST requests.

A very simple example of such an endpoint is below.

func (e *Executor) Execute(p web.Parser) interface{} {
    log.Info("interactions starting")
    validation := p.QueryParam("validationToken")
    log.Info("token: %v", validation)

    data, err := p.GetBodyAsBytes()
    if err != nil {
        kiterr.RaiseIfError(err)
    }
    log.Info("body: %v", string(data))

    p.SetHeader("Content-Type", "text/plain")
    p.WriteStreamingResponse([]byte(validation))

    return nil
}

Call the Subscribe API

Use the following to subscribe to events:

#!/bin/bash

TENANT_ID="12345678-1234-1234-1234-123456789012"
CLIENT_ID="12345678-1234-1234-1234-123456789012"
CLIENT_SECRET="abcdefghijklmnopqrstuvwxyz1234567890abcd"

request_token(){
    SCOPE=graph.microsoft.com
    curl -s -X POST "https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token" \
        -H "Content-Type: application/x-www-form-urlencoded" \
        -d "client_id=$CLIENT_ID&scope=https%3A%2F%2F${SCOPE}%2F.default&client_secret=$CLIENT_SECRET&grant_type=client_credentials" \
        | jq -r '.access_token'
}

request_subscription(){
curl -H "Authorization: Bearer $ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '
{
"changeType": "created,deleted,updated",
"notificationUrl": "https://my-site.com/interactions",
"resource": "/copilot/interactionHistory/getAllEnterpriseInteractions",
"includeResourceData": true,
"encryptionCertificate": "LS0tLS1CRUdJTiBDRjhgjkhgkjhgkjhJKJHKJLHLKJHLKJHKJLlkjhlkjhlkjhlkjhlkjhkojghlkjhlkjhlkjhlkjhlkjKJLHLKJHLKJHLKJH8769869876IUGHKLJHKJLYH876Y87H78YH87BN87HKJBJKHGKJLGKJLHlkjhlkjhkljhkjhlkjhlhjkEdVeElUQWZCZ05WQkFvTQpHRWx1ZEdWeWJtVjBJRmRwWkdkcGRITWdVSFI1SUV4MFpEQWVGdzB5TlRFeE1qY3hNakkzTURoYUZ3MHlOakV4Ck1qY3hNakkzTURoYU1FVXhDekFKQmdOVkJBWVRBa2xNTVJNd0VRWURWUVFJREFwVGIyMWxMVk4wWVhSbE1TRXcKSHdZRFZRUUtEQmhKYm5SbGNtNWxkQ0JYYVdSbmFYUnpJRkIwZVNCTWRHUXdnZ0VpTUEwR0NTcUdTSWIzRFFFQgpBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRRFB4Ny8wVzc4N0NLUUh0dHMyVDBoL25LZ0o1ejArb1ZHeFFzcFhSWnlnCnBuanpETkdqUjBtWGFVU2RTZ2JWNW05MDMrNnhqbS9LbHpuTlltOTdoUjJNcnBFSXd1OVVYaWhxU1FTS1ZVcTkKbDk0OVEzME5PK29lT0Z4K3huOC9ycGFMVmpxUzIzR3VUV09Ka3p2aktPeXVnV1BRN3FBazgrdjQ3NjdVUkVvYQpJV2l3aXBIVW4rajBMOTVDTEtFOUZQUXdLMkUzNnZrdWNzd1krSGh5bm45N1piSGszVUM3NXd1QlYwTWVyT0o2CjVQdTFYQUVPZ2JnSFFVUEhuVkViT05MdkNwSUl1MHZlZDZFZmRQbVlzTk1IK2xHSlBOZnFOemRYSEZYSXE4VWMKbHdjbDlPRllUb0dMSEdHWTJiRWpzNWxFUjN1OWtLNFlvc1llUFc2ZmJ3NHhBZ01CQUFHalV6QlJNQjBHQTFVZApEZ1FXQkJUbW45UTBBcmFtVFNTK0phbWtIbzR3eVVVSDd6QWZCZ05WSFNNRUdEQVdnQlRtbjlRMEFyYW1UU1MrCkphbWtIbzR3eVVVSDd6QVBCZ05WSFJNQkFmOEVCVEFEQVFIL01BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRREwKRVh2cnUxb0NKNXlERVc2Njc3RlRuQWt5bitheWJqQXBaVmRiRi9vMXZyZWZKWHVBVzdnZ09WZjBrT2xCN2U0WgoyQW0rUnU1bmNiRXdBN0o0L2N0WWlLdVByLzA4U0NjTnp6ZGp6RG9qem5wL1ZadnRiYXo5NGlVOE52YmRyWXBkCkVnb1o1RVk3YzZpQW9JNDlGK2ZNOGZLR3FrL09oVDA0dUNuWk1SUFpFR0lob1dBR1J0ODg1R1VXcVNEdzJDYVAKT3F6eU5WeS8vMFpWQm40dTBER3VjQjVLVkp0Smh0MUNrRTlzeXJGV3IrSTFxTkltMkZoN3pyR1diSWRPL2gvMgpIOEFKY0xEM3QvdzNuZGUrdWl3dnFMbTVhUTcwS0k4Q2ZoZk5Mam9WcmUxTFMwK1ZxRjNlOEl6cXFtSEFQLytJCjk0aDFsOEMreVU5MHFxa3E4OFE5Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K",
"encryptionCertificateId": "my-id",
"expirationDateTime": "2025-11-27T14:00:00.0000000Z",
"clientState": "my-state"
}
' \
"https://graph.microsoft.com/v1.0/subscriptions"
}

ACCESS_TOKEN=$(request_token)
request_subscription

Notice the subscription is limited to up to 2 days in the future, and should be renewed to continue getting events.


The encryption certificate is the content of the cert_clean.txt file that we've created.


Decryption of the Messages

A simple bash script to handle the encrypted parts of the messages is below.


dataKey_base64=$(jq -r '.value[].encryptedContent.dataKey' event/event.json)
encrypted_data_base64=$(jq -r '.value[].encryptedContent.data' event/event.json)
dataSignature_base64=$(jq -r '.value[].encryptedContent.dataSignature' event/event.json)

# Decode the base64-encoded symmetric key
echo "$dataKey_base64" | base64 --decode > encrypted_key.bin

# Decrypt the symmetric key using your RSA private key with OAEP padding
openssl pkeyutl -decrypt -inkey key/private.key -pkeyopt rsa_padding_mode:oaep -in encrypted_key.bin -out symmetric_key.bin

# Extract first 16 bytes of symmetric key as IV (hex)
iv=$(xxd -p -l 16 symmetric_key.bin)

# Decode encrypted data
echo "$encrypted_data_base64" | base64 --decode > encrypted_data.bin

# Decrypt using AES-CBC with PKCS7 padding
openssl enc -aes-256-cbc -d -in encrypted_data.bin -out decrypted_data.json \
-K $(xxd -p -c 256 symmetric_key.bin) \
-iv "$iv"


Final Note

As expected for a Microsoft API this is a complicated method to get data. 


Why is double encryption of the messages required? We are already using TLS.

Why can't we subscribe forever?


Anyway, eventually it works and can be used to store the agents interactions. 
Have fun.

Monday, November 24, 2025

CI/CD in a Shell


 

Recently I had to create a CI/CD for a new project whose source repository was in bitbucket. There are standard methods to handle this, using triggers from bitbucket, AWS CodeBuild, and AWS CodePipeline. However, I had only read permissions on the bitbucket repository, and hence was limited in my ability to use the standard tools. I decided to create the CI/CD in bash, and surprisingly I found it extremely simple, as well as cheaper and faster than the standard tools. I am aware of the downsides of using scripts for such processes, such as lack of visibility, redundancy, and standards, but still the result was so good that I think startup projects should definitely consider it.

Listed below are the shell based CI/CD components.


The Poll Script

The poll script runs on a t3a.nano EC2 instance, whose price is ~$3/month.

It polls the bitbucket repository every 5 minutes, and once a change on the deployment-related branch is detected, it starts the builder EC2 VM and runs the build and deploy script.

#!/bin/bash

set -eE

instanceId=""
publicIp=""
intervalSeconds=300

cleanup() {
    if [ -n "${instanceId}" ]; then
        echo "Stopping instance: ${instanceId}"
        if ! aws ec2 stop-instances --instance-ids "${instanceId}"; then
            echo "Warning: Failed to stop instance ${instanceId}. Will retry on next run."
        else
            echo "Instance stopped successfully."
        fi
        instanceId=""
    fi
}

restart_script() {
    echo "Command '$BASH_COMMAND' failed with exit code $?"
    cleanup
    echo "Restarting soon..."
    sleep ${intervalSeconds}
    exec "$0" "$@"
}

trap 'restart_script "$@"' ERR

runBuild(){
    trap cleanup RETURN

    instanceId=$(aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=my-builder-vm" \
        --query "Reservations[*].Instances[*].InstanceId" \
        --output text)

    echo "Starting instance: ${instanceId}"
    aws ec2 start-instances --instance-ids ${instanceId}

    echo "Waiting for instance to be in 'running' state..."
    aws ec2 wait instance-running --instance-ids ${instanceId}

    publicIp=$(aws ec2 describe-instances \
        --instance-ids ${instanceId} \
        --query "Reservations[0].Instances[0].PublicIpAddress" \
        --output text)

    echo "Running build remote"
    ssh -o StrictHostKeyChecking=no ec2-user@${publicIp} /home/ec2-user/build/my-repo/deploy/aws/production/deploy.sh

    cleanup
    echo "Build done"
}

checkOnce(){
    echo "Check run time: $(date)"
    commitFilePath=/home/ec2-user/build/last_commit.txt
    latestCommit=$(git ls-remote git@bitbucket.org:my-project/my-repo.git my-deploy-branch | awk '{print $1}')
    echo "Latest commit: ${latestCommit}"

    lastCommit=$(cat ${commitFilePath} 2>/dev/null || echo "")
    echo "Last deployed: ${lastCommit}"

    if [ "${latestCommit}" != "${lastCommit}" ]; then
        echo "New commit detected, starting build"
        runBuild
        echo "${latestCommit}" > ${commitFilePath}
        echo "last commit updated"
    else
        echo "No new commits"
    fi
}

while true; do
    checkOnce
    sleep ${intervalSeconds}
done


To make this script part of the poller VM instance startup, use the following:


sudo tee /etc/systemd/system/poll.service > /dev/null <<EOF
[Unit]
Description=Poll Script Startup
After=network.target

[Service]
Type=simple
ExecStart=/home/ec2-user/build/poll.sh
Restart=on-failure
User=ec2-user
WorkingDirectory=/home/ec2-user/build
StandardOutput=append:/home/ec2-user/build/output.txt
StandardError=append:/home/ec2-user/build/output.txt

[Install]
WantedBy=multi-user.target
EOF


sudo systemctl daemon-reload
sudo systemctl enable poll.service # auto-start on boot
sudo systemctl start poll.service # start immediately


The Build Script - Step 1

The build script runs on a c6i.4xlarge EC2 instance, whose price is ~$500/month, but I don't care, since this EC2 instance runs only during the deployment itself, so the actual cost is very low as well.


The script runs in the repository clone itself, which I manually cloned once after the EC2 creation. It only pulls the latest version and runs another "step 2" script to handle the build. The goal is to be able to pick up changes to the "step 2" script as part of the git pull.


#!/bin/bash
set -e

cd /home/ec2-user/build/my-repo
git checkout my-deploy-branch
git pull

./deploy_step2.sh


The Build Script - Step 2

The "step 2" script does the actual work: 

  1. Increments the build number
  2. Builds the docker images
  3. Logs in to the ECR
  4. Pushes the images to ECR
  5. Pushes a new tag to the GIT repository
  6. Uses `helm upgrade` to upgrade the production deployment


Notice that the EC2 uses a role that enables it to access the ECR and the EKS without user and password, for example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:CompleteLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:InitiateLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster"
      ],
      "Resource": "*"
    }
  ]
}


The script is:

#!/bin/bash
set -e

export AWS_ACCOUNT=123456789012
export AWS_REGION=us-east-1
export AWS_DEFAULT_REGION=${AWS_REGION}
export EKS_CLUSTER_NAME=my-eks

rootFolder=/home/ec2-user/build
buildVersionFile=${rootFolder}/build_number.txt

if [[ -f "${buildVersionFile}" ]]; then
    lastBuildNumber=$(cat "${buildVersionFile}")
else
    lastBuildNumber=1000
fi
newBuildNumber=$((lastBuildNumber + 1))
echo "${newBuildNumber}" > ${buildVersionFile}
echo "Build number updated to: ${newBuildNumber}"

./build_my_images.sh

aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com
RemoteTag=deploy-${newBuildNumber} ./push_images_to_ec2.sh

newTag=deploy-${newBuildNumber}
git tag ${newTag}
git push origin ${newTag}

DEPLOY_VERSION=":${newTag}" ./helm_deploy.sh

echo "I did it again!"


Final Note

This build system is super fast. Why? Because it uses local cache for the docker images. This means we do not require a docker proxy to cache the images, which also makes it cheap.

To sum up: don't use this for a big project, but you can definitely use it for startups.