Sunday, June 29, 2025

Error Backoff


Recently I encountered an issue where a task tried to start and load some startup configuration, but the configuration did not exist. The task kept retrying every second, adding system load and filling the logs with errors. To mitigate this I added a back-off mechanism that exponentially increases the delay between retries. This is a common practice for such tasks.

The implementation is below.



type Backoff struct {
    nowTime   kittime.NowTime
    minPeriod time.Duration
    maxPeriod time.Duration

    currentPeriod time.Duration
    lastFailure   *time.Time
}

func ProduceBackoff(
    nowTime kittime.NowTime,
    minPeriod time.Duration,
    maxPeriod time.Duration,
) *Backoff {
    return &Backoff{
        nowTime:       nowTime,
        minPeriod:     minPeriod,
        maxPeriod:     maxPeriod,
        currentPeriod: minPeriod,
    }
}

// MarkSuccess resets the back-off to its minimal period.
func (b *Backoff) MarkSuccess() {
    b.lastFailure = nil
    b.currentPeriod = b.minPeriod
}

// MarkFail doubles the back-off period, up to the maximal period.
// The first failure after a success keeps the minimal period.
func (b *Backoff) MarkFail() {
    if b.lastFailure == nil {
        b.currentPeriod = b.minPeriod
    } else {
        b.currentPeriod = 2 * b.currentPeriod
        if b.currentPeriod > b.maxPeriod {
            b.currentPeriod = b.maxPeriod
        }
    }
    b.lastFailure = b.nowTime.NowPointer()
}

// CanRun reports whether the back-off period since the last failure has passed.
func (b *Backoff) CanRun() bool {
    if b.lastFailure == nil {
        return true
    }

    passedDuration := b.nowTime.NowPointer().Sub(*b.lastFailure)
    return passedDuration >= b.currentPeriod
}
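

In the actual task, the back-off wraps the retry loop. Below is a minimal usage sketch; loadStartupConfiguration is a hypothetical stand-in for the real loading code, and kittime.NowTime is the clock abstraction used above:


func loadConfigurationLoop(nowTime kittime.NowTime) {
    backoff := ProduceBackoff(nowTime, time.Second, time.Minute)
    for range time.Tick(time.Second) {
        if !backoff.CanRun() {
            continue // still within the back-off period, skip this tick
        }
        if err := loadStartupConfiguration(); err != nil {
            backoff.MarkFail() // the next attempt is delayed exponentially
            continue
        }
        backoff.MarkSuccess()
        return
    }
}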


An example usage is shown in the short unit test below.



func (t *Test) check() {
    t.backoff = backoff.ProduceBackoff(t.NowTime, time.Second, 10*time.Second)

    for range 3 {
        t.checkBackoffLoop()
        t.Log("success")
        t.backoff.MarkSuccess()
    }
}

func (t *Test) checkBackoffLoop() {
    t.Log("failure #1")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(time.Second)

    t.Log("failure #2")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(2 * time.Second)

    t.Log("failure #3")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(4 * time.Second)

    t.Log("failure #4")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(8 * time.Second)

    t.Log("failure #5")
    for range 5 {
        t.Assert(t.backoff.CanRun())
        t.backoff.MarkFail()
        t.verifyCannotRunUntil(10 * time.Second)
    }
}

func (t *Test) verifyCannotRunUntil(
    period time.Duration,
) {
    seconds := int(period.Seconds())
    for range seconds {
        t.AssertFalse(t.backoff.CanRun())
        t.NowTime.IncrementFakeTime(time.Second)
    }

    t.Assert(t.backoff.CanRun())
}


Final Note

After adding the back-off mechanism, our logs are cleaner and the system load is lower. Incidentally, a similar mechanism is used by Kubernetes for the pod crash back-off (CrashLoopBackOff).

Sunday, June 22, 2025

Adding GPU node to AWS EKS


This post lists the steps to add a GPU node to an existing Kubernetes cluster on AWS EKS.


We start by creating a new GPU-based nodegroup, specifying a GPU instance type.


echo "create GPU nodegroup"

eksctl create nodegroup --cluster ${EKS_CLUSTER_NAME} \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --managed


Next we want to prevent any pod from being scheduled on this node unless it explicitly adds a toleration (an example pod appears at the end of this post).


echo "taint the GPU node"

NODE=$(kubectl get nodes -l alpha.eksctl.io/nodegroup-name=gpu-nodes -ojsonpath='{.items[*].metadata.name}')
kubectl taint node ${NODE} gpu-workload=true:NoSchedule


To use the GPU, drivers must be installed on the host. This is handled by the NVIDIA GPU operator.


echo "install NVIDIA GPU operator"

cat > operator-values.yaml <<EOF
tolerations:
  - key: "gpu-workload"
    value: "true"
    effect: "NoSchedule"
EOF

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator -f operator-values.yaml
rm operator-values.yaml
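

The operator daemonsets can take a few minutes to come up. A quick way to watch them (assuming the default namespace, since the install above does not specify one):


echo "wait for the GPU operator pods"

kubectl get pods | grep -iE 'nvidia|gpu-operator'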


Finally, we can see the node is ready to serve GPU requirements.

echo "print nodes with GPU"

kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia
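

As a final check, a test pod that tolerates the taint and requests a GPU should get scheduled on the new node and print the GPU details. This is a sketch; the pod name and CUDA image tag are placeholders:


echo "run a GPU test pod"

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "gpu-workload"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# once the pod completes:
kubectl logs gpu-test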





Saturday, June 14, 2025

AWS Batch for GPU Workloads




The Requirement

Recently I had a requirement to run workloads on GPUs.
I wanted to use AWS Batch jobs and not a static EC2 instance, for two reasons:

Reason 1

The workloads are sporadic: we might need to run them 10 times one day, and then not at all for a week. I did not want to manage starting and stopping a GPU-based EC2 instance, and I did not want to leave an expensive EC2 instance running idle.

Reason 2

In some cases, I want to run 100 workloads in parallel. Using EC2 would mean either running them sequentially or managing multiple EC2 instances.



Some History


So I started checking AWS Batch for GPUs. I've already used AWS Batch in several projects, see for example:



Hence I assumed it would be quick and easy. 



AWS Partial & Non-Friendly Support

Surprisingly, I found that AWS Batch support for GPUs is partial and problematic.

Lack of documentation

The documentation for AWS Batch GPU support is scarce. This is pretty weird, since GPU is THE hottest topic today, and I would expect it to be more detailed. I ended up collecting information from ChatGPT as well as some external sites.

Compute Environment

The simplest mode of AWS Batch is based on Fargate, which includes automatic management of the compute environment. This mode is not supported for GPU-based containers; the only supported modes are EC2-based and Kubernetes-based. So I went with the EC2-based mode, which does use EC2 instances, but the instances are automatically started and stopped on demand.

Multi Containers

I wanted to have two containers in each AWS Batch job: a management container that I wanted running on CPU, and an LLM models container running on GPU. Unfortunately, while I was able to set up this configuration, when starting the AWS job it failed to move from the starting status to the runnable status. It looks like something in AWS cannot support multiple containers on mixed CPU/GPU. I ended up combining the two containers into a single container running two processes, which is a bad practice, but the best I could do (a sketch follows).
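
For illustration, the combined container boils down to an entrypoint of the following shape; the paths and process names are placeholders, not the actual files:


#!/bin/sh
# Hypothetical entrypoint combining the two processes in one container:
# start the GPU model server in the background, then run the management
# process in the foreground so the container lives as long as it does.
/app/model-server &
exec /app/manager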

AWS Console Issues

I made all the configurations in the AWS console GUI and started the AWS Batch job, but then I got an error that the ECS configuration cannot be used.
What?? I was using an EC2-based configuration, why do I get an error about ECS??
Eventually I configured AWS Batch using AWS CloudFormation, and it worked fine. I think this is some kind of bug in the AWS console GUI.

Guessing your problems

I finally made it to starting an AWS Batch job, but then the job got stuck in the runnable status and did not proceed to the running status. AWS provides zero assistance for such issues; I really had to dig into this by guessing the possible problems. Eventually I found that the EC2 instance automatically started by AWS Batch was unable to register itself with the AWS Batch compute environment, since it did not have a public IPv4 address allocated. I updated the configuration to enable public IPv4 allocation, and it worked.



Final Results

Eventually it worked. The solution includes the following:

CloudFormation Stack

AWSTemplateFormatVersion: '2010-09-09'

Resources:

  ECSBatchJobRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ECSBatchJobRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: BatchInstanceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: BatchInstanceProfile
      Roles:
        - Ref: BatchInstanceRole

  BatchLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: batch-launch-template
      LaunchTemplateData:
        KeyName: alon
        NetworkInterfaces:
          - DeviceIndex: 0
            AssociatePublicIpAddress: true
            SubnetId: subnet-12345678912345678
            Groups:
              - sg-123456789aaaaaaaaa
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 200
              VolumeType: gp3
              DeleteOnTermination: true
        IamInstanceProfile:
          Name: !Ref BatchInstanceProfile

  GpuComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeEnvironmentName: GpuEC2ComputeEnvironment
      ComputeResources:
        Type: EC2
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        MinvCpus: 0
        MaxvCpus: 16
        DesiredvCpus: 0
        InstanceTypes:
          - g4dn.2xlarge
        Subnets:
          - subnet-12345678912345678
        InstanceRole: !Ref BatchInstanceProfile
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplate
          Version: !GetAtt BatchLaunchTemplate.LatestVersionNumber
      ServiceRole: !GetAtt BatchServiceRole.Arn
      State: ENABLED

  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AWSBatchServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: batch.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

  GpuJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: GpuJobQueue
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref GpuComputeEnvironment

  GpuJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: GpuJobDefinition
      Type: container
      PlatformCapabilities:
        - EC2
      ContainerProperties:
        Image: XXX.dkr.ecr.us-east-1.amazonaws.com/YYY:latest
        Vcpus: 1
        Memory: 4096
        ResourceRequirements:
          - Type: GPU
            Value: "1"
        Command:
          - nvidia-smi
        JobRoleArn: !GetAtt ECSBatchJobRole.Arn

and deploy it using the AWS CLI:

aws cloudformation deploy \
  --template-file batch_cloud_formation.yaml \
  --stack-name gpu-batch-setup \
  --capabilities CAPABILITY_NAMED_IAM
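
Before wiring it into code, the setup can be smoke-tested from the CLI, using the queue and job definition names created by the stack above:

aws batch submit-job \
  --job-name gpu-smoke-test \
  --job-queue GpuJobQueue \
  --job-definition GpuJobDefinition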

Start the AWS Batch Job from GoLang

batchClient := batch.NewFromConfig(*awsSession.GetSession())

input := batch.SubmitJobInput{
    JobName:       aws.String(jobName),
    JobQueue:      aws.String(jobQueue),
    JobDefinition: aws.String(jobDefinition),
    ContainerOverrides: &types.ContainerOverrides{
        Command:     containerConfig.ContainerCommandLine,
        Environment: b.createContainerEnvironment(containerConfig.ContainerEnvironment),
    },
}

_, err := batchClient.SubmitJob(context.Background(), &input)
kiterr.RaiseIfError(err)
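

Since jobs can get stuck in the RUNNABLE status (as described above), it is also useful to poll the job status and status reason from code. The following is a sketch using the same aws-sdk-go-v2 Batch client; waitForJob is a hypothetical helper, and jobID is the JobId returned by SubmitJob (which the snippet above discards):


import (
    "context"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/batch"
    "github.com/aws/aws-sdk-go-v2/service/batch/types"
)

// waitForJob polls a submitted job until it completes.
func waitForJob(ctx context.Context, client *batch.Client, jobID string) error {
    for {
        out, err := client.DescribeJobs(ctx, &batch.DescribeJobsInput{
            Jobs: []string{jobID},
        })
        if err != nil {
            return err
        }
        if len(out.Jobs) == 0 {
            return fmt.Errorf("job %s not found", jobID)
        }
        job := out.Jobs[0]
        // StatusReason is the first place to look when a job is stuck in RUNNABLE.
        fmt.Printf("status: %s, reason: %s\n", job.Status, aws.ToString(job.StatusReason))
        switch job.Status {
        case types.JobStatusSucceeded:
            return nil
        case types.JobStatusFailed:
            return fmt.Errorf("job failed: %s", aws.ToString(job.StatusReason))
        }
        time.Sleep(10 * time.Second)
    }
}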



Final Note

Running workloads using AWS Batch is great, but GPU-based Batch is still a bit immature. Even after going through these issues, it is still one of the best methods to accomplish this.
I believe AWS will provide better support for this in the future.

Monday, June 9, 2025

Python Reflex

In this post we review the Python Reflex framework.


The Reflex framework is yet another attempt at creating a GUI without writing JavaScript, and without the need to create server-side APIs for each operation. We've already seen some such attempts, see for example: Streamlit. Reflex generates HTML and JavaScript with React components, and hence can even be deployed as a production solution.


Small Attempt At A Demo

To check this framework, I created a small application with a login page and the increment/decrement example from the Reflex getting started document.

To start, we install the reflex library, create a template for our application, and run it. The updates should be done in the ref/ref.py file, and the framework includes hot-reload support, so there is no need to restart after every code change.


pip install reflex
reflex init
reflex run


The ref/ref.py file is the following:


import reflex as rx


class State(rx.State):
    count: int = 0
    user_name: str = ''
    password: str = ''
    user_id: int = 0

    def login(self):
        print(self.user_name)
        print(self.password)
        self.user_id = 42
        return rx.redirect("/")

    def verify_logged_in(self):
        if self.user_id == 0:
            return rx.redirect("/login")

    def set_user(self, value):
        self.user_name = value

    def set_password(self, value):
        self.password = value

    def increment(self):
        self.count += 1

    def decrement(self):
        self.count -= 1


@rx.page()
def login() -> rx.Component:
    return rx.vstack(
        rx.input(
            State.user_name,
            placeholder="Enter User Name",
            on_change=State.set_user
        ),
        rx.input(
            State.password,
            placeholder="Enter Password",
            on_change=State.set_password,
        ),
        rx.button(
            "Login",
            on_click=State.login
        )
    )


@rx.page(on_load=State.verify_logged_in)
def index() -> rx.Component:
    return rx.container(
        rx.color_mode.button(position="top-right"),
        rx.vstack(
            rx.heading(f"Welcome user {State.user_id}", size="9"),
            rx.hstack(
                rx.button(
                    "Decrement",
                    color_scheme="ruby",
                    on_click=State.decrement,
                ),
                rx.button(
                    "Increment",
                    color_scheme="grass",
                    on_click=State.increment,
                ),
            ),
            rx.heading(State.count, font_size="2em"),
            spacing="5",
            justify="center",
            min_height="85vh",
        ),
    )


app = rx.App()
app.add_page(index)
app.add_page(login)


And it works like a charm (almost).


Notice that the state is managed on the server side and is updated over a websocket that carries a message for every state update. This is a very nice idea that reduces coding requirements; however, the load on the server would be very high for a multi-user application.

The state is actually saved per tab, so each tab keeps its own session with the server; hence even after I had logged in on one tab, I needed to re-login on another tab.


Final Notes

As always, attempts to save work on the JavaScript/React side and on the server-side APIs are nice for development stages, but are bad for production in both the performance aspect and the development aspect.

While JavaScript is an unstructured, non-compiled, error-prone language, it is still the best alternative for production, supplying the performance and flexibility required for a GUI application.

In addition, Reflex claims to remove the need to learn a new language, but you will need to invest a lot of time to understand the Reflex components and how to use them correctly. When you have a problem, Reflex might show a useful error, but in many cases it shows a non-user-friendly one, and good luck finding assistance for such an error in the small Reflex community.

It is a nice tool for a single developer who just wants a quick UI for his own solution engine, but don't expect anything more from Reflex.