Sunday, June 29, 2025

Error Backoff


Recently I encountered an issue where a task tried to start and load some startup configuration, but the configuration did not exist. The task kept retrying every second, adding system load and filling the logs with errors. To mitigate this I added a back-off mechanism that exponentially increases the delay between retries. This is a common practice for such tasks.

The implementation is below.



type Backoff struct {
    nowTime   kittime.NowTime
    minPeriod time.Duration
    maxPeriod time.Duration

    currentPeriod time.Duration
    lastFailure   *time.Time
}

func ProduceBackoff(
    nowTime kittime.NowTime,
    minPeriod time.Duration,
    maxPeriod time.Duration,
) *Backoff {
    return &Backoff{
        nowTime:       nowTime,
        minPeriod:     minPeriod,
        maxPeriod:     maxPeriod,
        currentPeriod: minPeriod,
    }
}

// MarkSuccess resets the back-off to its minimal period.
func (b *Backoff) MarkSuccess() {
    b.lastFailure = nil
    b.currentPeriod = b.minPeriod
}

// MarkFail doubles the back-off period, up to the maximal period.
// The first failure after a success keeps the minimal period.
func (b *Backoff) MarkFail() {
    if b.lastFailure == nil {
        b.currentPeriod = b.minPeriod
    } else {
        b.currentPeriod = 2 * b.currentPeriod
        if b.currentPeriod > b.maxPeriod {
            b.currentPeriod = b.maxPeriod
        }
    }
    b.lastFailure = b.nowTime.NowPointer()
}

// CanRun reports whether the back-off period since the last failure has passed.
func (b *Backoff) CanRun() bool {
    if b.lastFailure == nil {
        return true
    }

    passedDuration := b.nowTime.NowPointer().Sub(*b.lastFailure)
    return passedDuration >= b.currentPeriod
}
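

In the actual task, the back-off wraps the retry loop. Below is a minimal usage sketch; loadStartupConfiguration is a hypothetical stand-in for the real loading code, and kittime.NowTime is the clock abstraction used above:


func loadConfigurationLoop(nowTime kittime.NowTime) {
    backoff := ProduceBackoff(nowTime, time.Second, time.Minute)
    for range time.Tick(time.Second) {
        if !backoff.CanRun() {
            continue // still within the back-off period, skip this tick
        }
        if err := loadStartupConfiguration(); err != nil {
            backoff.MarkFail() // the next attempt is delayed exponentially
            continue
        }
        backoff.MarkSuccess()
        return
    }
}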


An example usage is shown in the short unit test below.



func (t *Test) check() {
    t.backoff = backoff.ProduceBackoff(t.NowTime, time.Second, 10*time.Second)

    for range 3 {
        t.checkBackoffLoop()
        t.Log("success")
        t.backoff.MarkSuccess()
    }
}

func (t *Test) checkBackoffLoop() {
    t.Log("failure #1")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(time.Second)

    t.Log("failure #2")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(2 * time.Second)

    t.Log("failure #3")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(4 * time.Second)

    t.Log("failure #4")
    t.Assert(t.backoff.CanRun())
    t.backoff.MarkFail()
    t.verifyCannotRunUntil(8 * time.Second)

    t.Log("failure #5")
    for range 5 {
        t.Assert(t.backoff.CanRun())
        t.backoff.MarkFail()
        t.verifyCannotRunUntil(10 * time.Second)
    }
}

func (t *Test) verifyCannotRunUntil(
    period time.Duration,
) {
    seconds := int(period.Seconds())
    for range seconds {
        t.AssertFalse(t.backoff.CanRun())
        t.NowTime.IncrementFakeTime(time.Second)
    }

    t.Assert(t.backoff.CanRun())
}


Final Note

After adding the back-off mechanism, our logs are cleaner and the system load is lower. Incidentally, a similar mechanism is used by Kubernetes for the pod crash back-off (CrashLoopBackOff).

Sunday, June 22, 2025

Adding GPU node to AWS EKS


This post lists the steps to add a GPU node to an existing Kubernetes cluster on AWS EKS.


We start by creating a new GPU-based nodegroup, specifying a GPU instance type.


echo "create GPU nodegroup"

eksctl create nodegroup --cluster ${EKS_CLUSTER_NAME} \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --managed


Next we want to prevent any pod from being scheduled on this node unless it explicitly adds a toleration (an example pod appears at the end of this post).


echo "taint the GPU node"

NODE=$(kubectl get nodes -l alpha.eksctl.io/nodegroup-name=gpu-nodes -ojsonpath='{.items[*].metadata.name}')
kubectl taint node ${NODE} gpu-workload=true:NoSchedule


To use the GPU, drivers must be installed on the host. This is handled by the NVIDIA GPU operator.


echo "install NVIDIA GPU operator"

cat > operator-values.yaml <<EOF
tolerations:
  - key: "gpu-workload"
    value: "true"
    effect: "NoSchedule"
EOF

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator -f operator-values.yaml
rm operator-values.yaml
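

The operator daemonsets can take a few minutes to come up. A quick way to watch them (assuming the default namespace, since the install above does not specify one):


echo "wait for the GPU operator pods"

kubectl get pods | grep -iE 'nvidia|gpu-operator'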


Finally, we can see the node is ready to serve GPU requirements.

echo "print nodes with GPU"

kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia
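

As a final check, a test pod that tolerates the taint and requests a GPU should get scheduled on the new node and print the GPU details. This is a sketch; the pod name and CUDA image tag are placeholders:


echo "run a GPU test pod"

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "gpu-workload"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# once the pod completes:
kubectl logs gpu-test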





Saturday, June 14, 2025

AWS Batch for GPU Workloads




The Requirement

Recently I had a requirement to run workloads on GPUs.
I wanted to use AWS Batch jobs and not a static EC2 instance, for two reasons:

Reason 1

The workloads are sporadic: we might need to run them 10 times one day, and then not at all for a week. I did not want to manage starting and stopping a GPU-based EC2 instance, and I did not want to leave an expensive EC2 instance running idle.

Reason 2

In some cases, I want to run 100 workloads in parallel. Using EC2 would mean either running them sequentially or managing multiple EC2 instances.



Some History


So I started checking AWS Batch for GPUs. I've already used AWS Batch in several projects, see for example:



Hence I assumed it would be quick and easy. 



AWS Partial & Non-Friendly Support

Surprisingly, I found that AWS Batch support for GPUs is partial and problematic.

Lack of documentation

The documentation for AWS Batch GPU support is scarce. This is pretty weird, since GPU is THE hottest topic today, and I would expect it to be more detailed. I ended up collecting information from ChatGPT as well as some external sites.

Compute Environment

The simplest mode of AWS Batch is based on Fargate, which includes automatic management of the compute environment. This mode is not supported for GPU-based containers; the only supported modes are EC2-based and Kubernetes-based. So I went with the EC2-based mode, which does use EC2 instances, but the instances are automatically started and stopped on demand.

Multi Containers

I wanted to have two containers in each AWS Batch job: a management container that I wanted running on CPU, and an LLM models container running on GPU. Unfortunately, while I was able to set up this configuration, when starting the AWS job it failed to move from the starting status to the runnable status. It looks like something in AWS cannot support multiple containers on mixed CPU/GPU. I ended up combining the two containers into a single container running two processes, which is a bad practice, but the best I could do (a sketch follows).
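
For illustration, the combined container boils down to an entrypoint of the following shape; the paths and process names are placeholders, not the actual files:


#!/bin/sh
# Hypothetical entrypoint combining the two processes in one container:
# start the GPU model server in the background, then run the management
# process in the foreground so the container lives as long as it does.
/app/model-server &
exec /app/manager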

AWS Console Issues

I made all the configurations in the AWS console GUI and started the AWS Batch job, but then I got an error that the ECS configuration cannot be used.
What?? I was using an EC2-based configuration, why do I get an error about ECS??
Eventually I configured AWS Batch using AWS CloudFormation, and it worked fine. I think this is some kind of bug in the AWS console GUI.

Guessing your problems

I finally made it to starting an AWS Batch job, but then the job got stuck in the runnable status and did not proceed to the running status. AWS provides zero assistance for such issues; I really had to dig into this by guessing the possible problems. Eventually I found that the EC2 instance automatically started by AWS Batch was unable to register itself with the AWS Batch compute environment, since it did not have a public IPv4 address allocated. I updated the configuration to enable public IPv4 allocation, and it worked.



Final Results

Eventually it worked. The solution includes the following:

CloudFormation Stack

AWSTemplateFormatVersion: '2010-09-09'

Resources:

  ECSBatchJobRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ECSBatchJobRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: BatchInstanceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: BatchInstanceProfile
      Roles:
        - Ref: BatchInstanceRole

  BatchLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: batch-launch-template
      LaunchTemplateData:
        KeyName: alon
        NetworkInterfaces:
          - DeviceIndex: 0
            AssociatePublicIpAddress: true
            SubnetId: subnet-12345678912345678
            Groups:
              - sg-123456789aaaaaaaaa
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 200
              VolumeType: gp3
              DeleteOnTermination: true
        IamInstanceProfile:
          Name: !Ref BatchInstanceProfile

  GpuComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeEnvironmentName: GpuEC2ComputeEnvironment
      ComputeResources:
        Type: EC2
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        MinvCpus: 0
        MaxvCpus: 16
        DesiredvCpus: 0
        InstanceTypes:
          - g4dn.2xlarge
        Subnets:
          - subnet-12345678912345678
        InstanceRole: !Ref BatchInstanceProfile
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplate
          Version: !GetAtt BatchLaunchTemplate.LatestVersionNumber
      ServiceRole: !GetAtt BatchServiceRole.Arn
      State: ENABLED

  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AWSBatchServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: batch.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

  GpuJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: GpuJobQueue
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref GpuComputeEnvironment

  GpuJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: GpuJobDefinition
      Type: container
      PlatformCapabilities:
        - EC2
      ContainerProperties:
        Image: XXX.dkr.ecr.us-east-1.amazonaws.com/YYY:latest
        Vcpus: 1
        Memory: 4096
        ResourceRequirements:
          - Type: GPU
            Value: "1"
        Command:
          - nvidia-smi
        JobRoleArn: !GetAtt ECSBatchJobRole.Arn

and deploy it using the AWS CLI:

aws cloudformation deploy \
  --template-file batch_cloud_formation.yaml \
  --stack-name gpu-batch-setup \
  --capabilities CAPABILITY_NAMED_IAM
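
Before wiring it into code, the setup can be smoke-tested from the CLI, using the queue and job definition names created by the stack above:

aws batch submit-job \
  --job-name gpu-smoke-test \
  --job-queue GpuJobQueue \
  --job-definition GpuJobDefinition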

Start the AWS Batch Job from GoLang

batchClient := batch.NewFromConfig(*awsSession.GetSession())

input := batch.SubmitJobInput{
    JobName:       aws.String(jobName),
    JobQueue:      aws.String(jobQueue),
    JobDefinition: aws.String(jobDefinition),
    ContainerOverrides: &types.ContainerOverrides{
        Command:     containerConfig.ContainerCommandLine,
        Environment: b.createContainerEnvironment(containerConfig.ContainerEnvironment),
    },
}

_, err := batchClient.SubmitJob(context.Background(), &input)
kiterr.RaiseIfError(err)
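

Since jobs can get stuck in the RUNNABLE status (as described above), it is also useful to poll the job status and status reason from code. The following is a sketch using the same aws-sdk-go-v2 Batch client; waitForJob is a hypothetical helper, and jobID is the JobId returned by SubmitJob (which the snippet above discards):


import (
    "context"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/batch"
    "github.com/aws/aws-sdk-go-v2/service/batch/types"
)

// waitForJob polls a submitted job until it completes.
func waitForJob(ctx context.Context, client *batch.Client, jobID string) error {
    for {
        out, err := client.DescribeJobs(ctx, &batch.DescribeJobsInput{
            Jobs: []string{jobID},
        })
        if err != nil {
            return err
        }
        if len(out.Jobs) == 0 {
            return fmt.Errorf("job %s not found", jobID)
        }
        job := out.Jobs[0]
        // StatusReason is the first place to look when a job is stuck in RUNNABLE.
        fmt.Printf("status: %s, reason: %s\n", job.Status, aws.ToString(job.StatusReason))
        switch job.Status {
        case types.JobStatusSucceeded:
            return nil
        case types.JobStatusFailed:
            return fmt.Errorf("job failed: %s", aws.ToString(job.StatusReason))
        }
        time.Sleep(10 * time.Second)
    }
}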



Final Note

Running workloads using AWS Batch is great, but GPU-based Batch is still a bit immature. Even after going through these issues, it is still one of the best methods to accomplish this.
I believe AWS will provide better support for this in the future.

Monday, June 9, 2025

Python Reflex

In this post we review the Python Reflex framework.


The Reflex framework is yet another attempt at creating a GUI without writing JavaScript, and without the need to create server-side APIs for each operation. We've already seen some such attempts, see for example: Streamlit. Reflex generates HTML and JavaScript with React components, and hence can even be deployed as a production solution.


Small Attempt At A Demo

To check this framework, I created a small application with a login page and the increment/decrement example from the Reflex getting started document.

To start, we install the reflex library, create a template for our application, and run it. The updates should be done in the ref/ref.py file, and the framework includes hot-reload support, so there is no need to restart after every code change.


pip install reflex
reflex init
reflex run


The ref/ref.py file is the following:


import reflex as rx


class State(rx.State):
    count: int = 0
    user_name: str = ''
    password: str = ''
    user_id: int = 0

    def login(self):
        print(self.user_name)
        print(self.password)
        self.user_id = 42
        return rx.redirect("/")

    def verify_logged_in(self):
        if self.user_id == 0:
            return rx.redirect("/login")

    def set_user(self, value):
        self.user_name = value

    def set_password(self, value):
        self.password = value

    def increment(self):
        self.count += 1

    def decrement(self):
        self.count -= 1


@rx.page()
def login() -> rx.Component:
    return rx.vstack(
        rx.input(
            State.user_name,
            placeholder="Enter User Name",
            on_change=State.set_user
        ),
        rx.input(
            State.password,
            placeholder="Enter Password",
            on_change=State.set_password,
        ),
        rx.button(
            "Login",
            on_click=State.login
        )
    )


@rx.page(on_load=State.verify_logged_in)
def index() -> rx.Component:
    return rx.container(
        rx.color_mode.button(position="top-right"),
        rx.vstack(
            rx.heading(f"Welcome user {State.user_id}", size="9"),
            rx.hstack(
                rx.button(
                    "Decrement",
                    color_scheme="ruby",
                    on_click=State.decrement,
                ),
                rx.button(
                    "Increment",
                    color_scheme="grass",
                    on_click=State.increment,
                ),
            ),
            rx.heading(State.count, font_size="2em"),
            spacing="5",
            justify="center",
            min_height="85vh",
        ),
    )


app = rx.App()
app.add_page(index)
app.add_page(login)


And it works like a charm (almost).


Notice that the state is managed on the server side and is updated over a websocket that carries a message for every state update. This is a very nice idea that reduces coding requirements; however, the load on the server would be very high for a multi-user application.

The state is actually saved per tab, so each tab keeps its own session with the server; hence even after I had logged in on one tab, I needed to re-login on another tab.


Final Notes

As always, attempts to save work on the JavaScript/React side and on the server-side APIs are nice for development stages, but are bad for production in both the performance aspect and the development aspect.

While JavaScript is an unstructured, non-compiled, error-prone language, it is still the best alternative for production, supplying the performance and flexibility required for a GUI application.

In addition, Reflex claims to remove the need to learn a new language, but you will need to invest a lot of time to understand the Reflex components and how to use them correctly. When you have a problem, Reflex might show a useful error, but in many cases it shows a non-user-friendly one, and good luck finding assistance for such an error in the small Reflex community.

It is a nice tool for a single developer who just wants a quick UI for his own solution engine, but don't expect anything more from Reflex.