Saturday, September 24, 2022

Go Error Handling

 


In this post we will discuss Go error handling alternatives. Go provides several methods to handle errors. Let's examine these methods.


TL;DR

Don't return errors from your functions. Instead, use the panic and recover method.


The Panic Method

The basic method to handle errors is using the panic function, which is similar to throwing an exception.

Once panic is called, it stops the normal execution of the current goroutine and unwinds the stack up to the top level that handles it. By default, unless something recovers from the panic, the program terminates.


package main

import "os"

func main() {
    f1()
}

func f1() {
    _, err := os.ReadFile("myfile.txt")
    if err != nil {
        panic(err)
    }
}


And the output is:

panic: open myfile.txt: no such file or directory

goroutine 1 [running]:
main.f1(...)
        my-program/src/main.go:12
main.main()
        my-program/src/main.go:6 +0x45
 


However, this kind of code is usually found in example code, not in production code. In production code, we do not want our service to terminate when it encounters a problem. To handle this, we use the return error method.


The Return Error Method

The return error method is used by most of the Go code you will see, including the Go standard library. In this method, each function that might fail returns an error in addition to its other return values.


package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    for {
        err := f1()
        if err != nil {
            fmt.Printf("got error, but i will stay alive. error is: %v\n", err)
        }
        time.Sleep(time.Minute)
    }
}

func f1() error {
    _, err := os.ReadFile("myfile.txt")
    if err != nil {
        return err
    }
    return nil
}


And the output is:

got error, but i will stay alive. error is: open myfile.txt: no such file or directory


The problem in this case is that we lose the stack trace of the issue. In the case of a long call stack, getting a short text message like "division by zero" is useless unless we can somehow relate the error to the code location that encountered it. To solve this issue, we use wrapped errors.


The Wrapped Errors Method

Wrapping errors adds information to the error message that helps identify both the cause of the error and the code location.


package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    for {
        err := f1()
        if err != nil {
            fmt.Printf("got error, but i will stay alive. error is: %v\n", err)
        }
        time.Sleep(time.Minute)
    }
}

func f1() error {
    err := f2()
    if err != nil {
        return fmt.Errorf("handling f2 failed: %v", err)
    }
    return nil
}

func f2() error {
    _, err := os.ReadFile("myfile.txt")
    if err != nil {
        return fmt.Errorf("read configuration failed: %v", err)
    }
    return nil
}


In this method we add an explanation to the error whenever we return it, so the output is more verbose:


got error, but i will stay alive. error is: handling f2 failed: read configuration failed: open myfile.txt: no such file or directory

While this is the most common method to handle errors in Go, it has some issues. First, the code gets longer and less readable. Second, the verbose message is nice, but it carries less information than the full stack trace that we got using the panic method. To handle errors better, we should use the panic and recover method.


Panic and Recover Method

While seldom used, this is the best method to handle errors in Go. The code is short and readable, and the errors include the full information.

We start by adding three helper functions to handle errors.


// These helpers require "fmt", "runtime/debug", and "strings" in the import block.

func panicIfError(err error) {
    if err != nil {
        panic(err)
    }
}

func getNiceError(panicError any) error {
    // Trim the stack trace so it starts at the frame that called panic.
    stack := string(debug.Stack())
    index := strings.LastIndex(stack, "panic")
    if index != -1 {
        stack = stack[index:]
        index = strings.Index(stack, "\n")
        if index != -1 {
            stack = stack[index+1:]
        }
    }
    return fmt.Errorf("%v\n%v", panicError, stack)
}

func recoverError(recoveredError *error) {
    panicError := recover()
    if panicError != nil {
        *recoveredError = getNiceError(panicError)
    }
}


The panicIfError function, as its name implies, calls panic in case of an error. The recoverError function is used only in central locations in our code that are considered top-level handlers. An example of such a handler is a web server handler for an incoming request.
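
For illustration, a web server handler could use these helpers as follows. This is a minimal sketch, assuming the standard net/http package; helloHandler, handleHello and processRequest are hypothetical names, not part of the original code.


// Requires "log", "net/http", and "os" in the import block.

func helloHandler(w http.ResponseWriter, r *http.Request) {
    err := handleHello(r)
    if err != nil {
        log.Printf("request failed: %v", err)
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }
    w.Write([]byte("ok"))
}

// handleHello is a top-level handler: it recovers from any panic raised
// below it and converts it back to a returned error.
func handleHello(r *http.Request) (recoveredError error) {
    defer recoverError(&recoveredError)
    processRequest(r)
    return recoveredError
}

// processRequest stands in for the real business logic; any function in
// its call chain may simply call panicIfError.
func processRequest(r *http.Request) {
    _, err := os.ReadFile("myfile.txt")
    panicIfError(err)
}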

The code is now modified to the following.


func main() {
    for {
        err := f1()
        if err != nil {
            fmt.Printf("got error, but i will stay alive. error is: %v\n", err)
        }
        time.Sleep(time.Minute)
    }
}

func f1() (recoveredError error) {
    defer recoverError(&recoveredError)
    f2()
    return recoveredError
}

func f2() {
    f3()
}

func f3() {
    _, err := os.ReadFile("myfile.txt")
    panicIfError(err)
}


In this example f1 is considered a top-level handler, so it calls the recover helper (using defer), and also returns an error so it can be handled without terminating the service. Notice the simplicity of the functions f2 and f3, which represent 99% of our code: no error in the return values, and a simple flow without conditions. Not only that, we also get a full stack trace in case of error.


got error, but i will stay alive. error is: open myfile.txt: no such file or directory
        /home/alon/git/cto-proximity/images/analyzer/src/main.go:13 +0x65
main.f3()
        /home/alon/git/cto-proximity/images/analyzer/src/main.go:61 +0x4e
main.f2()
        /home/alon/git/cto-proximity/images/analyzer/src/main.go:56 +0x17
main.f1()
        /home/alon/git/cto-proximity/images/analyzer/src/main.go:51 +0x7a
main.main()
        /home/alon/git/cto-proximity/images/analyzer/src/main.go:41 +0x2f

This method simplifies our code and removes roughly 50% of the error handling lines, while providing more information in case of errors. The performance implications are negligible.


Final Note

We've added the panic and recover helper functions to our code, and modified the code to use them. We're now working only with the panic and recover method, and it is so simple and fun to use that we just cannot believe we used any other method before. I highly recommend doing the same.

Maybe Go libraries will also accept this idea in the future and will further simplify our code...













Monday, September 19, 2022

Performance Tuning for Redis based Microservices


 


In the last few years, I've been implementing Go based microservices that use Redis. Recently I did performance tuning for one of our microservices. The target cost for the service (running on GCP) was 1$ per month per customer that uses the service. In my initial check, it turned out the cost was near 500$ per month... After several days of design changes and adjustments I was able to reach the cost requirement. In this post I will share the steps and the code changes that made this possible.


TL;DR

Tuning Redis based microservices requires the following:

  • Break long values to multiple keys
  • Use in-memory cache that updates periodically to reduce reads from redis
  • Use in-memory cache and reuse existing keys to reduce updates to redis
  • Use pipeline to stream redis updates
  • Reduce the size of the data

Finding The Problems

The first step in performance tuning is being able to identify the problem. Otherwise you're shooting in the dark. As the microservices are running on a kubernetes cluster, I've used prometheus and grafana to check the memory and CPU consumption of redis and of the microservice.


In addition I've added some of my own metrics to the microservice.


Seeing the CPU and memory graphs in grafana provides a fast observation of the current bottleneck. In the case of a Go based microservice, using pprof provides a clear view of the code locations where most of the time is spent, as well as of the memory allocations.
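
As a reference, enabling pprof in a Go microservice can be as simple as the following sketch. It assumes only the standard library net/http/pprof package; the port is arbitrary.


package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose the profiling endpoints on an internal-only port. Profiles can
    // then be taken with, for example:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}


In a real service this listener would typically run in its own goroutine next to the main server.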


Breaking Long Values

One of the first things I noticed was a very long redis access time on read and update operations. I found that the microservice uses a redis HASH whose values are JSON objects, for example:

HSET customer-0001 item1 '{"property1":"value1","property2":"value2","property3":"value3","property4":"value4"}'


The actual JSON object had more than 20 fields, and the whole item JSON size was ~1K. The goal is to keep a list of items per customer. Usage of a redis HASH in this case is required, as the redis KEYS command is discouraged for production use, and the HASH provides a method to list the items.

This model makes any item update cost a read and a write of ~1K to redis, and redis responded with very slow response times.

The correct design for this issue is to use a redis SET to keep the list of item names, and to keep each item in its own redis HASH with a field per property. This enables both listing the items, and small, granular reads/updates of a single item property. For example:


SADD customer-0001 item1
HSET customer-0001-item1 property1 value1
HSET customer-0001-item1 property2 value2
HSET customer-0001-item1 property3 value3
HSET customer-0001-item1 property4 value4
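
In the Go code, using a redis client library such as go-redis (an assumption here; the post does not name the client), writing and listing items under this model might look roughly like this:


import (
    "context"

    "github.com/redis/go-redis/v9"
)

// writeItem registers the item name in the customer's SET and stores its
// properties as separate fields of a dedicated HASH, matching the commands above.
func writeItem(ctx context.Context, rdb *redis.Client, customer, item string, properties map[string]interface{}) error {
    if err := rdb.SAdd(ctx, customer, item).Err(); err != nil {
        return err
    }
    return rdb.HSet(ctx, customer+"-"+item, properties).Err()
}

// listItems returns the item names of a customer without using the KEYS command.
func listItems(ctx context.Context, rdb *redis.Client, customer string) ([]string, error) {
    return rdb.SMembers(ctx, customer).Result()
}


Updating a single property then becomes a small HSET on one field instead of rewriting a ~1K JSON value.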
    


In-Memory Read Cache


The microservice serves customer related requests. Each request reads the customer's items, and then decides what to do next. The request rate is high, which means a huge amount of requests constantly reading many keys/values from redis - thousands of reads every second. But then I asked: do we really need the most recent state of the data model in each request? What if we get a state that is several seconds old? It turns out that was just fine. Hence, I created a model of the customers' items in the microservice memory. This entire model is reloaded every few seconds, so we do pay the cost of reading all of the data, BUT each request no longer reads the state from redis. So instead of reading a customer's items hundreds of times every second, we read them only once every few seconds.
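
The following is a minimal sketch of such a periodically refreshed read cache. The load function, the simplified map model, and the refresh handling are illustrative assumptions, not the actual service code.


import (
    "context"
    "sync"
    "time"
)

// itemsCache keeps an in-memory snapshot of the customers' items and
// refreshes it in the background every few seconds.
type itemsCache struct {
    mutex    sync.RWMutex
    snapshot map[string][]string // customer -> item names (simplified model)
}

// newItemsCache builds the first snapshot and starts the refresh loop.
// load is a function that reads the full model from redis.
func newItemsCache(ctx context.Context, refreshEvery time.Duration, load func(context.Context) map[string][]string) *itemsCache {
    cache := &itemsCache{snapshot: load(ctx)}
    go func() {
        ticker := time.NewTicker(refreshEvery)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                // We pay for one full read of the model here, instead of
                // thousands of per-request reads from redis.
                fresh := load(ctx)
                cache.mutex.Lock()
                cache.snapshot = fresh
                cache.mutex.Unlock()
            }
        }
    }()
    return cache
}

// items serves a request from memory only - no redis round trip.
func (c *itemsCache) items(customer string) []string {
    c.mutex.RLock()
    defer c.mutex.RUnlock()
    return c.snapshot[customer]
}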


In-Memory Update Cache


Each of the customer related requests updates several counters using INCRBY and HINCRBY. These counter updates are not covered by the in-memory read cache, as we need an increment recorded for every request. But here too, do we really need the counters to be updated in real time, or can we delay them by several seconds? In this case we could delay them. So instead of issuing the redis increment commands for every request, I've created an in-memory key-value state which accumulates the increments and updates redis once every second. Notice that the performance gain here is high (a sketch of such an accumulator appears after the command examples below). Instead of the following commands:



HINCRBY customer-0001-item1 counter1 1
HINCRBY customer-0001-item1 counter2 1
HINCRBY customer-0001-item1 counter1 1
HINCRBY customer-0001-item1 counter2 1
HINCRBY customer-0001-item1 counter1 1
HINCRBY customer-0001-item1 counter2 1
    


We get aggregated commands:


HINCRBY customer-0001-item1 counter1 3
HINCRBY customer-0001-item1 counter2 3
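
A possible shape for such an accumulator is sketched below; the names and structure are illustrative, and the periodic flush that sends the accumulated deltas to redis is shown in the next section.


import "sync"

// counterCache accumulates counter increments in memory; a background
// goroutine flushes them to redis once a second (see the pipeline sketch below).
type counterCache struct {
    mutex  sync.Mutex
    deltas map[string]map[string]int64 // redis key -> hash field -> accumulated delta
}

func newCounterCache() *counterCache {
    return &counterCache{deltas: map[string]map[string]int64{}}
}

// add records an increment in memory instead of sending HINCRBY immediately.
func (c *counterCache) add(key, field string, delta int64) {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    if c.deltas[key] == nil {
        c.deltas[key] = map[string]int64{}
    }
    c.deltas[key][field] += delta
}

// drain returns the accumulated deltas and resets the cache.
func (c *counterCache) drain() map[string]map[string]int64 {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    drained := c.deltas
    c.deltas = map[string]map[string]int64{}
    return drained
}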


Use Pipeline


The next step is a result of the in-memory update cache. We already accumulate many commands in the cache, so why wait for the redis reply to each command? Instead of running the commands one after the other, I've changed the in-memory update cache to use the redis pipeline. This reduces the latency of updates, as the microservice no longer waits for the server reply to each command before sending the next one.
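
Continuing the accumulator sketch above, the flush loop could use a pipeline like this (again assuming the go-redis client; the one-second interval matches the description above, the rest is illustrative):


import (
    "context"
    "log"
    "time"

    "github.com/redis/go-redis/v9"
)

// flushLoop drains the accumulated counters once a second and sends all the
// HINCRBY commands in a single pipeline, without waiting for a reply between
// commands.
func flushLoop(ctx context.Context, rdb *redis.Client, counters *counterCache) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            pipe := rdb.Pipeline()
            for key, fields := range counters.drain() {
                for field, delta := range fields {
                    pipe.HIncrBy(ctx, key, field, delta)
                }
            }
            if _, err := pipe.Exec(ctx); err != nil {
                log.Printf("flushing counters failed: %v", err)
            }
        }
    }
}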



Reduce The Model Size


The microservice processes the customers' requests, and saves some of the information to redis. This information is later processed by other microservices, which transform it to another representation and send it to external services. The issue is that the transformed information is much smaller. So I moved the transformation into the microservice, BEFORE saving the information to redis. This reduces both the size of the redis update commands, and the memory footprint of redis.
In general, examine your model. Do you need everything in it? The size of the model is critical to the performance. In some cases it is worth going back to the PM with the question: "This information costs 3$ per month per customer. Is it necessary for our product?"


Final Note

In this post I've presented the steps I took to fix a microservice performance issue. Some of these steps could have been identified in advance while implementing the service, but in real life, changing requirements and legacy code tend to accumulate these kinds of issues. It is important to stress test a microservice's performance before going live, as well as to periodically revisit it.



Sunday, September 11, 2022

GCP Jobs


 

In this post we will review the usage of GCP jobs. GCP jobs can run containers to perform tasks. The nice thing about them is that we can also send parameters to the jobs, and hence create an army of workers.


The Details

To use a GCP job, we would probably want to start by pushing the container image to GCR (the GCP container registry). To push, use the following commands:


docker tag my-snapshot-local/my-image:latest gcr.io/my-project/my-repo/my-image:latest
docker push gcr.io/my-project/my-repo/my-image:latest
docker rmi gcr.io/my-project/my-repo/my-image:latest



Now we can run a job that uses this image:


gcloud beta batch jobs submit my-job --location us-central1 --config job.json


where job.json is the configuration of the job:


{
    "name": "projects/my-project/locations/us-central1/jobs/my-job",
    "taskGroups": [
        {
            "taskCount": "1",
            "parallelism": "1",
            "taskSpec": {
                "runnables": [
                    {
                        "container": {
                            "imageUri": "gcr.io/my-project/my-repo/my-image:latest",
                            "entrypoint": "/my-entrypoint.sh",
                            "commands": [
                                "my-argument"
                            ]
                        }
                    }
                ],
                "computeResource": {
                    "cpuMilli": "4000",
                    "memoryMib": "8192"
                }
            }
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "policy": {
                    "provisioningModel": "STANDARD"
                }
            }
        ]
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}


Notice that the job configuration includes both the entrypoint and the arguments for the job. This allows us to submit the job with different arguments on each run, and create many containers working for us. For example, if we want to run a stress test for our site, this is a great method to create heavy load from different containers. Not only that, we can also create the load from different locations.


Final Note


We have presented how to use GCP jobs. In the project where I used GCP jobs, I automated the submission of a bulk of jobs, and used sed to automatically replace the arguments, for example:


jobs=4

for ((i=0;i<${jobs};i++));
do
    cp job.json job_temp.json
    sed -i "s/___JOB_ID___/${i}/g" job_temp.json
    sed -i "s/___JOB_INDEX___/${i}/g" job_temp.json
    sed -i "s/___JOBS___/${jobs}/g" job_temp.json
    gcloud beta batch jobs submit job${i} --location us-central1 --config job_temp.json
    rm job_temp.json
done


This shows the power of GCP jobs - with a very small amount of effort, we can get great results.







Sunday, September 4, 2022

GKE Autopilot - Bad Experience

 



In this post we will review my experience of using GKE autopilot. 


TL;DR

While GKE autopilot is simpler to use, it has some restrictions which limit its flexibility. The bottom line is that I do not recommend using it.


The Story

I've been using GKE standard mode for a long time, and for a new project I decided to check out GKE autopilot. Google surely tries to promote it, as it is the default mode when creating a new kubernetes cluster.

It does sound great: you no longer need to maintain node pools, which is a real burden, including deciding which type of compute instances to use and configuring the pool sizes. Google also states that you pay only ~70$ per month for the cluster service, and everything else is billed only by the pods' resource requests.

However, right from the start, I noticed that Google had decided to make some assumptions and constraints that simplify GKE autopilot for them (the GKE implementors), and not for us (the GKE users).

When creating the GKE cluster, the Telemetry (Cloud operations logging and monitoring) is turned on by default, and there is no way to turn it off. See the question I asked on stackoverflow. This means you pay, and might pay quite a lot in case of logs over the quota.

A standard GKE cluster automatically exposes CPU and memory metrics that you can collect using prometheus and display in grafana, but in GKE autopilot the kube-system namespace is managed and protected by Google, hence you cannot get these metrics, even though the metrics-server is running as part of the cluster. See another question I asked about this on stackoverflow.

The last issue, which made me abandon GKE autopilot, is the resource requests limitation. The minimum resource request is 0.25 vCPU and 0.5GiB RAM. These are quite large values, specifically for dev and test clusters, and also for small microservices on a production cluster. Not only that, but you also need to maintain a ratio between 1:1 and 1:6.5 between vCPU and memory, so, for example, if you request 1 vCPU you must also request between 1GiB and 6.5GiB of RAM. These last two restrictions have significant implications on the costs.


Final Note

As specified before, I do not recommend using GKE autopilot. It might be nice for a temporary cluster that you want to set up for a limited period of time, but in the end the restrictions will make you abandon it.

In general I think Google's GKE is much better, simpler, and cheaper than AWS EKS, but the autopilot seems to me like a miss. I hope that as time passes these issues will be addressed.