Monday, September 19, 2022

Performance Tuning for Redis-Based Microservices


 


In the last few years, I've been implementing Go-based microservices that use Redis. Recently I did a round of performance tuning for one of these microservices. The target cost for the service (running on GCP) was $1 per month per customer using the service. In my initial check, it turned out the cost was close to $500 per month... After several days of design changes and adjustments I was able to meet the cost requirement. In this post I will share the steps and the code changes that made this possible.


TL;DR

Tuning Redis-based microservices requires the following:

  • Break long values into multiple keys
  • Use an in-memory cache that updates periodically to reduce reads from Redis
  • Use an in-memory cache and reuse existing keys to reduce updates to Redis
  • Use a pipeline to stream Redis updates
  • Reduce the size of the data

Finding The Problems

The first step in performance tuning is being able to identify the problem; otherwise you're shooting in the dark. As the microservices run on a Kubernetes cluster, I used Prometheus and Grafana to check the memory and CPU consumption of Redis and of the microservice.


In addition I've added some of my own metrics to the microservice.


Seeing the CPU and memory graphs in Grafana provides a fast view of the current bottleneck. In the case of a Go-based microservice, pprof provides a clear view of the code locations where most of the time is spent, as well as of memory allocations.


Breaking Long Values

One of the first things I noticed was very long Redis access times on read and update operations. I found the microservice using a Redis HASH whose values are JSON objects, for example:

HSET customer-0001 item1 '{"property1":"value1","property2":"value2","property3":"value3","property4":"value4"}'


The actual JSON object had more than 20 fields, and the whole item JSON was ~1K in size. The goal is to keep a list of items per customer. A Redis HASH is used here because the Redis KEYS command is discouraged in production, and the HASH provides a way to list the items.

With this model, every item update costs a read and a write of ~1K to Redis, and the result was very slow Redis response times.

The correct design for this case is to use a Redis SET to keep the list of item names, and to keep each item in its own Redis HASH. This provides both a listing of the items and small, granular reads/updates of a single item (a Go sketch follows the commands below). For example:


SADD customer-0001 item1
HSET customer-0001-item1 property1 value1
HSET customer-0001-item1 property2 value2
HSET customer-0001-item1 property3 value3
HSET customer-0001-item1 property4 value4
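
In Go, the reads and updates become correspondingly granular. Here is a minimal sketch, assuming the go-redis client (github.com/redis/go-redis/v9); the helper names and the key layout mirror the example above and are illustrative only:

package itemstore

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// ListItems returns the item names kept in the customer's SET,
// replacing the discouraged KEYS scan.
func ListItems(ctx context.Context, rdb *redis.Client, customer string) ([]string, error) {
	return rdb.SMembers(ctx, customer).Result()
}

// GetItem reads a single item's HASH; only that item's fields travel over the wire.
func GetItem(ctx context.Context, rdb *redis.Client, customer, item string) (map[string]string, error) {
	return rdb.HGetAll(ctx, customer+"-"+item).Result()
}

// SetItemProperty updates one property without rewriting the whole ~1K JSON blob.
func SetItemProperty(ctx context.Context, rdb *redis.Client, customer, item, field, value string) error {
	return rdb.HSet(ctx, customer+"-"+item, field, value).Err()
}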
    


In-Memory Read Cache


The microservice serves customer-related requests. Each request reads the customer's items and then decides what to do next. The request rate is high, which means a huge number of requests constantly reading many keys/values from Redis: thousands of reads every second. But then I asked: do we really need the most recent state of the data model for each request? What if we get a state that is several seconds old? It turned out that was just fine. So I created a model of the customers' items in the microservice's memory. This entire model is reloaded every few seconds, so we still pay the cost of reading all of the data, BUT each request no longer reads the state from Redis. Instead of reading a customer's items hundreds of times every second, we read them only once every few seconds.
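
A minimal sketch of such a read cache, again assuming go-redis v9 and the key layout from the previous section; the snapshot structure, the list of customers, and the reload interval are illustrative, not the service's real code:

package itemstore

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// readCache holds an in-memory snapshot of "hash key -> fields" for all items.
type readCache struct {
	mu    sync.RWMutex
	model map[string]map[string]string
}

// Get serves a request from memory; no Redis round trip per request.
func (c *readCache) Get(key string) (map[string]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	fields, ok := c.model[key]
	return fields, ok
}

// ReloadLoop rebuilds the whole snapshot every few seconds, so the full read
// cost is paid once per interval instead of once per request.
func (c *readCache) ReloadLoop(ctx context.Context, rdb *redis.Client, customers []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fresh := make(map[string]map[string]string)
			for _, customer := range customers {
				items, err := rdb.SMembers(ctx, customer).Result()
				if err != nil {
					continue // skip this customer in this snapshot
				}
				for _, item := range items {
					key := customer + "-" + item
					if fields, err := rdb.HGetAll(ctx, key).Result(); err == nil {
						fresh[key] = fields
					}
				}
			}
			c.mu.Lock()
			c.model = fresh
			c.mu.Unlock()
		}
	}
}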


In-Memory Update Cache


Each customer-related request updates several counters using INCRBY and HINCRBY. These counter updates are not covered by the in-memory read cache, as we need an increment for every request. But here too: do the counters really need to be updated in real time, or can we delay them by several seconds? In this case we could delay them. So instead of issuing the Redis increment commands for every request, I created an in-memory key-value state that accumulates the increments and flushes them once every second; a sketch of this accumulator follows the command examples below. Notice that the performance gain here is high. Instead of the following commands:



HINCRBY customer-0001-item1 counter1 1
HINCRBY customer-0001-item1 counter2 1
HINCRBY customer-0001-item1 counter1 1
HINCRBY customer-0001-item1 counter2 1
HINCRBY customer-0001-item1 counter1 1
HINCRBY customer-0001-item1 counter2 1
    


We get aggregated commands:


HINCRBY customer-0001-item1 counter1 3
HINCRBY customer-0001-item1 counter2 3
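
Here is a minimal sketch of the accumulator itself; the structure (keyed by Redis key and hash field) is illustrative, not the service's actual code:

package counters

import "sync"

// counterCache accumulates pending HINCRBY deltas in memory.
type counterCache struct {
	mu     sync.Mutex
	deltas map[string]map[string]int64 // redis key -> field -> pending increment
}

// Incr is called on every request; it only touches memory.
func (c *counterCache) Incr(key, field string, n int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.deltas[key] == nil {
		c.deltas[key] = make(map[string]int64)
	}
	c.deltas[key][field] += n
}

// Drain swaps out the accumulated deltas so a background flusher can send
// the aggregated commands once per interval.
func (c *counterCache) Drain() map[string]map[string]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.deltas
	c.deltas = make(map[string]map[string]int64)
	return out
}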


Use Pipeline


The next step is a result of the in-memory update cache. We already accumulate many commands in the cache, so why wait for the Redis reply to each command? Instead of running the commands one after the other, I changed the in-memory update cache to use the Redis pipeline. This reduces the latency of updates, as the microservice no longer waits for the server's reply to each command before sending the next one.
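
A minimal sketch of the flusher, assuming go-redis v9 and the counterCache from the previous sketch; the one-second interval and the error handling are simplified:

package counters

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// FlushLoop sends the accumulated deltas through a single pipeline once per
// second: one round trip per batch instead of one per HINCRBY command.
func FlushLoop(ctx context.Context, rdb *redis.Client, c *counterCache) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pipe := rdb.Pipeline()
			for key, fields := range c.Drain() {
				for field, delta := range fields {
					pipe.HIncrBy(ctx, key, field, delta) // queued locally, not sent yet
				}
			}
			// A single round trip sends the whole batch and collects all replies.
			if _, err := pipe.Exec(ctx); err != nil {
				log.Printf("flush to redis failed: %v", err)
			}
		}
	}
}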



Reduce The Model Size


The microservice processes the customers' requests and saves some of the information to Redis. This information is later processed by other microservices, which transform it into another representation and send it to external services. The point is that the transformed information is much smaller. So I moved the transformation into the microservice, BEFORE saving to Redis. This reduces both the size of the Redis update commands and the memory footprint of Redis.
In general, examine your model. Do you need everything in it? The size of the model is critical to performance. In some cases it is worth going back to the PM with the question: "This information costs $3 per month per customer. Is it necessary for our product?"
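
A minimal sketch of the idea, assuming go-redis v9; the request and record types and the summarize helper are purely illustrative stand-ins for the real transformation:

package records

import (
	"context"
	"encoding/json"

	"github.com/redis/go-redis/v9"
)

// incomingRequest stands in for the full request the microservice receives.
type incomingRequest struct {
	CustomerID string
	RecordID   string
	Payload    string // large raw payload, not needed downstream
}

// storedRecord is the compact, already-transformed representation.
type storedRecord struct {
	Summary string `json:"s"`
}

// summarize stands in for the transformation the downstream services used to do.
func summarize(payload string) string {
	if len(payload) > 64 {
		return payload[:64]
	}
	return payload
}

// SaveRecord transforms before writing, so Redis only ever stores the small form.
func SaveRecord(ctx context.Context, rdb *redis.Client, req incomingRequest) error {
	rec := storedRecord{Summary: summarize(req.Payload)}
	b, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	return rdb.HSet(ctx, "records-"+req.CustomerID, req.RecordID, b).Err()
}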


Final Note

In this post I presented the steps I took to tune a microservice's performance. Some of these steps could have been identified up front while implementing the service, but in real life, changing requirements and legacy code tend to accumulate these kinds of issues. It is important to stress-test a microservice's performance before going live, as well as to revisit it periodically.


