Recently I've been working on several new projects planned to run on a Kubernetes cluster on a public cloud. Some of the projects were on day 1 of their implementation, while others had a year of implementation behind them. So the PM pops the question:

"Let me know how much this will cost"

The more mature the implementation, the more straightforward it is to give a precise answer to this question. If the project is nearly ready, we can run a stress test, with Prometheus and Grafana monitoring the cluster, to provide an answer. However, some items must be handled properly to fully estimate the cost.

Existing Implementation Estimation

For an existing project, we can use the following methods.

Stress Test Includes Production Data

If the solution serves multiple customers, we should simulate the data we expect to get, both in the number of requests per customer and in the number of customers. Notice that in many cases the load is not evenly spread across customers, hence the stress test should simulate an expected distribution of loads, for example: 10% of the simulated customers generate heavy load, 80% generate medium load, and 10% generate low load.
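The distribution above can be sketched in a few lines of Python. This is only an illustration: the 10/80/10 shares come from the example, while the request rates per tier and the names LOAD_TIERS, assign_tiers and total_requests_per_second are made up for this sketch and should be replaced with your own expectations.

```python
import random

# Hypothetical tiers: (share of customers, requests per second per customer).
# The shares follow the 10/80/10 example; the request rates are assumptions.
LOAD_TIERS = {
    "heavy": (0.10, 50.0),
    "medium": (0.80, 5.0),
    "low": (0.10, 0.5),
}


def assign_tiers(num_customers, seed=0):
    """Randomly assign each simulated customer to a load tier by its share."""
    rng = random.Random(seed)  # fixed seed keeps the stress test repeatable
    names = list(LOAD_TIERS)
    weights = [LOAD_TIERS[name][0] for name in names]
    return rng.choices(names, weights=weights, k=num_customers)


def total_requests_per_second(tiers):
    """Sum the target request rate over all simulated customers."""
    return sum(LOAD_TIERS[tier][1] for tier in tiers)


if __name__ == "__main__":
    tiers = assign_tiers(1000)
    print({name: tiers.count(name) for name in LOAD_TIERS})
    print("target requests per second:", total_requests_per_second(tiers))
```

The resulting per-customer tiers can then be fed to whichever stress tool you use, so that each simulated customer sends traffic at its own rate.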

Stress Test Compute

The stress test simulation tools require both high compute and high bandwidth, hence we should run them on the public cloud infrastructure. A nice cloud service for this purpose is the cloud batch service, which enables running containers on demand.

Estimation Sheet

To get a good estimation, we need to test the deployment using several configurations. For example, start with the cost per customer for a deployment with 10 customers, and then check 100 customers, 1000 customers, and so on. We usually expect the price per customer to drop as the number of customers increases. The final result is a graph whose X axis is the number of customers and whose Y axis is the cost per customer.
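As a small sketch of such a sheet, the snippet below turns measured total costs into cost per customer. The monthly totals are invented numbers for illustration only, and cost_per_customer is a hypothetical helper, not part of any tool.

```python
def cost_per_customer(measurements):
    """Map customer count -> cost per customer, given measured total costs."""
    return {
        customers: total_cost / customers
        for customers, total_cost in sorted(measurements.items())
    }


if __name__ == "__main__":
    # Hypothetical monthly totals in USD, one entry per tested configuration.
    measured = {10: 400.0, 100: 1500.0, 1000: 9000.0}
    for customers, cost in cost_per_customer(measured).items():
        print(f"{customers:>6} customers: {cost:.2f} USD per customer")
```

The returned mapping is exactly the data behind the graph: the keys are the X axis and the values are the Y axis.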

Cost Per Service

A cost estimation process is not a one-time task. It usually includes running the stress tests, producing an estimation sheet, trying to reduce the cost by changing the most compute-demanding service, and then re-running the tests again and again. To provide an understanding of the cost, we need to supply cost details for each of the services. These include: number of pods, average memory per pod, and average CPU per pod.
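A per-service breakdown could be sketched as follows. The unit prices are placeholders and not any provider's real rates, the ServiceUsage fields mirror the three details listed above, and in practice billing is usually per node rather than per pod, so treat the result as an approximation for ranking services.

```python
from dataclasses import dataclass

# Hypothetical unit prices; replace with your cloud provider's actual rates.
CPU_PRICE_PER_CORE_HOUR = 0.03
MEMORY_PRICE_PER_GB_HOUR = 0.004
HOURS_PER_MONTH = 730


@dataclass
class ServiceUsage:
    name: str
    pods: int
    avg_cpu_cores: float  # average CPU per pod, in cores
    avg_memory_gb: float  # average memory per pod, in GB


def monthly_cost(usage):
    """Approximate monthly compute cost of one service."""
    hourly_per_pod = (
        usage.avg_cpu_cores * CPU_PRICE_PER_CORE_HOUR
        + usage.avg_memory_gb * MEMORY_PRICE_PER_GB_HOUR
    )
    return usage.pods * hourly_per_pod * HOURS_PER_MONTH


def rank_by_cost(services):
    """Most expensive service first: the first candidate for optimization."""
    return sorted(services, key=monthly_cost, reverse=True)


if __name__ == "__main__":
    services = [
        ServiceUsage("api", pods=4, avg_cpu_cores=1.5, avg_memory_gb=2.0),
        ServiceUsage("worker", pods=10, avg_cpu_cores=0.5, avg_memory_gb=1.0),
    ]
    for service in rank_by_cost(services):
        print(f"{service.name}: {monthly_cost(service):.2f} USD per month")
```

The averages would typically be read from the Prometheus metrics collected during the stress test.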

Total Costs

Note that while the tests mostly target compute resources (CPU and memory), there are other costs which play a big part in the total sum. These include the following.

Network traffic cost, which is usually free for incoming traffic and paid for egress traffic. Notice that if we use a VPN between regions, there is a cost for ingress traffic as well.

Storage cost, which in most cases is low, because storage is relatively cheap on public cloud providers.

API cost, which might be high on a public cloud. Notice that we might have a low cost for storage, but a high cost for the API access used to store and retrieve the data.

Other cloud services costs, which include for example the Kubernetes cluster itself, and might include other services such as secure key management and databases.

Special attention should be paid to Ingress, which is actually implemented as a load balancer, and hence has an uptime cost as well as a per API call cost.
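To keep these items from being forgotten, they can be added up explicitly next to the compute cost. The sketch below is illustrative only: every price in it is an assumption, and the function name and price keys are hypothetical.

```python
def non_compute_costs(egress_gb, stored_gb, api_calls, lb_hours, prices):
    """Return a per-item breakdown of the monthly non-compute costs."""
    return {
        "egress": egress_gb * prices["egress_per_gb"],
        "storage": stored_gb * prices["storage_per_gb"],
        "api": api_calls / 1000 * prices["per_1000_api_calls"],
        # A load balancer is charged for uptime even when it is idle.
        "load_balancer": lb_hours * prices["lb_per_hour"],
    }


if __name__ == "__main__":
    # Hypothetical prices in USD; look up the real ones for your provider.
    prices = {
        "egress_per_gb": 0.09,
        "storage_per_gb": 0.02,
        "per_1000_api_calls": 0.005,
        "lb_per_hour": 0.025,
    }
    breakdown = non_compute_costs(
        egress_gb=500, stored_gb=200, api_calls=3_000_000,
        lb_hours=730, prices=prices,
    )
    for item, cost in breakdown.items():
        print(f"{item}: {cost:.2f} USD")
    print(f"total: {sum(breakdown.values()):.2f} USD")
```

Keeping the breakdown per item, rather than a single total, makes it easier to spot cases where API access costs more than the storage itself.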

Estimation Before Implementation

The previous sections discussed stress testing an existing deployment, but what do we do if we have just started the development, and we do not have an implementation ready?

The answer is: guess!

But...

Do not guess wildly, but instead use guidelines from other implementations. Try estimating the number of incoming API calls to the services, and the number of database accesses you'll need.

In general, a good guideline is that a pod with 2 GB RAM and 2 CPUs, doing minor database updates, can handle about 500 requests per second. This is a very rough estimation, but unless we have other information, we can use it as a baseline guess.
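Using that baseline, a first guess for the number of pods is simple arithmetic. In the sketch below, the 500 requests per second figure is the rough guideline from the text, while the headroom factor and the function name pods_needed are assumptions added for illustration.

```python
import math

# Rough guideline: one 2 GB RAM / 2 CPU pod handles about 500 requests/second.
BASELINE_RPS_PER_POD = 500


def pods_needed(expected_rps, headroom=0.25):
    """Estimate the pod count for a peak request rate.

    headroom is an assumed safety margin (0.25 means 25% spare capacity).
    """
    required = expected_rps * (1 + headroom) / BASELINE_RPS_PER_POD
    return max(1, math.ceil(required))  # always run at least one pod


if __name__ == "__main__":
    for rps in (100, 2000, 10000):
        print(f"{rps} requests/second -> {pods_needed(rps)} pods")
```

Multiplying the pod count by the per-pod price of a 2 GB RAM and 2 CPU allocation then gives a first compute estimate, to be replaced by measured numbers once a stress test is possible.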

Final Note

While cost estimation is a long process, we should make our best effort to automate and document it, as it will probably be used not only once, but also later in the product lifetime, and maybe by another team...

Also, once results are shown, make sure to back them with all the data that led to them, as questions will come up from the PM or from other teams. For example: Which public cloud did you use? Which compute instances? What was the stress test's requests per second rate? Why did you choose this load?


Monday, January 9, 2023

Cost Estimation For New Cloud Based Projects