In this post we will review the HyperLogLog algorithm (HLL) and its extension, the HyperLogLog++ algorithm (HLL++).
The Goal
The goal of HLL is to estimate the cardinality of a set. For example, let's assume we get requests from multiple source IPs, and we want to count the number of distinct source IPs.
A Naive Approach
The naive solution would be to keep a map of source IPs, for example:
sourceIps := make(map[string]bool)
for _, sourceIp := range incomingRequests {
    sourceIps[sourceIp] = true
}
fmt.Printf("distinct source IPs amount: %v\n", len(sourceIps))
The problem with such a solution is its scale. If we have 1G distinct source IPs, this map would require about:
1G entries * size of entry =~ 10 GB RAM
Hence for a very large number of items this is not feasible.
Leading Zeros
The base observation behind HLL is about the probability of leading zeros. If our items are uniformly distributed, we can estimate the number of items by tracking the maximum number of leading zeros in the bit representation of the items. To get uniformly distributed input we run each item through a hash function.
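For example, here is a minimal sketch of such a hash step, using the standard library's FNV-1a hash (any well-distributed 64-bit hash would do; hashToUint64 is just an illustrative helper, not part of HLL itself):
import "hash/fnv"

// hashToUint64 maps an arbitrary input string to a 64-bit value.
// A good hash spreads its outputs roughly uniformly over all 64 bits,
// which is the property the leading-zeros argument below relies on.
func hashToUint64(value string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(value))
    return h.Sum64()
}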
For example, let's assume our input is random 16-bit numbers.
The probability to get an item with one leading zero: 0xxx xxxx xxxx xxxx is 0.5.
The probability to get an item with two leading zeros: 00xx xxxx xxxx xxxx is 0.25.
The probability to get an item with three leading zeros: 000x xxxx xxxx xxxx is 0.125.
...
The probability to get an item with k leading zeros is 0.5^k.
Hence we only need to keep track of the maximum number of leading zeros encountered in our input:
If the max is 1 leading zero, we assume a cardinality of 2 items.
If the max is 2 leading zeros, we assume a cardinality of 4 items.
If the max is 3 leading zeros, we assume a cardinality of 8 items.
...
If the max is k leading zeros, we assume a cardinality of 2^k items.
This is of course an estimation and is prone to errors.
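As a small sketch (assuming 64-bit hashed inputs, as produced above), this is all the bookkeeping the basic idea needs:
import (
    "math"
    "math/bits"
)

// EstimateByLeadingZeros tracks only the maximum number of leading zeros
// seen over all hashed items and estimates the cardinality as 2^max.
func EstimateByLeadingZeros(hashes []uint64) float64 {
    maxZeros := 0
    for _, h := range hashes {
        zeros := bits.LeadingZeros64(h) // leading zero bits of this item
        if zeros > maxZeros {
            maxZeros = zeros
        }
    }
    return math.Pow(2, float64(maxZeros)) // 2^maxZeros
}
Note that a single counter like this can only produce power-of-two estimates, which is part of why it is so noisy; the buckets described next address this.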
Buckets
To improve the estimation, HLL uses buckets.
Buckets are groups of input items; for each bucket we keep track of the maximum number of leading zeros seen. The number of buckets is 2^p, and the first p bits of the input select the bucket.
For example:
Let's assume p=3, hence we have 2^3=8 buckets, and review a sequence of inputs.
For input 0010 0001 0001 0001 we have 2 leading zeros.
The first 3 bits of the input are 001, hence we update bucket1=2 max leading zeros.
For input 0110 0001 0001 0001 we have 1 leading zero.
The first 3 bits of the input are 011, hence we update bucket3=1 max leading zeros.
For input 0000 0001 0001 0001 we have 7 leading zeros.
The first 3 bits of the input are 000, hence we update bucket0=7 max leading zeros.
For input 0000 1111 1111 1111 we have 4 leading zeros.
The first 3 bits of the input are 000, but bucket0 already has max leading zeros = 7 so we do not update it.
So the results are:
bucket0 has max 7 leading zeros --> probability 2^-7
bucket1 has max 2 leading zeros --> probability 2^-2 = 0.25
bucket3 has max 1 leading zero --> probability 2^-1 = 0.5
the other 5 buckets are empty (0 leading zeros) --> probability 2^0 = 1 each
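A minimal sketch of this bucket bookkeeping, following the simplified convention of the walkthrough above (leading zeros counted over the whole 16-bit value, the first p bits selecting the bucket):
import (
    "fmt"
    "math/bits"
)

// BucketExample replays the four example inputs with p=3 and prints the
// resulting per-bucket maximum leading zeros.
func BucketExample() {
    const p = 3
    registers := make([]int, 1<<p) // 2^3 = 8 buckets, all start at 0

    inputs := []uint16{
        0b0010_0001_0001_0001,
        0b0110_0001_0001_0001,
        0b0000_0001_0001_0001,
        0b0000_1111_1111_1111,
    }
    for _, x := range inputs {
        bucket := x >> (16 - p)         // first p bits select the bucket
        zeros := bits.LeadingZeros16(x) // leading zeros of the 16-bit input
        if zeros > registers[bucket] {
            registers[bucket] = zeros
        }
    }
    fmt.Println(registers) // [7 2 0 1 0 0 0 0]
}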
Now we use the harmonic mean of the per-bucket values to estimate the number of items. The 5 empty buckets take part in the sum as well, each contributing 2^0 = 1.
H = number of buckets / sum( 2^-(max leading zeros of bucket i) )
H = 8 buckets / (2^-7 + 2^-2 + 2^-1 + 5 * 2^0) = 8 / 5.7578125 =~ 1.389
Estimated Cardinality
Once we have the harmonic mean, we use the following to estimate the cardinality:
alpha = 0.7213 / (1 + 1.079 / buckets) = 0.7213 / (1 + 1.079 / 8) = 0.635576605
The 0.7213 and 1.079 constants are used for alpha regardless of the bucket count. They were derived as part of the research calibrating HLL.
E = alpha * buckets * H = 0.635576605 * 8 * 1.389 =~ 7
Hence the estimated cardinality is about 7, which is far from the actual cardinality of 4 items in our example.
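A small sketch that reproduces the numbers above from the example registers (empty buckets hold 0 and contribute 2^0 = 1 each):
import (
    "fmt"
    "math"
)

// EstimateExample computes the harmonic mean, alpha and the estimate E
// for the register values of the p=3 example above.
func EstimateExample() {
    registers := []int{7, 2, 0, 1, 0, 0, 0, 0}
    m := float64(len(registers)) // 8 buckets

    sum := 0.0
    for _, r := range registers {
        sum += math.Pow(2, -float64(r)) // each bucket contributes 2^-r
    }
    h := m / sum                    // harmonic mean: 8 / 5.7578125 =~ 1.389
    alpha := 0.7213 / (1 + 1.079/m) // =~ 0.6356
    e := alpha * m * h              // =~ 7
    fmt.Printf("H=%.3f alpha=%.4f E=%.1f\n", h, alpha, e)
}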
Notice that HLL performs poorly on small sets; on large sets the typical error is around 2%.
Memory Requirements
Unlike the naive solution, HLL requires memory only for the bucket registers. A common configuration is p = 14, which gives 2^14 =~ 16K registers of one byte each, i.e. about 16 KB of memory regardless of the number of distinct items.
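For a rough sense of scale, assuming one byte per register, the entire HLL state for p = 14 is just:
const p = 14
var registers = make([]uint8, 1<<p) // 2^14 = 16,384 bytes =~ 16 KB, no matter how many items are added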
HLL++
HLL++ is an improvement by Google over HLL.
It uses a dedicated algorithm for small datasets, and automatically switches to HLL once the input data is large enough.
It uses a sparse representation of the buckets to reduce the memory footprint for small datasets.
In addition, HLL++ applies automatic bias correction according to the dataset size.
HLL++ Implementation Example
The following example uses the lytics HLL library.
import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"

    "github.com/lytics/hll"
)

func CheckHllPlusPlus() {
    const p = 14      // Max memory usage is 0.75 * 2^p bytes.
    const pPrime = 25 // Setting this is a bit more complicated, Google recommends 25.
    myHll := hll.NewHll(p, pPrime)

    // The library expects a 64-bit hash of each value; here we take the
    // first 8 bytes of a SHA-1 digest.
    myHash := func(value string) uint64 {
        sha1Hash := sha1.Sum([]byte(value))
        return binary.LittleEndian.Uint64(sha1Hash[0:8])
    }

    // Add 1000 distinct values and check the estimate.
    for i := range 1000 {
        value := fmt.Sprintf("%v", i)
        hashed := myHash(value)
        myHll.Add(hashed)
    }
    fmt.Printf("estimated cardinality %v\n", myHll.Cardinality())
}