Full Blog TOC

Full Blog Table Of Content with Keywords Available HERE

Sunday, August 17, 2025

ClickHouse Concepts

 


In this post we will review some of the key concepts in ClickHouse.


Projection

A projection is a live physically stored query result for a specific table.
The goal of the projection is to get the query results ~zero time while adding more computation time upon the inserts to the table.
The projection is updated upon insert the the source table.
Example for a projection creation:


CREATE TABLE IF NOT EXISTS my_table (
application_id String,
event_id String,
event_type String,
event_time DateTime,
event_text String

PROJECTION events_per_type_and_day (
SELECT
application_id,
event_type,
event_day as toYYYYMMDD(event_time),
count(*)
GROUP BY
application_id,
event_type,
event_day as toYYYYMMDD(event_time)
)
)


The projection can be created even after the table is created, but then it will contain only the data added after the projection creation. To fully sync the projection use the materialize command:

ALTER TABLE table MATERIALIZE PROJECTION events_per_type_and_day


Notice the client can still fetch from the original table, but then the ClickHouse for efficiency will decide using the projection.

Explain Plan

Explain plan is a command that requests the ClickHouse to describe the method that is about to run for a specific query. For example, we can run the query related the the projection above and verify that the ClickHouse will indeed use the projection. A nice thing is that we can use the ClickHouse HTTP 8123 port to ask for this, for example:


curl -u username:password 'http://localhost:8123/' \
-d "
EXPLAIN PLAN
SELECT
application_id,
event_type,
event_day as toYYYYMMDD(event_time),
count(*)
FROM my_table
GROUP BY
application_id,
event_type,
event_day as toYYYYMMDD(event_time);
"


Materialized Column

A materialized column is a column that is automatically calculated upon insert command. This enables us to add column that otherwise we would had to update the code of the process inserting new records. This is great if we want to add a column that would be later used in queries and we want to avoid calculating it every time we run the query. For example:


CREATE TABLE IF NOT EXISTS my_table (
application_id String,
event_id String,
event_type String,
event_time DateTime,
event_text String,

event_hour String MATERIALIZED formatDateTime(EVENT_TIME, '%Y-%m-%dT%H:00:00'),
)


TTL

Table TTL enables us to automatically remove old data from the table. It is great when used along side with partition that is build using the record time. Notice that ClickHouse does not guarantee the delete of the stable data exactly upon the TTL but it will be done eventually. Example for TTL combined with partitioning:

CREATE TABLE IF NOT EXISTS my_table (
application_id String,
event_id String,
event_type String,
event_time DateTime,
event_text String,
)
PARTITION BY event_time
TTL event_time + INTERVAL 7 DAY DELETE

Skip Index

Skip index is not the standard index we know from SQL DBMS. To understand it we need to understand the internal ClickHouse persistence method.

ClickHouse stored data in granules.
Each granule stores 8K records.
Skip index is an additional data per granule that let assist in the decision whether the granule is relevant for the query.

There are different types of metadata the skip index can store.
For a number column it could store the min and max values in the granule, then we can decide if the granule is relevant for our query by the value in the WHERE clause.
There are also skip index types for set of string values, such as set and bloom filter.

Example for skip index:

ALTER TABLE my_table ADD INDEX application_id_index application_id TYPE bloom_filter(0.01) GRANULARITY 1


Merge Tree Variants

Merge tree is the ClickHouse storage engine. It is dedicated for the fact that we want only to insert new records and not to update and delete records. To support this working methodology we instruct ClickHouse to merge duplicate records by the primary key.

Notice: Unlike standard SQL DBMS the primary key is not enforced but only used for later merge actions.

For example, this would ensure we have only one record for a part:

CREATE TABLE IF NOT EXISTS my_parts (
item_id String,
part_id String,
price int32,
update_time DateTime
)
ENGINE=ReplacingMergeTree(update_time)
ORDER BY (
item_id,
part_id
)

Notice the merge action occurs periodically, and between the insert and the query we still might have more than a single row for the same primary key. We can use the keyword FINAL to get merged results even the merge was not run yet:

select * from my_parts FINAL

We've used the replace merge method, but there are several variants for the merge method such as sum, aggregation, version collapse and more.

Materialized View

Materialized view populates a table based on results of query from an origin table. It is similar to projection but it create storage a part of a different table, and hence it is not transparent to the client querying from the ClickHouse which must explicitly specify the materialized view target table.

For example:

CREATE TABLE IF NOT EXISTS my_events (
application_id String,
event_id String,
event_type String,
event_time DateTime,
event_text String
);

CREATE MATERIALIZED VIEW my_events_by_day
ENGINE = SummingMergeTree
ORDER BY (
application_id,
event_type,
event_day
)
AS
SELECT
application_id,
event_type,
toDate(event_time) as event_day,
count() AS count
FROM my_events
GROUP BY
application_id,
event_type,
event_day
;






No comments:

Post a Comment