Building an Enterprise Grade OpenSource Web Analytics System – Part 6: Data Storage

This is the sixth part of a seven-part series explaining how to build an Enterprise Grade OpenSource Web Analytics System. In this post we take a brief look at storing the data we have collected and processed, using Elasticsearch and Clickhouse. In the previous post we built a persisted visitor profile with Python and Redis. If you are new to this series it might help to start with the first post.

During this series we defined multiple topics within Kafka, so we now have the data available at different levels of processing. If we want to keep any of it, we should put it into persistent storage like a data lake on Hadoop or a database. For this project, we are using Elasticsearch and dipping our toes into a database called Clickhouse for fun!

Feeding Data into Elasticsearch

From the previous part, we have a nice Kafka topic with flat, persisted tracking events. This is perfect for Elasticsearch, but we need a way to get the data from our Kafka topic into it. If you read Part 3 of this series you already know what we are going to use: Filebeat!

So we spin up another Filebeat instance and give it some minimal configuration:

filebeat.inputs:
- type: kafka
  hosts:
    - [Your Kafka instance]
  topics: ["tracking_persisted"]
  group_id: "filebeat"

output.elasticsearch:
  hosts: [Your Elasticsearch instance]
  index: "tracking_persisted"

processors:
  - decode_json_fields:
      fields: ["message"]
      process_array: true
      max_depth: 3
      target: "tracking"
      overwrite_keys: false
      add_error_key: true

Let’s go through this. Lines 1 to 6 connect Filebeat to our Kafka topic. We are using the “tracking_persisted” topic, which contains the flattened events after they have been enriched by our user profile scripts.

Lines 8 through 10 now define where the data is supposed to go. We give it the host and port of our Elasticsearch instance and define which index we want to use.
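
One caveat, depending on your Filebeat version: when you override the index name like this, Filebeat also expects a matching index template (and, on 7.x, index lifecycle management disabled), roughly along these lines:

setup.ilm.enabled: false                      # otherwise the index setting is ignored on 7.x
setup.template.name: "tracking_persisted"
setup.template.pattern: "tracking_persisted*"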

In the last block, lines 12 to 19, we make our lives a little easier. Normally, Filebeat would just put the whole event into one big text field called “message”. Here we tell Filebeat to treat that value as JSON, adding a field to the Elasticsearch document for each variable we collect from Snowplow. To keep those fields easily distinguishable we put them under the “tracking” namespace. This makes our documents look a little like this:
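
(A rough sketch with placeholder values; the exact fields depend on the variables you collect from Snowplow.)

  {
    "@timestamp": "2021-01-01T12:00:00.000Z",
    "message": "{ ... the original JSON string from Kafka ... }",
    "tracking": {
      "query-other_eventtype": "pv",
      "query-other_event_id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
      "query-other_domain_sessionid": "ffffffff-1111-2222-3333-444444444444",
      "query-event_unstructured_schema": "iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1"
    }
  }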

That was quite easy! Since we have our documents safely stored in Elasticsearch already, let’s try something cool on top.

Clickhouse for additional Data Storage

Clickhouse is a column-oriented database originally built for web analytics use cases. It is used by big companies like Yandex (which also develops it) and Cloudflare, and it handles large amounts of data very quickly. Once you have an instance running, you can connect to it with a tool called Tabix, which runs in your browser and talks directly to Clickhouse. Tabix gives us an interface that will look familiar if you have worked with any database tool before.

Now let’s try to get some data from Kafka and show it in Tabix. Open a new SQL tab and connect to Kafka like this:

  CREATE TABLE queue (
    "kafka_raw-topic" String, "kafka_raw-key" String, "query-other_eventtype" String, "query-other_event_id" String, "query-other_domain_sessionid" String, "query-event_unstructured_schema" String
  ) ENGINE = Kafka('[Your Kafka instance]', 'tracking_persisted', 'clickhouse', 'JSONEachRow');

Let’s take a closer look at that SQL. The first line creates a new table called queue, which reads the data we receive from Kafka. Line 2 defines which columns we want to take from our Kafka topic; you can pick any of the keys we used before or take the ones I used. Line 3 then connects Clickhouse to Kafka, specifying the broker, topic, consumer group and message format. The settings from above work for me!

That’s actually all we need to do. Let’s query Clickhouse for that data from Kafka!

SELECT * FROM queue LIMIT 100

This will take the first 100 rows from our Kafka topic and show them in a nice table like the one below:

It worked! We can see that Clickhouse actually receives the events after their journey through all the Kafka topics of our pipeline. The next steps depend heavily on your actual use case for the data. Keep in mind that the Kafka engine table only consumes the topic and does not store anything itself, so if you want to keep the data, you need to transfer it into a regular table via a materialized view in Clickhouse.
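
A minimal sketch of that pattern could look like the statements below; the table name tracking_events, the reduced column set and the ORDER BY key are just assumptions for illustration:

  -- a regular table that actually stores the rows
  CREATE TABLE tracking_events (
    "query-other_eventtype" String, "query-other_event_id" String, "query-other_domain_sessionid" String
  ) ENGINE = MergeTree() ORDER BY "query-other_event_id";

  -- continuously copies everything the queue table reads from Kafka
  CREATE MATERIALIZED VIEW tracking_events_mv TO tracking_events AS
    SELECT "query-other_eventtype", "query-other_event_id", "query-other_domain_sessionid"
    FROM queue;

The materialized view works like a continuous INSERT ... SELECT: every batch the queue table reads from the Kafka topic is written into the MergeTree table, where it stays queryable long after the messages have been consumed.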

That’s all for this part! In the last part, we will build a nice dashboard in Kibana based on our data in Elasticsearch.