Building an Enterprise-Grade Open-Source Web Analytics System – Part 1: Architecture
Some time ago I wrote a little series on how to amp up your log analytics activities. Ever since then I have wanted to start another project: building a fully fledged analytics system with client-side tracking and unlimited scalability out of open-source components. That is what this series is about, since I had some time to kill during Easter in isolation.
This time, we will be using a tracker in the browser or mobile app of our users instead of logfiles alone, which is called client-side tracking. That will give us a lot more information about our visitors and allow for some cool new use cases. It is also how tools like Adobe Analytics or Google Analytics work. The data we collect then has to be processed and stored for analysis and future use.
As a client-side tracker, we will be using the Snowplow tracker. While Snowplow has a whole ecosystem of analytics tools to offer, we want to build the processing part ourselves. Therefore, we are only going to use the component that is responsible for sending tracking events to our infrastructure. Snowplow offers a really nice set of libraries for browsers, mobile apps, and server applications, which allows us to stay with Snowplow no matter what we end up tracking.
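To give a feel for what that looks like, here is a minimal sketch using the snowplow-tracker Python package for a server-side application. The collector hostname and app id are placeholders, and the exact constructor arguments differ slightly between package versions, so treat this as an illustration rather than the final setup:

```python
from snowplow_tracker import Emitter, Tracker

# The emitter sends events to our own collector endpoint (placeholder hostname).
emitter = Emitter("tracking.example.com")

# The tracker builds the event payloads; "web-analytics-demo" is an arbitrary app id.
tracker = Tracker(emitter, app_id="web-analytics-demo")

# Track a page view and a custom structured event.
tracker.track_page_view("https://www.example.com/pricing", "Pricing")
tracker.track_struct_event(
    category="checkout",
    action="add-to-cart",
    label="sku-1234",
    value=19.99,
)
```

The browser and mobile trackers work the same way conceptually: you initialize them with a collector endpoint and then fire page views and custom events from your application code.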
Next, we need a system to actually capture all the data that the clients will be sending. In our previous project, Nginx and Filebeat proved to be reliable and highly scalable, so we will be using them here as well. Nginx will be the webserver that answers the clients' requests and writes them to logfiles, which Filebeat then ships to the next stage.
Up to this point our setup is not that different from the first project, but once the logfiles are gathered we introduce a new set of components. This time, Filebeat will ship our data to a message queue: Kafka. Kafka is one of the most widely used tools for communication between different software and system components. It is used a lot in big data systems and is considered the gold standard for those kinds of systems thanks to its scalability and reliability. It writes data to streams (topics) that can be fed and consumed by multiple systems at the same time.
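To make the stream idea a bit more concrete, here is a minimal sketch using the kafka-python package that simply tails the raw events Filebeat writes into Kafka; the topic name, broker address, and consumer group are assumptions for this example:

```python
from kafka import KafkaConsumer

# Attach to the topic that Filebeat ships the raw Nginx log lines into
# (topic name and broker address are placeholders).
consumer = KafkaConsumer(
    "tracking-raw",
    bootstrap_servers=["localhost:9092"],
    group_id="debug-tail",          # each consumer group gets its own view of the stream
    auto_offset_reset="earliest",   # start at the beginning of the topic
)

for message in consumer:
    # Filebeat sends JSON documents; the raw log line sits in the "message" field.
    print(message.value.decode("utf-8"))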
The stream of raw tracking events will then be consumed by a data processing layer based on Python and Redis. This allows us to further enrich and modify our data without putting that work on the webservers. That way we can build a visitor profile to help us with analysis later on. Once the data is processed, we will write it back to Kafka into a new queue of processed events.
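A stripped-down version of such a worker could look like the sketch below: it reads raw events from Kafka, keeps a simple per-visitor counter in Redis as a stand-in for a real visitor profile, and writes the enriched event to a second topic. The topic names, the Redis key layout, and the event fields (including the assumption that the raw topic already carries JSON with a visitor_id) are made up for illustration:

```python
import json

import redis
from kafka import KafkaConsumer, KafkaProducer

# Connections to Kafka and Redis (addresses are placeholders).
consumer = KafkaConsumer(
    "tracking-raw",
    bootstrap_servers=["localhost:9092"],
    group_id="enrichment-worker",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
profiles = redis.Redis(host="localhost", port=6379)

for message in consumer:
    event = message.value

    # Use the visitor id from the tracker payload to maintain a tiny profile:
    # here just a running hit counter per visitor (a real profile would hold more).
    visitor_id = event.get("visitor_id", "unknown")
    hits = profiles.hincrby(f"profile:{visitor_id}", "hits", 1)

    # Enrich the event with profile information and push it to the processed topic.
    event["visitor_hits"] = hits
    producer.send("tracking-processed", event)
```

Because every consumer group keeps its own offsets, we can run several of these workers in parallel, and other consumers (debugging tools, the storage sink) can read the very same raw topic without getting in each other's way.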
Last but not least, our events will be read from the Kafka queues to be permanently stored. Since Kafka is so widely adopted, there is basically no limit on which database we can use. Oftentimes those streams are consumed by big data platforms like Hadoop, but we will look at ClickHouse, a database built specifically for web analytics.
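As a rough sketch of that last step, the snippet below reads the processed events from Kafka and inserts them into ClickHouse in small batches using the clickhouse-driver package. The table, its columns, and the topic name are made up for this example, and ClickHouse also ships a native Kafka table engine that can do this job without any extra code:

```python
import json

from clickhouse_driver import Client
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tracking-processed",
    bootstrap_servers=["localhost:9092"],
    group_id="clickhouse-sink",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
clickhouse = Client(host="localhost")

batch = []
for message in consumer:
    event = message.value
    batch.append((event["visitor_id"], event["page_url"], event["visitor_hits"]))

    # Insert in batches: ClickHouse is much happier with bulk inserts than single rows.
    if len(batch) >= 1000:
        clickhouse.execute(
            "INSERT INTO events (visitor_id, page_url, visitor_hits) VALUES",
            batch,
        )
        batch = []
```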
All in all, our architecture will look like the chart below. It is basically indefinitely scalable and built entirely from open-source tools:

I will write a series of posts working through this architecture from the clients to the storage layer. While I cannot give a complete introduction to every system and tool involved, I will try to explain them as much as needed to get through our project and learn something along the way.
This series is intended for anyone looking to build such a system themselves. You might also find it interesting if you are curious about one or all of the individual components, or about large-scale analytics systems in general.
So, let’s get our hands dirty and start with the Snowplow implementation in the next part!
