Tag: Kafka

This is the sixth part of a seven-part series explaining how to build an Enterprise Grade OpenSource Web Analytics System. In this post we are taking a brief look at what we can do with the data we collected and processed, using Clickhouse. In the previous post we built a persisted visitor profile for our visitors with Python and Redis. If you are new to this series it might help to start with the first post. During this series we defined multiple topics within Kafka, so we now have different levels of processing and persistence available. If we want to keep any of it, we should put it in persistent storage such as a Data Lake on Hadoop or a database. For this project, we are using Elasticsearch and dipping our toes into a database called Clickhouse for fun! Feeding Data into Elasticsearch From the previous part, we have a nice Kafka […]
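To give a taste of where this part is headed, here is a minimal sketch of shipping events from a Kafka topic into Elasticsearch with kafka-python and the official Python client. The topic name `processed-events` and the index name `tracking` are assumptions for illustration, not necessarily the names used in the series:

```python
import json

from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Consume the processed tracking events (topic name is an assumption).
consumer = KafkaConsumer(
    "processed-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

es = Elasticsearch("http://localhost:9200")

for message in consumer:
    event = message.value
    # Index each event as its own document (index name is an assumption).
    es.index(index="tracking", body=event)
```

Deserializing JSON once in the consumer keeps the loop itself free of manual decoding and leaves room for batching with the bulk helpers later on.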
Building an Enterprise Grade OpenSource Web Analytics System – Part 5: Visitor Profile
This is the fifth part of a seven-part series explaining how to build an Enterprise Grade OpenSource Web Analytics System. In this post we are going to build a visitor profile with Python and Redis to persist some of the data we track. In the last post we processed the raw data using Python and wrote it back to Kafka. If you are new to this series it might help to start with the first post. Now that we have a nicely processed version of our events, we want to remember certain things about our users. To do this, we are going to create a Visitor Profile in Redis as high-performance storage. The process for persisting values will look like this: Building our Visitor Profile First, in this part we set up a little helper script that takes our processed tracking events and flattens them. It looks […]
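As a rough sketch of the idea, persisting a few flattened event fields into a per-visitor Redis hash with redis-py could look like this; the key schema `visitor:<id>` and the field names are assumptions for illustration:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def persist_profile(event: dict) -> None:
    """Store selected fields of a flattened tracking event in a Redis hash."""
    key = f"visitor:{event['user_id']}"  # key schema is an assumption
    r.hset(key, mapping={
        "last_seen": event["timestamp"],
        "last_page": event["page_url"],
    })
    r.hincrby(key, "page_views", 1)      # running counter per visitor
```

A hash per visitor keeps all profile fields under one key, so lookups during event processing stay a single round trip.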
Building an Enterprise Grade OpenSource Web Analytics System – Part 4: Data Processing
This is the fourth part of a seven-part series explaining how to build an Enterprise Grade OpenSource Web Analytics System. In this post we are building the processing layer to work with our raw log lines. In the last post we used Nginx and Filebeat to write our tracking events to Kafka. If you are new to this series it might help to start with the first post. At this point in the series, we have a lot of raw tracking events in our Kafka topic. We could already use this topic to store the raw log lines in our Hadoop cluster or a database, but some additional processing now will make our life much easier later on. Since Python is the data science language of today, we will be using it. The result will then be written to another Kafka topic for further processing […]
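A minimal sketch of that consume-process-produce loop with kafka-python might look like the following; the topic names and the placeholder processing step are assumptions for illustration, since the series does the real parsing in a dedicated helper:

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    raw_line = message.value.decode("utf-8")
    # Placeholder step: real parsing of the raw log line goes here.
    processed = {"raw": raw_line, "length": len(raw_line)}
    producer.send("processed-events", processed)
```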
Building an Enterprise Grade OpenSource Web Analytics System – Part 3: Data Collection
This is the third part of a seven-part series explaining how to build an Enterprise Grade OpenSource Web Analytics System. In this post we are setting up the tracking backend with Nginx and Filebeat. In the last post we took care of the client side implementation of Snowplow Analytics. If you are new to this series it might help to start with the first post. Now that our clients are sending a lot of data, we need a backend that collects all the events we are interested in. Since we are sending our requests unencoded via GET, we can simply configure our web server to write all requests to a logfile and send them off to the processing layer. Configuring Nginx with Filebeat In our last project we used a configuration just like the one we need. As web server, we used and will […]
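To make the mechanism concrete, here is a small standard-library sketch of the kind of GET request a tracker fires against such a collection endpoint; the hostname, path, and query parameter names below are illustrative assumptions (Snowplow's tracker protocol defines its own parameter set):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Illustrative event payload; parameter names are assumptions.
params = urlencode({
    "e": "pv",                      # event type: page view
    "url": "https://example.com/",  # page the visitor viewed
    "uid": "visitor-123",           # visitor identifier
})

# Nginx writes this request line to its access log;
# Filebeat then ships the log line to Kafka.
urlopen(f"https://tracking.example.com/i?{params}")
```

Because all the event data lives in the query string, the web server never needs to parse anything: logging the request is enough to capture the event.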
Building an Enterprise Grade OpenSource Web Analytics System – Part 1: Architecture
Some time ago I wrote a little series on how to amp up your log analytics activities. Ever since then I have wanted to start another project: building a fully fledged analytics system out of OpenSource components, with client side tracking and unlimited scalability. That is what this series is about, since I had some time to kill during Easter in isolation. This time, we will be using a tracker in the browser or mobile app of our users instead of logfiles alone, which is called client side tracking. That will give us a lot more information about our visitors and allow for some cool new use cases. It is also similar to how tools like Adobe Analytics or Google Analytics work. The data we collect then has to be processed and stored for analysis and future use. As a client side tracker, we will be using the Snowplow tracker. […]