Building your own Web Analytics from Log Files – Part 2: Architecture
This is the second part of the six-part-series “Building your own Web Analytics from Log Files”.
To start off this series, let's remember what we want to achieve: We want to enable a deeper understanding of our website users by enriching and processing the log files we already collect. This article looks at the components we need for this and how to make our life as easy as possible.
To achieve our goal, we need to teach our web server to identify our users, store information about their activity in log files, ship those files to storage, and make them actionable through visualization. Because I believe in Open Source Software, we will look at our options in that category. Another requirement is to introduce as few components as possible and to keep scalability in mind.
Choosing our Web Server
The first part of our system design is one that will most likely already be decided for existing websites. It is the component that actually interfaces with our users: the web server. This component will be responsible for identifying our visitors and logging their activity to the log files we are going to analyze.
A good starting point for that decision are usage statistics like the Netcraft Web Server Survey (https://news.netcraft.com/archives/2019/12/10/december-2019-web-server-survey.html). According to that survey, our best choices at the time of writing are:
- Apache still is the most relevant Web Server on the market. It satisfies all our requirements and has been used for almost every kind of project imaginable. For many people, Apache is synonymous with OSS Web Servers.
- Second on the list is Nginx. It has been gaining market share for a long time and, according to our source, could overtake Apache sometime in 2020. It also satisfies our requirements.
- Microsoft IIS is closed source, which rules it out given our Open Source requirement. LiteSpeed is primarily a commercial product with a much smaller ecosystem, so it is not a good fit either.
So the choice comes down to Apache and Nginx. Since we will be extending the functionality quite a bit, we need easy extensibility. There is an awesome project called OpenResty, which extends Nginx with the Lua scripting language. That gives us a supercharged Nginx, which is exactly what we want. So this is our choice: we will be using OpenResty!
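To give an impression of what that extensibility looks like, here is a minimal, hypothetical nginx.conf fragment using OpenResty's `header_filter_by_lua_block` directive. The cookie name `uid` is a placeholder; the actual identification logic is the subject of a later part of this series:

```nginx
http {
    server {
        listen 80;

        location / {
            # Sketch only: assign a visitor ID cookie if the client
            # does not have one yet, reusing nginx's built-in $request_id
            # as a stand-in for a real visitor ID.
            header_filter_by_lua_block {
                if not ngx.var.cookie_uid then
                    ngx.header["Set-Cookie"] =
                        "uid=" .. ngx.var.request_id .. "; Path=/"
                end
            }
        }
    }
}
```

Being able to drop small Lua snippets into any request-processing phase like this is exactly the kind of flexibility we will rely on.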
Analyzing Log Files
Now that we have a system that identifies our users and produces log files, we need a way to store and analyze them. For this project we are going to use the ELK stack, specifically Elasticsearch and Kibana. This combination has become a de-facto standard for log file analysis and related applications such as system monitoring.
Elasticsearch is a very flexible and scalable system for storing and analyzing all kinds of data. It will eat almost anything we feed it, be it system metrics, log files, or product information. Entire monitoring systems or search engines are built around it.
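As an illustration of that flexibility, a single enriched log event could end up in Elasticsearch as a JSON document like the one below. All field names here are hypothetical and merely show the kind of structure Elasticsearch handles well:

```json
{
  "@timestamp": "2020-01-15T10:23:41Z",
  "visitor_id": "a1b2c3d4",
  "url": "/blog/web-analytics-part-2",
  "referrer": "https://www.google.com/",
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
  "status": 200
}
```

Documents like this can be aggregated, filtered, and searched without us having to define a rigid schema up front.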
Kibana will then be used to visualize the data we stored in Elasticsearch. It is specifically built to work with Elasticsearch. Let’s use that close coupling to our advantage!
Shipping Log Files
Now all we need is a way to get our log files from OpenResty to Elasticsearch. While it would be simple to write our own integration, we should stick with the most common options.
Under different requirements, I would love to use Apache NiFi for this. But since we might have quite a lot of web servers, we would need to run NiFi on every one of them, significantly increasing the overhead on those machines. Configuration management would also be hard, since NiFi is not built for a scenario like this.
Given that we will already use two thirds of the ELK stack, why not use the last part, Logstash? Logstash is awesome for processing log files because it is highly flexible and covers almost every use case. The downside is its size and resource requirements: it is quite a hefty package to put on our machines, with a lot of functionality we will not need for this project.
Luckily, Elastic has developed a stripped-down alternative to Logstash: the Beats family. Beats are small programs, each with a highly specialized function. For system metrics, there is Metricbeat. For our case we will use Filebeat, which is specifically designed to collect logs from many servers and ship them to a central Elasticsearch cluster (it supports other destinations as well, but the Elasticsearch integration is the most mature).
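A minimal Filebeat configuration for this setup could look like the sketch below. The log path and Elasticsearch hostname are assumptions for illustration; your environment will differ:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /usr/local/openresty/nginx/logs/access.log   # assumed log location
    json.keys_under_root: true   # parse JSON log lines directly into fields

output.elasticsearch:
  hosts: ["elasticsearch.example.com:9200"]   # assumed Elasticsearch endpoint
```

That is the whole idea behind Beats: a few lines of configuration per server instead of a full processing pipeline on every machine.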
And here we are with our final setup! We will use OpenResty as our web server and teach it to identify our users and generate the data exactly how we need it. That data is written to log files, which will be picked up by Filebeat and stored in Elasticsearch. Once the data is there, we will build some dashboards in Kibana and try out some standard analytics use cases.
The next parts of this series will walk through the necessary steps one by one.
German Analyst and Data Scientist working in and writing about (Web) Analytics and Online Marketing Tech.
2021&2022 Adobe Analytics Champion