Building your own Web Analytics from Log Files – Part 4: Data Collection and Processing
This is the fourth part of the six-part series “Building your own Web Analytics from Log Files”.
Legal Disclaimer: This post describes how to identify and track the users on your website using cookies, IP addresses and browser fingerprinting. The information and process described here may be subject to data privacy regulations in your jurisdiction. It is your responsibility to comply with all regulations. Please educate yourself on whether regulations like GDPR apply to your use case (which is very likely), and act responsibly.
In the last part we built a configuration for OpenResty that generates user and session IDs and stores them in browser cookies. Now we need a way to actually log and collect those IDs together with the requests our web server handles.
OpenResty Configuration
To be able to log our custom variables, we first need to declare them in the Nginx configuration, because Lua can only assign to Nginx variables that already exist. This is done directly in the server block. We need variables for the user and session IDs, for the flags that tell us whether the user or session is new, and for the query parameters:
server {
    listen 8888;
    server_name localhost;
    set $a_visitor_id '';
    set $a_visitor_new '';
    set $a_session_id '';
    set $a_session_entry '';
    set $a_query_args '';
Inside our header_filter_by_lua_block we can now assign the values from the previous part to those variables:
ngx.var.a_visitor_id = a_visitor_id
ngx.var.a_visitor_new = a_visitor_new
ngx.var.a_session_id = a_session_id
ngx.var.a_session_entry = a_session_entry
ngx.var.a_query_args = a_query_args
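Now Nginx knows about the variables and we can use them in our logging configuration, which lives in the http block of the config. We define a new log format called “tracking” and set an access_log that uses it, so every request to the web server produces one log line in the format we specify. To make our life easy, we write the logs as JSON. We include a lot of useful variables we would log anyway, plus our tracking variables. Note that escape=none makes Nginx write every value verbatim, which the pre-built JSON object in $a_query_args needs, but it also means that a value containing a double quote (in the User-Agent header, for example) produces a line that is not valid JSON: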
log_format tracking escape=none '{"remote_addr":"$remote_addr","remote_user":"$remote_user","scheme":"$scheme","uri":"$uri","connection_request":"$connection_requests","connection":"$connection","msec":"$msec","pid":"$pid","host":"$host","server_name":"$server_name","time":"$time_iso8601","request":"$request","status":"$status","size":"$body_bytes_sent","query_string":"$query_string","referer":"$http_referer","user_agent":"$http_user_agent","proxy_ip":"$http_x_forwarded_for","visitor_id":"$a_visitor_id","new_visitor":$a_visitor_new,"session_id":"$a_session_id","session_entry":$a_session_entry,"query_array":$a_query_args}';
access_log logs/access-tracking.log tracking;
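Because escape=none writes variable values verbatim, it can be worth checking that lines in this format actually parse as JSON. The following is a minimal Python sketch and not part of the original setup; the function name parse_tracking_line and the shortened sample lines are made up for illustration.

```python
import json
from typing import Optional


def parse_tracking_line(line: str) -> Optional[dict]:
    """Return the decoded log record, or None if the line is not valid JSON."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    # A tracking line is expected to be a JSON object, not a bare value.
    return record if isinstance(record, dict) else None


# Shortened, made-up lines in the shape of the "tracking" format.
good = '{"uri":"/index.html","user_agent":"Mozilla/5.0","new_visitor":false,"query_array":{}}'
# With escape=none, a double quote inside a header value lands unescaped in the log.
bad = '{"uri":"/index.html","user_agent":"evil"agent","new_visitor":false,"query_array":{}}'

print(parse_tracking_line(good) is not None)
print(parse_tracking_line(bad) is not None)
```

Lines that come back as None are the ones a JSON-based shipper would most likely fail to decode as well, so counting them gives a rough idea of how much data is affected.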
Put together, we have an Nginx configuration with all the parts we need:
http {
    log_format tracking escape=none '{"remote_addr":"$remote_addr","remote_user":"$remote_user","scheme":"$scheme","uri":"$uri","connection_request":"$connection_requests","connection":"$connection","msec":"$msec","pid":"$pid","host":"$host","server_name":"$server_name","time":"$time_iso8601","request":"$request","status":"$status","size":"$body_bytes_sent","query_string":"$query_string","referer":"$http_referer","user_agent":"$http_user_agent","proxy_ip":"$http_x_forwarded_for","visitor_id":"$a_visitor_id","new_visitor":$a_visitor_new,"session_id":"$a_session_id","session_entry":$a_session_entry,"query_array":$a_query_args}';
    access_log logs/access-tracking.log tracking;
    server {
        listen 8888;
        server_name localhost;
        # Declare the variables so the Lua block can assign to them
        set $a_visitor_id '';
        set $a_visitor_new '';
        set $a_session_id '';
        set $a_session_entry '';
        set $a_query_args '';
        header_filter_by_lua_block {
            -- Fingerprint shared by both IDs; missing headers fall back to an empty string
            local headers = ngx.req.get_headers()
            local fingerprint = ngx.var.remote_addr
                .. (ngx.var.http_user_agent or '')
                .. (headers["Accept"] or '')
                .. (headers["Accept-Encoding"] or '')
                .. (headers["Accept-Language"] or '')
            -- Visitor ID: reuse the cookie or derive a new ID from the fingerprint
            local a_visitor_id = ngx.var["cookie_a_vid"]
            local a_visitor_new = "false"
            if not a_visitor_id then
                a_visitor_id = ngx.md5(fingerprint)
                a_visitor_new = "true"
            end
            -- Session ID: same idea, with the request time mixed in
            local a_session_id = ngx.var["cookie_a_sid"]
            local a_session_entry = "false"
            if not a_session_id then
                a_session_id = ngx.md5(fingerprint .. ngx.var.msec)
                a_session_entry = "true"
            end
            -- Renew both cookies: two years for the visitor, 30 minutes for the session
            ngx.header["Set-Cookie"] = {
                "a_vid=" .. a_visitor_id .. "; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 63072000),
                "a_sid=" .. a_session_id .. "; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 1800)
            }
            -- Serialize the query parameters as a JSON object;
            -- repeated parameters are joined with a pipe character
            local args = ngx.req.get_uri_args()
            local arr = {}
            for key, val in pairs(args) do
                if type(val) == "table" then
                    table.insert(arr, '"' .. key .. '":"' .. table.concat(val, "|") .. '"')
                else
                    -- tostring() covers parameters without a value, which arrive as boolean true
                    table.insert(arr, '"' .. key .. '":"' .. tostring(val) .. '"')
                end
            end
            local a_query_args = "{" .. table.concat(arr, ", ") .. "}"
            -- Hand the values over to the Nginx variables used in log_format
            ngx.var.a_visitor_id = a_visitor_id
            ngx.var.a_visitor_new = a_visitor_new
            ngx.var.a_session_id = a_session_id
            ngx.var.a_session_entry = a_session_entry
            ngx.var.a_query_args = a_query_args
        }
    }
}
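To make the logic of the Lua block easier to follow outside of OpenResty, here is a rough Python sketch of the two derivations it performs: the MD5-based IDs and the query parameter object. The function names fingerprint_id and query_args_json are hypothetical, and the sketch assumes ASCII header values. Unlike the hand-built string in the Lua block, it uses json.dumps, which escapes quotes in keys and values.

```python
import hashlib
import json
from typing import Dict, List, Optional, Union


def fingerprint_id(remote_addr: str, user_agent: str = "", accept: str = "",
                   accept_encoding: str = "", accept_language: str = "",
                   msec: Optional[str] = None) -> str:
    """MD5 hex digest of the concatenated request properties.

    Without msec this mirrors the visitor ID, with msec the session ID.
    """
    raw = remote_addr + user_agent + accept + accept_encoding + accept_language
    if msec is not None:
        raw += msec
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def query_args_json(args: Dict[str, Union[str, List[str]]]) -> str:
    """Serialize query parameters as a JSON object, joining repeated values with '|'."""
    flat = {key: "|".join(val) if isinstance(val, list) else val
            for key, val in args.items()}
    return json.dumps(flat)


# Made-up request properties for illustration.
visitor_id = fingerprint_id("127.0.0.1", "Mozilla/5.0", "*/*", "gzip", "en-US")
session_id = fingerprint_id("127.0.0.1", "Mozilla/5.0", "*/*", "gzip", "en-US",
                            msec="1578141208.707")
print(visitor_id, session_id)
print(query_args_json({"utm_source": "newsletter", "tag": ["a", "b"]}))
```

The visitor ID stays the same for identical request properties, while the session ID changes with the request time. This is why both are only computed when the corresponding cookie is missing.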
This gives us log lines like this:
{"remote_addr":"127.0.0.1","remote_user":"","scheme":"http","uri":"/index.html","connection_request":"1","connection":"2","msec":"1578043737.950","pid":"3080","host":"127.0.0.1","server_name":"localhost","time":"2020-01-03T10:28:57+01:00","request":"GET / HTTP/1.1","status":"304","size":"0","query_string":"","referer":"","user_agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36","proxy_ip":"","visitor_id":"7c958ba7becea5f5bea54a80b899330e","new_visitor":false,"session_id":"c5d80b3d6f3f4706cac4eea0a8ad2f12","session_entry":false,"query_array":{}}
For a new user, it could look like this. Note the session_entry and new_visitor keys:
{"remote_addr":"127.0.0.1","remote_user":"","scheme":"http","uri":"/index.html","connection_request":"1","connection":"1","msec":"1578141208.707","pid":"12660","host":"127.0.0.1","server_name":"localhost","time":"2020-01-04T13:33:28+01:00","request":"GET / HTTP/1.1","status":"200","size":"674","query_string":"","referer":"","user_agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36","proxy_ip":"","visitor_id":"b9323e1cd1379aeb504d40f0ce9c822c","new_visitor":true,"session_id":"607a9bce4d67e4d6c0e9b98848d7f9d1","session_entry":true,"query_array":{}}
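Since every line is a self-contained JSON object, a few lines of Python are enough for a first look at the data before any further tooling is involved. The sketch below is only an illustration: the function name summarize is made up, and the sample lines are reduced to the three tracking keys used here.

```python
import json
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Count requests, new visitors, session entries and distinct visitor IDs."""
    summary = {"requests": 0, "new_visitors": 0, "session_entries": 0}
    visitors = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        summary["requests"] += 1
        summary["new_visitors"] += int(bool(record.get("new_visitor")))
        summary["session_entries"] += int(bool(record.get("session_entry")))
        visitors.add(record.get("visitor_id"))
    summary["unique_visitors"] = len(visitors)
    return summary


# Reduced versions of the two log lines shown above.
sample = [
    '{"visitor_id":"7c958ba7becea5f5bea54a80b899330e","new_visitor":false,"session_entry":false}',
    '{"visitor_id":"b9323e1cd1379aeb504d40f0ce9c822c","new_visitor":true,"session_entry":true}',
]
print(summarize(sample))
# → {'requests': 2, 'new_visitors': 1, 'session_entries': 1, 'unique_visitors': 2}
```

To run it against a real file, the same function can be called with an open file handle, for example summarize(open("access-tracking.log", encoding="utf-8")).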
Filebeat Configuration
To ship our log files to Elasticsearch, we need to tell Filebeat where the log file is, that it is in JSON format and where to send it. Since we want to organize our log data in daily indices, the config for our project is as simple as this:
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - PATH-TO-LOGS\access-tracking.log
  json.add_error_key: true
setup.ilm.enabled: false
setup.template.name: "custom-filebeat-tracking-logs-%{[agent.version]}"
setup.template.pattern: "custom-filebeat-tracking-logs-%{[agent.version]}-*"
output.elasticsearch:
  hosts: ["ES-HOST"]
  index: "custom-filebeat-tracking-logs-%{[agent.version]}-%{+yyyy.MM.dd}"
You need to modify the path to the log file, the Elasticsearch host and the index format. The format here gives us indices like “custom-filebeat-tracking-logs-7.4.0-2020.01.03”.
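If you later want to address a specific daily index from a script, the naming scheme is easy to reproduce. The following Python sketch assumes the index setting shown above is used unchanged; the function name index_name is hypothetical, and which day an event falls on depends on the timestamp Filebeat assigns to it.

```python
from datetime import date


def index_name(agent_version: str, day: date) -> str:
    """Build the daily index name that results from the index setting above."""
    return f"custom-filebeat-tracking-logs-{agent_version}-{day:%Y.%m.%d}"


print(index_name("7.4.0", date(2020, 1, 3)))
# → custom-filebeat-tracking-logs-7.4.0-2020.01.03
```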
Now we are able to track our users, generate log files with user information and ship them to Elasticsearch.
In the next part we will start working with the data and build a nice dashboard out of it.
