Building your own Web Analytics from Log Files – Part 4: Data Collection and Processing

This is the fourth part of the six-part series “Building your own Web Analytics from Log Files”.

Legal Disclaimer: This post describes how to identify and track the users of your website using cookies, IP addresses and browser fingerprinting. The information and processes described here may be subject to data privacy regulations in your jurisdiction. It is your responsibility to comply with all applicable regulations. Please find out whether regulations such as the GDPR apply to your use case (which is very likely), and act responsibly.

In the last part, we built a configuration for OpenResty that generates user and session IDs and stores them in browser cookies. Now we need a way to actually log and collect those IDs together with the requests our web server handles.

OpenResty Configuration

To log our custom variables, we first need to declare them in Nginx. This is done in the server section of the configuration. We need variables for the visitor and session IDs, for the flags that indicate whether the visitor or the session is new, and for the query parameters:

server {
    listen       8888;
    server_name  localhost;

    set $a_visitor_id '';
    set $a_visitor_new '';
    set $a_session_id '';
    set $a_session_entry '';
    set $a_query_args '';

In our header_filter_by_lua_block we can now assign the values from the previous part to those variables:

ngx.var.a_visitor_id = a_visitor_id
ngx.var.a_visitor_new = a_visitor_new
ngx.var.a_session_id = a_session_id
ngx.var.a_session_entry = a_session_entry
ngx.var.a_query_args = a_query_args

Now Nginx knows about the variables and we can use them in our logging configuration, which lives in the http section of the config. We define a new log format called “tracking” and set an access_log that uses it. This gives us one log line, in the format we specify, for every request the web server handles. To make our life easy, we write the logs as JSON. We include a lot of useful variables we would log anyway, plus our tracking variables:

log_format  tracking escape=none '{"remote_addr":"$remote_addr","remote_user":"$remote_user","scheme":"$scheme","uri":"$uri","connection_request":"$connection_requests","connection":"$connection","msec":"$msec","pid":"$pid","host":"$host","server_name":"$server_name","time":"$time_iso8601","request":"$request","status":"$status","size":"$body_bytes_sent","query_string":"$query_string","referer":"$http_referer","user_agent":"$http_user_agent","proxy_ip":"$http_x_forwarded_for","visitor_id":"$a_visitor_id","new_visitor":$a_visitor_new,"session_id":"$a_session_id","session_entry":$a_session_entry,"query_array":$a_query_args}';

access_log  logs/access-tracking.log  tracking;
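
Because the format uses escape=none, Nginx writes variable values into the log line verbatim. The following Python sketch is not part of the setup. It imitates that substitution with a shortened, hypothetical template (render_line, is_valid_json and the two-field template are made up for illustration) to show why a double quote in a value, for example in a crafted user agent, yields a line that is no longer valid JSON:

```python
import json

# Shortened stand-in for the log_format above (hypothetical, two fields only).
TEMPLATE = '{"uri":"$uri","user_agent":"$http_user_agent"}'


def render_line(values: dict) -> str:
    """Substitute values verbatim, the way escape=none leaves them unescaped."""
    line = TEMPLATE
    # Replace longer names first so "$uri" cannot clobber a longer variable name.
    for name in sorted(values, key=len, reverse=True):
        line = line.replace("$" + name, values[name])
    return line


def is_valid_json(line: str) -> bool:
    """Return True if the rendered log line parses as JSON."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


good = render_line({"uri": "/index.html", "http_user_agent": "Mozilla/5.0"})
bad = render_line({"uri": "/index.html", "http_user_agent": 'evil"agent'})

print(is_valid_json(good))
print(is_valid_json(bad))
```

Lines like the second one will likely fail to decode further down the pipeline, so it can be worth keeping an eye on decoding errors in the collected data.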

At the end, we have an Nginx configuration with all the parts we need:

http {

    log_format  tracking escape=none '{"remote_addr":"$remote_addr","remote_user":"$remote_user","scheme":"$scheme","uri":"$uri","connection_request":"$connection_requests","connection":"$connection","msec":"$msec","pid":"$pid","host":"$host","server_name":"$server_name","time":"$time_iso8601","request":"$request","status":"$status","size":"$body_bytes_sent","query_string":"$query_string","referer":"$http_referer","user_agent":"$http_user_agent","proxy_ip":"$http_x_forwarded_for","visitor_id":"$a_visitor_id","new_visitor":$a_visitor_new,"session_id":"$a_session_id","session_entry":$a_session_entry,"query_array":$a_query_args}';

    access_log  logs/access-tracking.log  tracking;

    server {
        listen       8888;
        server_name  localhost;

        set $a_visitor_id '';
        set $a_visitor_new '';
        set $a_session_id '';
        set $a_session_entry '';
        set $a_query_args '';

        header_filter_by_lua_block {
            -- Request values used for the fingerprint; missing headers become ''
            local headers = ngx.req.get_headers()
            local fingerprint = ngx.var.remote_addr
                .. (ngx.var.http_user_agent or '')
                .. (headers["Accept"] or '')
                .. (headers["Accept-Encoding"] or '')
                .. (headers["Accept-Language"] or '')

            -- Reuse the visitor ID from the cookie, or derive a new one
            local a_visitor_id = ngx.var["cookie_a_vid"]
            local a_visitor_new = "false"
            if not a_visitor_id then
                a_visitor_id = ngx.md5(fingerprint)
                a_visitor_new = "true"
            end

            -- Reuse the session ID from the cookie, or derive a new one (adds a timestamp)
            local a_session_id = ngx.var["cookie_a_sid"]
            local a_session_entry = "false"
            if not a_session_id then
                a_session_id = ngx.md5(fingerprint .. ngx.var.msec)
                a_session_entry = "true"
            end

            ngx.header["Set-Cookie"] = {"a_vid="..a_visitor_id.."; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 63072000), "a_sid="..a_session_id.."; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 1800)}

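            -- Serialize the query parameters as a flat JSON object
            local args = ngx.req.get_uri_args()
            -- Minimal escaping: backslashes and double quotes only
            local function esc(s)
                return (tostring(s):gsub('[\\"]', '\\%0'))
            end
            local arr = {}
            for key, val in pairs(args) do
                if type(val) == "table" then
                    -- Repeated parameters are joined with "|"
                    local parts = {}
                    for i, v in ipairs(val) do parts[i] = tostring(v) end
                    val = table.concat(parts, "|")
                end
                table.insert(arr, '"' .. esc(key) .. '":"' .. esc(val) .. '"')
            end
            local a_query_args = "{" .. table.concat(arr, ", ") .. "}"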


            ngx.var.a_visitor_id = a_visitor_id
            ngx.var.a_visitor_new = a_visitor_new
            ngx.var.a_session_id = a_session_id
            ngx.var.a_session_entry = a_session_entry
            ngx.var.a_query_args = a_query_args
        }
    }
}
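
The query_array field is probably the least obvious part of the configuration. As a rough illustration outside of Nginx, this Python sketch mirrors what the Lua loop is meant to produce: one flat JSON object per request, with repeated parameters joined by "|". The function name query_args_to_json is hypothetical, and json.dumps stands in for the manual string concatenation, so details such as URL decoding may differ from what OpenResty does:

```python
import json
from urllib.parse import parse_qs


def query_args_to_json(query_string: str) -> str:
    """Build a flat JSON object from a query string, joining repeated keys with '|'."""
    # keep_blank_values keeps parameters such as "debug=" instead of dropping them
    parsed = parse_qs(query_string, keep_blank_values=True)
    flat = {key: "|".join(values) for key, values in parsed.items()}
    return json.dumps(flat)


print(query_args_to_json("utm_source=newsletter&tag=a&tag=b"))
print(query_args_to_json(""))
```

A request without any query parameters ends up with an empty object, which matches the "query_array":{} seen in the log lines.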

This gives us log lines like this:

{"remote_addr":"127.0.0.1","remote_user":"","scheme":"http","uri":"/index.html","connection_request":"1","connection":"2","msec":"1578043737.950","pid":"3080","host":"127.0.0.1","server_name":"localhost","time":"2020-01-03T10:28:57+01:00","request":"GET / HTTP/1.1","status":"304","size":"0","query_string":"","referer":"","user_agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36","proxy_ip":"","visitor_id":"7c958ba7becea5f5bea54a80b899330e","new_visitor":false,"session_id":"c5d80b3d6f3f4706cac4eea0a8ad2f12","session_entry":false,"query_array":{}}

For a new visitor, the line could look like this. Note that the new_visitor and session_entry keys are now true:

{"remote_addr":"127.0.0.1","remote_user":"","scheme":"http","uri":"/index.html","connection_request":"1","connection":"1","msec":"1578141208.707","pid":"12660","host":"127.0.0.1","server_name":"localhost","time":"2020-01-04T13:33:28+01:00","request":"GET / HTTP/1.1","status":"200","size":"674","query_string":"","referer":"","user_agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36","proxy_ip":"","visitor_id":"b9323e1cd1379aeb504d40f0ce9c822c","new_visitor":true,"session_id":"607a9bce4d67e4d6c0e9b98848d7f9d1","session_entry":true,"query_array":{}}
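
Since every line is a self-contained JSON document, the log can be inspected with a few lines of code before any shipping is set up. A minimal Python sketch, assuming lines in the format shown above (the two sample lines are shortened to the tracking fields, and summarize is a hypothetical helper):

```python
import json

# Shortened versions of the two sample lines above (tracking fields only).
SAMPLE_LINES = [
    '{"uri":"/index.html","visitor_id":"7c958ba7becea5f5bea54a80b899330e",'
    '"new_visitor":false,"session_id":"c5d80b3d6f3f4706cac4eea0a8ad2f12",'
    '"session_entry":false,"query_array":{}}',
    '{"uri":"/index.html","visitor_id":"b9323e1cd1379aeb504d40f0ce9c822c",'
    '"new_visitor":true,"session_id":"607a9bce4d67e4d6c0e9b98848d7f9d1",'
    '"session_entry":true,"query_array":{}}',
]


def summarize(lines):
    """Count requests, distinct visitors, new visitors and session entries."""
    events = [json.loads(line) for line in lines if line.strip()]
    return {
        "requests": len(events),
        "visitors": len({e["visitor_id"] for e in events}),
        "new_visitors": sum(1 for e in events if e["new_visitor"]),
        "session_entries": sum(1 for e in events if e["session_entry"]),
    }


print(summarize(SAMPLE_LINES))
```

For a real log file, the same function could be fed the lines of access-tracking.log instead of the sample list.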

Filebeat Configuration

To ship our log files to Elasticsearch, we need to tell Filebeat where the log file is, that it is in JSON format, and where to send it. Since we want to organize our log data in daily indices, the config for our project is as simple as this:

filebeat.inputs:

- type: log
  enabled: true
  paths:
    - PATH-TO-LOGS\access-tracking.log
  json.add_error_key: true

setup.ilm.enabled: false
setup.template.name: "custom-filebeat-tracking-logs-%{[agent.version]}"
setup.template.pattern: "custom-filebeat-tracking-logs-%{[agent.version]}-*"

output.elasticsearch:
  hosts: ["ES-HOST"]
  index: "custom-filebeat-tracking-logs-%{[agent.version]}-%{+yyyy.MM.dd}"

You need to modify the path to the log file, the Elasticsearch host and the index format. The format here gives us indices like “custom-filebeat-tracking-logs-7.4.0-2020.01.03”.
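
To make the index pattern concrete, the following Python sketch builds the same kind of name by hand. It only illustrates the naming scheme. daily_index_name is a hypothetical helper, and Filebeat itself fills in the agent version and the date of the event:

```python
from datetime import date

PREFIX = "custom-filebeat-tracking-logs"


def daily_index_name(agent_version: str, day: date) -> str:
    """Mimic the index setting: prefix, agent version, then yyyy.MM.dd."""
    return f"{PREFIX}-{agent_version}-{day.strftime('%Y.%m.%d')}"


print(daily_index_name("7.4.0", date(2020, 1, 3)))
```

With one index per day, old data can later be removed by deleting whole indices instead of individual documents.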

Now we are able to track our users, generate log files with user information and ship them to Elasticsearch.

In the next part we will start working with the data and build a nice dashboard out of it.
