Building your own Web Analytics from Log Files – Part 3: Setting up Nginx with OpenResty

This is the third part of the six-part series “Building your own Web Analytics from Log Files”.

Legal Disclaimer: This post describes how to identify and track the users on your website using cookies and browser fingerprinting. The information and process described here may be subject to data privacy regulations in your jurisdiction. It is your responsibility to comply with all regulations. Please find out whether regulations like the GDPR apply to your use case (which is very likely), and act responsibly.

Identifying Users and Sessions

One of our goals for this project is to be able to tell how many people are using our site. This means we need a way to differentiate between the users on our site.

One approach would be to look at the IP addresses of our users. This is not very precise since all devices with the same internet connection share an IP address. Especially for offices or other shared connections we have a lot of fuzziness. Also, with mobile devices becoming more and more popular, those addresses can change very quickly.

Our next option would be browser fingerprinting. With a pure backend solution like the one we are building, we are limited to the request headers the user’s browser sends when a page is requested. In there we have a user agent and which languages and encodings the browser accepts. That is not very unique, so we would treat a lot of users as the same person. We could combine it with the IP address to tackle that, but that address might change and inflate our user numbers.

To avoid those issues, we will be using cookies to identify our users. When a new user comes to our site, they will get a cookie with a unique ID. On every request after that, the browser will include that cookie, allowing us to attribute those requests to that specific user. Depending on how we generate that ID, we might be able to fingerprint users, giving them the same ID even if cookies are deleted.

The same will be done on a session basis. This will basically be a second cookie with a different expiration.

So, let’s start building that!

OpenResty setup

Setting up our web server is the easiest step. If you have experience with setting up Nginx, you most likely know what to do already.

Information on how to install OpenResty can be found on the OpenResty website. They offer packages for Linux and Homebrew instructions for macOS. Installation on Windows is as easy as downloading the Zip archive and running the binary.

For development I run OpenResty locally on port 8888. Once everything runs smoothly we can take a look at the configuration to start identifying our users.

OpenResty Directives

As I described before, OpenResty plugs into the different phases Nginx goes through to serve a request. The phases are described in detail in the OpenResty documentation. This allows us to tap into exactly the point in the request chain that we need.

For our project we are going to use the header filter phase. It runs after the content phase, so everything regarding initialization and rewriting has already happened and the response headers are about to be sent. Since we want to give some cookies to our users, we are going to use the header_filter_by_lua directive to modify the response headers the web server generates.

What we need to do here is to check if our user already has the cookies we need. If that is the case, we will use the ID stored in those cookies. If not, we will generate an ID and store it in the cookies and use it for logging.
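
The decision described above can be sketched in plain Lua, outside of Nginx. This is only an illustration: resolve_id and fake_generator are hypothetical names that do not appear in the actual configuration, and the generator is a stand-in for the hashing we will do later.

```lua
-- Hypothetical helper (not part of the real config): reuse the ID from an
-- existing cookie, or fall back to a freshly generated one.
-- `cookie_value` is the cookie's value, or nil when the cookie is missing.
-- `generate` is any function that returns a new ID string.
local function resolve_id(cookie_value, generate)
    if cookie_value then
        return cookie_value, false   -- known ID, so the user is not new
    end
    return generate(), true          -- new ID, flagged as new
end

-- Stand-in generator for illustration only.
local function fake_generator()
    return "generated-id"
end

local known_id, known_is_new = resolve_id("abc123", fake_generator)
local fresh_id, fresh_is_new = resolve_id(nil, fake_generator)
print(known_id, known_is_new)
print(fresh_id, fresh_is_new)
```

The second return value plays the role of the "new user" flag we will want in our logs later on.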

Configuration

The first part of the configuration is to declare the server section of our Nginx config and add an empty header_filter_by_lua_block:

server {
   listen       8888;
   server_name  localhost;

   header_filter_by_lua_block {
   }
}

Inside the header_filter_by_lua_block we can put our Lua code that handles the cookie and ID part. As the first thing we will check if the user already has an ID cookie. I named our cookies “a_vid” for the user ID and “a_sid” for the session ID. Since Nginx exposes cookie values as variables we access it via the ngx.var object. Also, we are checking if the user is new to our site (if they don’t have an ID).
The interesting part happens after that: If the user does not have an ID we hash the request headers and the IP address to generate that ID. Note we are using an empty string if those headers are not present. At the end we have two variables, a_visitor_id and a_visitor_new, which tell us the ID of the user and if they are new to our site:

a_visitor_id = ngx.var["cookie_a_vid"]
a_visitor_new = "false"
if not a_visitor_id then
    -- Missing headers fall back to an empty string because concatenating nil raises an error
    a_visitor_id = ngx.md5(ngx.var.remote_addr
        ..(ngx.var.http_user_agent or '')
        ..(ngx.req.get_headers()["Accept"] or '')
        ..(ngx.req.get_headers()["Accept-Encoding"] or '')
        ..(ngx.req.get_headers()["Accept-Language"] or ''))
    a_visitor_new = "true"
end
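
To see why the empty-string fallback matters, here is a small sketch in plain Lua that builds the string we would feed into the hash. The function name fingerprint_source, the plain headers table and the "|" separator are assumptions made for this example; the configuration itself concatenates the values directly and reads them from ngx.var and ngx.req.get_headers().

```lua
-- Hypothetical helper: join the fingerprint inputs into one string,
-- substituting "" for anything missing so concatenation never hits nil.
-- `headers` is a plain table here, keyed by the exact header name.
local function fingerprint_source(remote_addr, headers)
    local parts = {
        remote_addr or "",
        headers["User-Agent"] or "",
        headers["Accept"] or "",
        headers["Accept-Encoding"] or "",
        headers["Accept-Language"] or "",
    }
    -- The separator is an assumption of this sketch; it keeps the fields apart.
    return table.concat(parts, "|")
end

-- A client that sends only two of the four headers still yields a usable string.
print(fingerprint_source("10.0.0.1", { ["User-Agent"] = "UA", ["Accept"] = "*/*" }))
```

Two visitors behind the same IP address with identical headers produce the same string and therefore the same ID, which is the fuzziness discussed at the beginning of this post.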

Next, we are doing the same thing to get our session ID. We also want to know if the current request is the first of a session. At the end, we have two more variables, a_session_id and a_session_entry, with the latter telling us if the request is considered an entry (meaning the first request for a session). Note that we also include the current timestamp in the ID generation here to have some entropy in the IDs:

a_session_id = ngx.var["cookie_a_sid"]
a_session_entry = "false"
if not a_session_id then
    -- Same inputs as the visitor ID, plus the current timestamp (ngx.var.msec)
    a_session_id = ngx.md5(ngx.var.remote_addr
        ..(ngx.var.http_user_agent or '')
        ..(ngx.req.get_headers()["Accept"] or '')
        ..(ngx.req.get_headers()["Accept-Encoding"] or '')
        ..(ngx.req.get_headers()["Accept-Language"] or '')
        ..ngx.var.msec)
    a_session_entry = "true"
end

Now that we have those IDs, we want to set or refresh our cookies. This is done by setting the Set-Cookie header via the ngx.header table. For this project we want to be able to identify our users for two years. A session should be held open for half an hour, which is a common default for sessions in web analytics. Because the cookies are refreshed on every request, the half hour counts from the user's last request:

ngx.header["Set-Cookie"] = {"a_vid="..a_visitor_id.."; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 63072000), "a_sid="..a_session_id.."; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 1800)}
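
The two numbers in that line are lifetimes in seconds. The following plain Lua sketch spells out the arithmetic and shows how one cookie string is put together. The names VISITOR_TTL, SESSION_TTL and cookie_value are made up for this example, and the date string is hard-coded here, whereas the configuration gets it from ngx.cookie_time.

```lua
-- Lifetimes in seconds, spelled out so the magic numbers are easy to verify.
local VISITOR_TTL = 2 * 365 * 24 * 60 * 60   -- two years
local SESSION_TTL = 30 * 60                  -- half an hour

-- Hypothetical helper: build one Set-Cookie value. `expires` is an already
-- formatted date string (in OpenResty it would come from ngx.cookie_time).
local function cookie_value(name, id, expires)
    return name .. "=" .. id .. "; Path=/; Expires=" .. expires
end

print(VISITOR_TTL, SESSION_TTL)
print(cookie_value("a_sid", "abc123", "Thu, 01-Jan-26 00:30:00 GMT"))
```

Naming the lifetimes like this is optional, but it makes it easier to change the session length later without hunting for the right number.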

To make our life a bit easier, we will put our query parameters in an easy-to-digest form. This can be helpful if we want to track marketing campaigns later on. All we need to do is iterate over the query parameters, which we can get from ngx.req.get_uri_args(). This block gives us the a_query_args string, which contains the parameters in a structured, JSON-like form:

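local args, err = ngx.req.get_uri_args()
arr = {}
a_query_args = "{"
for key, val in pairs(args) do
    if type(val) == "table" then
        -- A parameter that appears several times arrives as a table of values
        local vals = {}
        for i, v in ipairs(val) do vals[i] = tostring(v) end
        table.insert(arr,'"'..key..'":"'.. table.concat(vals, "|") ..'"')
    else
        -- tostring() is needed because a parameter without a value (?foo) is the boolean true
        table.insert(arr,'"'..key..'":"'..tostring(val)..'"')
    end
end
a_query_args = a_query_args .. table.concat(arr, ", ")
a_query_args = a_query_args .. "}"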
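
One thing the concatenation above does not do is escape the values, so a double quote inside a parameter would break the structure of the string. Here is a minimal sketch of how escaping could look in plain Lua. The helpers escape_json_string and encode_args are hypothetical names for this example, and the sketch only handles backslashes and double quotes, so it is not a complete JSON encoder. In OpenResty, the bundled cjson library would be another option.

```lua
-- Hypothetical helper: escape backslashes and double quotes so a value
-- cannot break out of the surrounding quotes. Control characters are not
-- handled, so this is not a complete JSON encoder.
local function escape_json_string(s)
    s = tostring(s)
    s = s:gsub('\\', '\\\\')
    s = s:gsub('"', '\\"')
    return s
end

-- Hypothetical helper: turn a table of query arguments into a JSON-like string.
local function encode_args(args)
    local keys = {}
    for key in pairs(args) do
        keys[#keys + 1] = key
    end
    table.sort(keys)  -- pairs() has no guaranteed order; sort for stable output

    local items = {}
    for _, key in ipairs(keys) do
        local val = args[key]
        if type(val) == "table" then
            -- Repeated parameters: join all values with a pipe
            local parts = {}
            for i, v in ipairs(val) do
                parts[i] = tostring(v)
            end
            val = table.concat(parts, "|")
        end
        items[#items + 1] = '"' .. escape_json_string(key) .. '":"' .. escape_json_string(val) .. '"'
    end
    return "{" .. table.concat(items, ", ") .. "}"
end

print(encode_args({ utm_source = 'news"letter', flag = true, tag = { "a", "b" } }))
```

Sorting the keys is not required for logging, but it makes the output reproducible, which helps when comparing log lines.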

At the end, our block of code is able to identify users and sessions and gives us a nice JSON-like string of query parameters:

server {
    listen       8888;
    server_name  localhost;

    header_filter_by_lua_block {
        a_visitor_id = ngx.var["cookie_a_vid"]
        a_visitor_new = "false"
        if not a_visitor_id then
            -- Missing headers fall back to an empty string because concatenating nil raises an error
            a_visitor_id = ngx.md5(ngx.var.remote_addr
                ..(ngx.var.http_user_agent or '')
                ..(ngx.req.get_headers()["Accept"] or '')
                ..(ngx.req.get_headers()["Accept-Encoding"] or '')
                ..(ngx.req.get_headers()["Accept-Language"] or ''))
            a_visitor_new = "true"
        end

        a_session_id = ngx.var["cookie_a_sid"]
        a_session_entry = "false"
        if not a_session_id then
            -- Same inputs as the visitor ID, plus the current timestamp (ngx.var.msec)
            a_session_id = ngx.md5(ngx.var.remote_addr
                ..(ngx.var.http_user_agent or '')
                ..(ngx.req.get_headers()["Accept"] or '')
                ..(ngx.req.get_headers()["Accept-Encoding"] or '')
                ..(ngx.req.get_headers()["Accept-Language"] or '')
                ..ngx.var.msec)
            a_session_entry = "true"
        end

        ngx.header["Set-Cookie"] = {"a_vid="..a_visitor_id.."; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 63072000), "a_sid="..a_session_id.."; Path=/; Expires=" .. ngx.cookie_time(ngx.time() + 1800)}

        local args, err = ngx.req.get_uri_args()
        arr = {}
        a_query_args = "{"
        for key, val in pairs(args) do
            if type(val) == "table" then
                -- A parameter that appears several times arrives as a table of values
                local vals = {}
                for i, v in ipairs(val) do vals[i] = tostring(v) end
                table.insert(arr,'"'..key..'":"'.. table.concat(vals, "|") ..'"')
            else
                -- tostring() is needed because a parameter without a value (?foo) is the boolean true
                table.insert(arr,'"'..key..'":"'..tostring(val)..'"')
            end
        end
        a_query_args = a_query_args .. table.concat(arr, ", ")
        a_query_args = a_query_args .. "}"
    }
}

This block checks every request for our cookies and sets them as needed. After this block, we include the rest of our Nginx configuration as usual.

In the next part, we will look at how to log and collect that data.