Building an Enterprise Grade OpenSource Web Analytics System – Part 3: Data Collection

This is the third part of a seven-part series explaining how to build an Enterprise Grade OpenSource Web Analytics System. In this post we are setting up the tracking backend with Nginx and Filebeat. In the last post we took care of the client-side implementation of Snowplow Analytics. If you are new to this series, it might help to start with the first post.

Now that our clients are sending plenty of data, we need a backend that takes care of all the events we want to collect. Since we are sending our requests unencoded via GET, we can simply configure our web server to write every request to a logfile and ship those logs off to the processing layer.

Configuring Nginx with Filebeat

In our last project we used a configuration just like the one we need here. As web server, we used and will use Nginx for its awesome performance and flexible log configuration. To enable scalability, we can run multiple servers and balance them via DNS load balancing. Nginx itself can act as an additional load balancer if we ever need more capacity. But we will keep things easy for Nginx, so one server should be plenty for the start.
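To give a rough idea of what that could look like later, an upstream block in front of several collectors might be sketched like this (the hostnames are placeholders, and the block would live inside the http section):

upstream tracking_servers {
    server tracker1.example.com:8888;
    server tracker2.example.com:8888;
}
server {
    listen 80;
    location / {
        proxy_pass http://tracking_servers;
    }
}

For now, though, a single server is all we need.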

Let’s start with our Nginx config. The first part is the basic setup to tell the server how many worker processes we need. This config is perfectly fine for our experiment but should be customized in production:

worker_processes  1;
events {
    worker_connections  1024;
}
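
For production, a common starting point (the numbers here are only illustrative) is to let Nginx spawn one worker per CPU core and allow more connections per worker:

worker_processes  auto;
events {
    worker_connections  4096;
}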

Next, let’s set up the http section of the config. The most important part is the log format, so let’s define the one we will use for our project:

log_format  tracking escape=json '{"remote_addr":"$remote_addr",'
    '"scheme":"$scheme",'
    '"msec":"$msec",'
    '"host":"$host",'
    '"server_name":"$server_name",'
    '"time":"$time_iso8601",'
    '"status":"$status",'
    '"size":"$body_bytes_sent",'
    '"query_string":"$query_string",'
    '"request_body":"$request_body",'
    '"uri":"$uri",'
    '"referer":"$http_referer",'
    '"user_agent":"$http_user_agent",'
    '"proxy_ip":"$http_x_forwarded_for",'
    '"request_id":"$request_id"}';
access_log  logs/access-tracking.log  tracking;

Those lines define a custom log format called “tracking”, and the access_log directive tells Nginx where to save the logfile. Inside the format we include everything we want to end up in the logfile. The most important field is query_string: it carries the variables that Snowplow sends to the backend. If we were using POST, request_body would capture that payload as well. The other fields are standard parameters typically found in a logfile.

Now all we need is to define the actual endpoint for our tracking events. For now, we will be using our own computer on port 8888, so that’s what we define:

server {
    listen       8888 default_server;
    server_name  localhost;
    location / {
        empty_gif;
    }
}

The empty_gif directive is especially interesting, since it tells Nginx to return an empty 1×1 pixel GIF image on every request. It wouldn’t be web tracking without a pixel, right?
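
If you want to check the endpoint before wiring up any tracking, a quick request against it should already come back as that tiny GIF. Here is a throwaway check in Python (assuming the server block above is running locally on port 8888 and the requests package is installed; the path and parameters are arbitrary, since location / matches everything):

import requests

# Hypothetical smoke test against the local collector endpoint
resp = requests.get("http://localhost:8888/analytics/i", params={"e": "pv"})
print(resp.status_code)                  # 200
print(resp.headers.get("Content-Type"))  # image/gif
print(len(resp.content))                 # 43 bytes - Nginx's built-in 1x1 GIF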

All in all, this is everything we need for our initial Nginx config:

worker_processes  1;
events {
    worker_connections  1024;
}
http {
    log_format  tracking escape=json '{"remote_addr":"$remote_addr","scheme":"$scheme","msec":"$msec","host":"$host","server_name":"$server_name","time":"$time_iso8601","status":"$status","size":"$body_bytes_sent","query_string":"$query_string","request_body":"$request_body","uri":"$uri","referer":"$http_referer","user_agent":"$http_user_agent","proxy_ip":"$http_x_forwarded_for","request_id":"$request_id"}';
    access_log  logs/access-tracking.log  tracking;
    server {
        listen       8888 default_server;
        server_name  localhost;
        location / {
            empty_gif;
        }
    }
}

Now that we have that config in place, Nginx will write nice log entries like this for our Snowplow requests:

{
  "remote_addr": "127.0.0.1",
  "scheme": "http",
  "msec": "1586767599.001",
  "host": "127.0.0.1",
  "server_name": "localhost",
  "time": "2020-04-13T10:46:39+02:00",
  "status": "200",
  "size": "43",
  "query_string": "stm=1586767599000&e=pp&url=http%3A%2F%2F127.0.0.1%2Fsp%2Findex.html%23&page=my%20custom%20page%20title&refr=http%3A%2F%2F127.0.0.1%2Fsp%2Findex.html&pp_mix=0&pp_max=0&pp_miy=0&pp_may=0&tv=js-2.12.0&tna=webtracking&aid=testpage&p=web&tz=Europe%2FBerlin&lang=de-DE&cs=windows-1252&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=0&f_java=0&f_gears=0&f_ag=0&res=1680x1050&cd=24&cookie=1&eid=6bb54901-c7d4-4bbd-8d49-a95248ce8843&dtm=1586767598999&vp=1125x957&ds=1125x957&vid=9&sid=ad5f99b5-9aeb-4a24-af79-1095a4237d9f&duid=6ad7e729-3392-4b9e-9663-f3ef9e49831d&fp=557407330&co=%7B%22schema%22%3A%22iglu%3Acom.snowplowanalytics.snowplow%2Fcontexts%2Fjsonschema%2F1-0-0%22%2C%22data%22%3A%5B%7B%22data%22%3A%7B%22a%20Date%22%3A%22Mon%20Apr%2013%202020%2010%3A46%3A08%20GMT%2B0200%20(Mitteleurop%C3%A4ische%20Sommerzeit)%22%2C%22a%20String%22%3A%22Hello%20there!%22%7D%7D%2C%7B%22schema%22%3A%22iglu%3Acom.snowplowanalytics.snowplow%2Fweb_page%2Fjsonschema%2F1-0-0%22%2C%22data%22%3A%7B%22id%22%3A%22a664552f-9cac-4228-b25f-d23ecb87ebd3%22%7D%7D%2C%7B%22schema%22%3A%22iglu%3Aorg.w3%2FPerformanceTiming%2Fjsonschema%2F1-0-0%22%2C%22data%22%3A%7B%22navigationStart%22%3A1586767568195%2C%22unloadEventStart%22%3A1586767568202%2C%22unloadEventEnd%22%3A1586767568202%2C%22redirectStart%22%3A0%2C%22redirectEnd%22%3A0%2C%22fetchStart%22%3A1586767568195%2C%22domainLookupStart%22%3A1586767568195%2C%22domainLookupEnd%22%3A1586767568195%2C%22connectStart%22%3A1586767568195%2C%22connectEnd%22%3A1586767568195%2C%22secureConnectionStart%22%3A0%2C%22requestStart%22%3A1586767568197%2C%22responseStart%22%3A1586767568199%2C%22responseEnd%22%3A1586767568199%2C%22domLoading%22%3A1586767568204%2C%22domInteractive%22%3A1586767568211%2C%22domContentLoadedEventStart%22%3A1586767568211%2C%22domContentLoadedEventEnd%22%3A1586767568211%2C%22domComplete%22%3A1586767568226%2C%22loadEventStart%22%3A1586767568226%2C%22loadEventEnd%22%3A1586767568227%7D%7D%5D%7D",
  "request_body": "",
  "uri": "/analytics/i",
  "referer": "http://127.0.0.1/sp/index.html",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
  "proxy_ip": "",
  "request_id": "e28544ed81f62d06162137eaf11a0a44"
}
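
Just to show how the processing layer will later make sense of the query_string field, the Snowplow parameters can be unpacked with plain URL decoding. A throwaway sketch in Python, using a shortened excerpt of the query string above:

from urllib.parse import parse_qs

# Shortened excerpt of the query_string field from the log entry above
query_string = "stm=1586767599000&e=pp&url=http%3A%2F%2F127.0.0.1%2Fsp%2Findex.html%23&page=my%20custom%20page%20title"

params = {key: values[0] for key, values in parse_qs(query_string).items()}
print(params["e"])     # "pp" -> a Snowplow page ping event
print(params["page"])  # "my custom page title"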

With that config in place and validated, let’s tell Filebeat to send that information to Kafka. This could be done with a config like this:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - [Path to Nginx]\logs\access-tracking.log
  json.add_error_key: true
output.kafka:
  hosts: ["[Kafka Host and Port]"]
  key: '%{[json.host]}-%{[json.request_id]}'
  topic: 'tracking_raw'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000

There is nothing special going on here. The paths setting tells Filebeat where our logfile is stored, and the output.kafka section configures our connection to Kafka; fill in your connection settings as needed. If you want to, you can use the key setting to generate an ID for Kafka based on the server host and the Nginx request ID, which makes for a nice and unique identifier. The most important part is the topic setting, where we define which Kafka topic should receive our data. We will use “tracking_raw” for this project.
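
If you want to double-check that events really arrive, a tiny consumer is enough. A minimal sketch using the kafka-python package (pip install kafka-python), assuming Kafka listens on localhost:9092; adjust the bootstrap servers to your setup:

from kafka import KafkaConsumer

# Read the tracking_raw topic from the beginning and print each raw message
consumer = KafkaConsumer(
    "tracking_raw",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value[:200])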

Now we are done! Once you open the webpage from the last post with everything in place and running, you should see the log lines being written to our Kafka topic. In the next post we are going to continue working with that data.