Legal Disclaimer: Data privacy is a diverse and ever-changing topic. This makes it nearly impossible to give reliable recommendations to a broad audience. Please consult your company’s legal department on whether the ideas described here are feasible in your jurisdiction.
If there has been one predominant topic in the web analytics space for the last couple of years, it surely is data privacy. GDPR is a thing in Europe, COPPA in the US, ITP on planet Apple, and cookie consent banners on every website. Conducting safe data collection as a global business has become more and more challenging, forcing businesses to be ever more careful.
Because of this landscape, a lot of businesses are looking for a “bullet-proof” way to analyze website users’ behavior. While Google Analytics is a data privacy nightmare, tools like Matomo (formerly Piwik) justify their existence by claiming to be more privacy-friendly, since they can be installed on-premise or in a trusted environment. From my personal experience, this level of confidence can also be reached by using Adobe Analytics with the right contractual setup, but some companies will still want to be extra sure. As a consequence, a lot of businesses use Matomo as their web analytics solution to make their legal team happy.
But there is one problem: while Matomo might be awesome for collecting data in a privacy-friendly way on your own infrastructure, its analysis capabilities are vastly inferior to those of Adobe Analytics. So while you will make your legal team happy, you will hold back your analytics team in a major way, severely hurting your business in the long run. So, what do you do?
Luckily, Adobe gave us Customer Journey Analytics. As a quick reminder: CJA allows us to bring in any data, store it in Adobe’s Experience Platform, and analyze it in Analysis Workspace. Together with Query Service, we can even do things in CJA that I always wanted in “normal” Adobe Analytics. That offers a whole new way to look at this situation: now we can collect our data using Matomo and later bring it into Experience Platform and Customer Journey Analytics. You could do the same with your custom-built analytics tools. Sounds like a fun project, so let’s actually try it out!
Getting data out of Matomo
To start things off, let’s plan how best to get data out of Matomo. Matomo actually has some documentation on how this can be done, offering two ways: we could use the HTTP API, but that would put quite a lot of load on our Matomo instance, especially when there is a lot of live traffic on our website. Luckily, on a self-hosted instance we have another option: pulling the data directly out of the MySQL database that drives Matomo. Let’s write some SQL!
On the same help page mentioned above, Matomo gives us an example of how such a query could look:
SELECT *
FROM matomo_log_visit
  LEFT JOIN matomo_log_link_visit_action ON matomo_log_visit.idvisit = matomo_log_link_visit_action.idvisit
  LEFT JOIN matomo_log_action ON matomo_log_action.idaction = matomo_log_link_visit_action.idaction_url
  LEFT JOIN matomo_log_conversion ON matomo_log_visit.idvisit = matomo_log_conversion.idvisit
  LEFT JOIN matomo_log_conversion_item ON matomo_log_visit.idvisit = matomo_log_conversion_item.idvisit;
Looks easy enough. The matomo_log_visit table holds information about a session, like when it started or ended, what the referrer or browser was, and the entry and exit pages, among other fun stuff. This table is not strictly necessary for us (since we can derive all of that with CJA or Query Service), but we will use it for this example. matomo_log_link_visit_action holds the more interesting data, since it contains all of the clickstream information. Every event creates a row in that table, making it the most valuable one for us. matomo_log_action is a lookup table for on-page actions, which we will use, but matomo_log_conversion and matomo_log_conversion_item will not be part of this first example.
Even without those last two tables, this query gives us a lot of columns (98, to be specific), many of them duplicates. Also, just taking everything without any curation would be bad practice. Luckily, Matomo has some very helpful documentation on what those tables and columns actually mean. For our use case, we want some basic clickstream information, contained in the matomo_log_link_visit_action table, together with some visit data. While we are at it, we also join the lookup tables to our clickstream data, like this:
SELECT *
FROM matomo_log_link_visit_action actionlog
  LEFT JOIN matomo_log_action logactiona ON actionlog.idaction_name = logactiona.idaction
  LEFT JOIN matomo_log_action logactionb ON actionlog.idaction_url = logactionb.idaction
  LEFT JOIN matomo_log_visit visitlog ON actionlog.idvisit = visitlog.idvisit
  LEFT JOIN matomo_log_action logactionc ON visitlog.visit_entry_idaction_name = logactionc.idaction
  LEFT JOIN matomo_log_action logactiond ON visitlog.visit_entry_idaction_url = logactiond.idaction
  LEFT JOIN matomo_log_action logactione ON visitlog.visit_exit_idaction_name = logactione.idaction
  LEFT JOIN matomo_log_action logactionf ON visitlog.visit_exit_idaction_url = logactionf.idaction
  LEFT JOIN matomo_log_action logactiong ON actionlog.idaction_name_ref = logactiong.idaction
  LEFT JOIN matomo_log_action logactionh ON actionlog.idaction_url_ref = logactionh.idaction
Now let’s see which columns we actually want. CJA needs a Person ID and a timestamp at minimum, but there are some other cool columns as well. For example, we will pull in interaction_position, which is equivalent to the Hit Depth dimension in Adobe Analytics. Also, we need to convert the visitor ID from binary to a hex string and convert our timestamps to epoch milliseconds. The query for my little experiment now looks like this:
SELECT
  hex(visitlog.idvisitor) visitorid,
  actionlog.idpageview,
  actionlog.interaction_position,
  UNIX_TIMESTAMP(actionlog.server_time)*1000 TIMESTAMP,
  actionlog.time_spent_ref_action timespent,
  logactiona.name actionname,
  logactionb.name actionurl,
  logactiong.name previousactionname,
  logactionh.name previousactionurl,
  UNIX_TIMESTAMP(visitlog.visit_first_action_time)*1000 visitfirstactiontime,
  UNIX_TIMESTAMP(visitlog.visit_last_action_time)*1000 visitlastactiontime,
  visitlog.visitor_returning,
  visitlog.visitor_count_visits,
  visitlog.visit_total_actions,
  visitlog.visit_total_interactions,
  visitlog.referer_url,
  visitlog.referer_type,
  visitlog.config_device_model,
  visitlog.visitor_days_since_last,
  visitlog.visitor_days_since_first,
  logactionc.name entryactionname,
  logactiond.name entryactionurl,
  logactione.name exitactionname,
  logactionf.name exitactionurl
FROM matomo_log_link_visit_action actionlog
  LEFT JOIN matomo_log_action logactiona ON actionlog.idaction_name = logactiona.idaction
  LEFT JOIN matomo_log_action logactionb ON actionlog.idaction_url = logactionb.idaction
  LEFT JOIN matomo_log_visit visitlog ON actionlog.idvisit = visitlog.idvisit
  LEFT JOIN matomo_log_action logactionc ON visitlog.visit_entry_idaction_name = logactionc.idaction
  LEFT JOIN matomo_log_action logactiond ON visitlog.visit_entry_idaction_url = logactiond.idaction
  LEFT JOIN matomo_log_action logactione ON visitlog.visit_exit_idaction_name = logactione.idaction
  LEFT JOIN matomo_log_action logactionf ON visitlog.visit_exit_idaction_url = logactionf.idaction
  LEFT JOIN matomo_log_action logactiong ON actionlog.idaction_name_ref = logactiong.idaction
  LEFT JOIN matomo_log_action logactionh ON actionlog.idaction_url_ref = logactionh.idaction
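If you want to sanity-check those two conversions outside of MySQL, here is a minimal Python sketch (my own illustration, not part of the actual pipeline) that mimics what hex() and UNIX_TIMESTAMP(...)*1000 do to a raw binary visitor ID and a server_time value:

```python
from datetime import datetime, timezone

def visitor_id_to_hex(raw: bytes) -> str:
    # Matomo stores idvisitor as BINARY(8); MySQL's hex() renders it
    # as an uppercase hex string, which we use as the CJA Person ID.
    return raw.hex().upper()

def to_epoch_millis(server_time: str) -> int:
    # Equivalent of UNIX_TIMESTAMP(server_time) * 1000, assuming the
    # DATETIME column holds UTC.
    dt = datetime.fromisoformat(server_time).replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000

print(visitor_id_to_hex(b"\x1a\x2b\x3c\x4d\x5e\x6f\x70\x81"))  # → 1A2B3C4D5E6F7081
print(to_epoch_millis("2021-05-01 12:00:00"))                  # → 1619870400000
```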
That is plenty of data already. I hope you have either good eyes or a large screen, because this is what my demo data looks like:
Next, we should think of how to get all of that data into Adobe Experience Platform. Of course, I have just the right tool for that.
Pushing data into Adobe Experience Platform with Apache NiFi
AEP gives us two ways to get our Matomo data into a dataset. Theoretically, we could open our MySQL database to external access and let AEP query it directly, but that would be a terrible idea from a security perspective. Instead, we will use the FTP connector in AEP. But how do we get our Matomo data onto that FTP server? With ease, and Apache NiFi!
NiFi is an awesome dataflow tool that lets you crunch through massive lakes of data with a nice visual interface. It’s open source, so you can just go ahead and install it. What we need is a simple flow that queries our database, converts the result to a CSV file, and uploads it to our FTP server. Such a flow could look like this:
The first processor queries our MySQL database. We just put the SQL from above into that processor and schedule it to run once per day. One step below that, we convert the resulting Avro file to CSV and rename the file afterwards. Then the PutFTP processor does exactly what its name says. Initially, I wanted to explain every step, but they are just too straightforward to justify any further words. Now that we have our CSV file on the FTP server, let’s head to Platform!
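If you prefer not to run NiFi, the extract-and-export step can also be scripted. Here is a minimal Python sketch of the query-to-CSV part, using an in-memory SQLite stand-in purely for illustration (a real setup would run the full query from above against the Matomo MySQL database and then upload the resulting file to the FTP server):

```python
import csv
import sqlite3

# Illustrative stand-in for the Matomo MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matomo_log_link_visit_action "
    "(idvisit INTEGER, idpageview TEXT, interaction_position INTEGER)"
)
conn.execute(
    "INSERT INTO matomo_log_link_visit_action VALUES (1, 'abc', 1), (1, 'def', 2)"
)

# Run the extract query (the full query from above in a real setup).
cursor = conn.execute(
    "SELECT idvisit, idpageview, interaction_position "
    "FROM matomo_log_link_visit_action"
)

# Write the result to a CSV file with a header row, as AEP expects.
with open("matomo_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
```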
Storing Matomo data in Adobe Experience Platform
Of course, things have a tendency to get complex with Experience Platform. Before we can ingest any data, we need to create a schema for our dataset. To do this, I created some fields like this:
As you can see, I didn’t bother much with naming here. The fields are just named after the CSV columns. Note that I marked the visitorid as an Identity, so CJA can pick it up later. Next, go ahead and create a new dataset with that schema. Take note of the dataset name, and head over to the Sources section of AEP to create a new FTP connection (or use an existing one) pointing to where you saved your data.
Follow the FTP connection wizard. In my case, AEP was able to map all the fields from my CSV to the XDM automatically:
Review everything and complete the wizard. Once the data is imported, you can confirm it worked by looking at the dataset details:
Nice! Now we only need to get it into CJA.
Ingesting Matomo Data into Customer Journey Analytics
Only two small steps left! First, we need to create a new connection in CJA using our Matomo dataset. Make sure you select the visitorid as Person ID:
Complete the connection wizard and import all existing data. Then, go ahead and create a new Data View for this connection. You can just add all components for our first test:
That’s all we have to do for now. Except one thing: Let’s create a new Project to actually look at our juicy Matomo data in Analysis Workspace. Maybe a nice mix of Matomo and CJA dimensions and metrics, like this?
It really worked! Now we have created a fully automated way to ingest our Matomo data into Customer Journey Analytics to combine the strengths of self-hosted data collection with Analysis Workspace. Neat!
I think there is some great potential in a solution like this. Data from Matomo is just one example of what we could do with a setup like this. In addition to Matomo as a source, we could also add logfiles from our webserver or data from our custom-built analytics solution. Beyond this proof-of-concept, there are a lot more fields to add from Matomo if we wanted to extend this concept even further. Using custom IDs in Matomo, we could even combine Adobe Analytics data with Matomo data using a bit of ID stitching.
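As a sketch of that last idea: assuming we set the same custom user ID in Matomo and in an Adobe Analytics eVar, a Query Service join could look something like this (all dataset and column names here are made up for illustration, not real schema fields):

```sql
-- Hypothetical sketch: join Matomo events to Adobe Analytics events
-- on a shared custom user ID. Names are illustrative only.
SELECT m.visitorid,
       a.post_evar1 AS aa_userid,
       m.actionurl,
       a.pagename
FROM   matomo_dataset m
JOIN   adobe_analytics_dataset a
  ON   m.userid = a.post_evar1;
```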
Again, I personally don’t think Adobe Analytics is more problematic than Matomo from a legal perspective, but that is for your legal team to decide. Thankfully, CJA gives us a way to take the best part of Adobe Analytics and use it with any data. Let me know what you think about this post, and have a nice day!