Web Analytics with Adobe’s Customer Journey Analytics, Part 5: Basic Data Processing in Query Service
This post is the fifth post of the eight-part series Web Analytics with Adobe’s Customer Journey Analytics, showing how web sites can be analyzed better using Adobe’s next evolution of Adobe Analytics. In the previous post, we took a look at doing the implementation using Adobe Launch, the Adobe Web SDK, and the Adobe Client Data Layer. In this post, we are going to process some of the basic data we need for our web analytics use case using Query Service in Experience Platform.
This series of posts is coming along quite nicely. If you have followed all the previous posts until now, you now have a functioning Web SDK implementation that tracks your data into Experience Platform following the Experience Data Model schema we have tailor-made for our use case. Nice! Now we are ready to feed our data into Customer Journey Analytics, right?
Well, we could. If we are just interested in the plain event data we are collecting, we could just pour it into Customer Journey Analytics without any further work. But our goal is a bit different, since we want to build a full-fledged web analytics solution, which needs to have more than just the raw stream of events. And while we will do some fun stuff using Attribution IQ in CJA, there are just some things we can’t do with Customer Journey Analytics alone.
Specifically, we can’t use anything that Adobe Analytics stores in the Visitor Profile. I assume you also don’t want to sacrifice things like the Visit Number or the Entry and Exit Page reports. Those things are pretty much a no-brainer in any modern web analytics tool, but remember that Customer Journey Analytics is not just a web analytics tool but much more versatile and generalized in what it can do with data. Luckily, Experience Platform has Query Service, which allows us to do everything we want if we put in the work to crunch the data ourselves. And that is just what we are going to do today!
Extending our Experience Data Model Schema
Before we can start adding any data to our raw dataset, we need to extend our XDM schema to actually hold all those new fields. There are a ton of things we could add, but to keep this first example concise I’ve added two parts. First, let’s look at the webPageDetails object, which has been extended by a few new fields:
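To give you an idea of the structure, here is a rough sketch of the extended webPageDetails object. The exact field names are my own choice for illustration; in your schema you would define them as a field group with whatever names fit your conventions:

web.webPageDetails
├── name                    (string)  existing: the page name
├── nextPage                (string)  page viewed after this one
├── previousPage            (string)  page viewed before this one
├── timeSpent               (integer) seconds until the next page view
├── pageViewNumber          (integer) number of this page view in the session
└── inversePageViewNumber   (integer) page views remaining after this one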
As you can see, we now have both the next and the previous page on every hit. We also have the time spent, which will hold the time until the next page view (not just the next hit). To round things off, we have a page view counter and an inverse page view counter, which hold the number of a given page view within a session as well as how many page views happen after it. Quite cool, right?
Next, we need to also include information about the session. I’ve extended my schema like this on the top level:
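Again as a sketch (field names are illustrative, the grouping follows the description below), the session object could look like this on the top level:

session
├── meta
│   ├── depth          (integer) number of the current hit in the session
│   ├── inverseDepth   (integer) hits remaining in the session
│   ├── num            (integer) session number, like the Visit Number
│   └── timeSpent      (integer) session duration in seconds
├── entryPage          (string)  first page of the session
├── exitPage           (string)  last page of the session
├── pageViews          (integer) page views in the whole session
└── distinctPages      (integer) number of distinct pages viewed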
There is quite a bit to unpack here. First, you can see that I have a session.meta object that holds the Hit Depth (as we know it from Adobe Analytics), which just counts the number of the current hit in a session, as well as the inverse of that (counting how many hits are left in a session). In addition to that, we have the session number (same as Visit Number in Adobe Analytics) and the time spent per session.
Second, there are even more fields holding session-level information about the pages in a session. Starting with the simplest, we will have entry and exit page dimensions. We will also get something we have wanted from Adobe Analytics for quite some time: different pathing options. To round things off, we will have the number of page views in the whole session and the number of distinct viewed pages. Super cool, right?
Recreating session information
Luckily we can take quite a bit of information from my basic and advanced posts about Query Service to get the SQL we need. To get things like entry and exit pages right, we first need to recreate the session information. To do that, we can use the SESS_TIMEOUT Adobe Defined Function like this:
select
*,
SESS_TIMEOUT(timestamp, (60 * 30)) OVER (PARTITION BY identityMap.ECID ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session
from frederiks_website_dataset
In reality, you would replace the name of the table with your own (mine is named frederiks_website_dataset), but the rest should just work if you followed my XDM structure. That gives us a session object in the resulting table, containing the time since the last hit, session number, first hit of a session and number of that hit in the session. For the next queries, we are going to wrap this first one so we have the session information available as well.
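For reference, the session struct returned by SESS_TIMEOUT exposes (to the best of my knowledge of the Adobe Defined Functions, so double-check against the documentation) the following fields, which the later queries rely on:

select
timestamp,
session.timestamp_diff, -- seconds since the previous hit of this visitor
session.num,            -- session number for this visitor
session.is_new,         -- true on the first hit of a session
session.depth           -- number of this hit within the session
from (
select
*,
SESS_TIMEOUT(timestamp, (60 * 30)) OVER (PARTITION BY identityMap.ECID ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session
from frederiks_website_dataset
)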
Next, let’s recreate the entry- and exit page reports from Adobe Analytics. Those dimensions are defined at a session level (as in: they don’t change from hit to hit in a session) and show the first and last page of a Visit on every hit. To do that, we can use the attribution_first_touch and attribution_last_touch functions that Adobe provides in Query Service:
select *,
attribution_first_touch(timestamp, '', web.webPageDetails.name) OVER (PARTITION BY identityMap.ECID, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value "session_entry_page",
attribution_last_touch(timestamp, '', web.webPageDetails.name) OVER (PARTITION BY identityMap.ECID, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value "session_exit_page"
from (
select
*,
SESS_TIMEOUT(timestamp, (60 * 30)) OVER (PARTITION BY identityMap.ECID ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session
from frederiks_website_dataset order by timestamp asc
)
There are multiple things happening here. Notice how I first wrap the previous query as the source for the current one, since I need the session information to already be available. Then I use the Adobe Defined Functions to look for the web.webPageDetails.name field (which holds the page name) as the first and last value in a session. As a result we get two new columns, session_entry_page and session_exit_page, which, if there was a page view in the session, now hold the first and last page. Super awesome! If you want to know more about those functions, you can go back to my previous post about Query Service.
To get the other session information that can be expressed as aggregations, like the time spent, we can use some fancy window functions, for example to get the minimum and maximum timestamp of a session and then derive the session duration from those. That could look like this:
min(timestamp) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session_timestamp_min
,
max(timestamp) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) session_timestamp_max
Later on, we are going to subtract the min from the max to get the session duration. But before we do that, let’s compute the last few fields for this post: the maximum hit depth, the number of page views, and the number of distinct pages in a session:
max(session.depth) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) session_depth_max
,
sum(web.webpagedetails.pageviews.value) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) session_page_views
,
approx_count_distinct(web.webpagedetails.name) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) session_distinct_pages
The first expression gives us the highest hit number of the session, while the second and third give us the sum of page views and the count of distinct pages in a session.
Now with all that done, it’s time to create the last metric for this post: the session duration. To do that, we subtract the smallest timestamp of a session from the largest one (marking the first and last hit) and write the result into another column. Once again we wrap our previous query to be able to reference its columns. My final query for this post now looks like this:
SELECT
*
, (to_unix_timestamp(session_timestamp_max) - to_unix_timestamp(session_timestamp_min)) session_duration
FROM
(
SELECT
*
, attribution_first_touch(timestamp, '', web.webpagedetails.name) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value session_entry_page
, attribution_last_touch(timestamp, '', web.webpagedetails.name) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value session_exit_page
, min(timestamp) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session_timestamp_min
, max(timestamp) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) session_timestamp_max
, max(session.depth) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) session_depth_max
, sum(web.webpagedetails.pageviews.value) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) session_page_views
, approx_count_distinct(web.webpagedetails.name) OVER (PARTITION BY identityMap.ecid, session.num ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) session_distinct_pages
FROM
(
SELECT
*
, sess_timeout(timestamp, (60 * 30)) OVER (PARTITION BY identityMap.ecid ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session
FROM
frederiks_website_dataset
ORDER BY timestamp ASC
)
)
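As a quick sanity check, it can help to look at a few rows of the result before feeding anything into Customer Journey Analytics. A sketch (replace the subquery placeholder with the final query from above):

select
timestamp,
session.num,
session.depth,
session_entry_page,
session_exit_page,
session_duration
from ( /* final query from above */ )
order by timestamp asc
limit 10

If the sessionization worked, session.depth should count up within each session, the entry and exit pages should stay constant across all hits of a session, and session_duration should be zero for single-hit sessions.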
Conclusion
I hope I’m not the only person who really enjoys figuring out how to create those data points in Query Service. It’s really a lot of fun for me! Once the data is collected and available in Query Service, we can do some pretty amazing things with it.

German Analyst and Data Scientist working in and writing about (Web) Analytics and Online Marketing Tech.