Wow, that’s a long title for a post. As you might have heard, I recently got the chance to dive into Adobe’s Experience Platform since my company got provisioned for Customer Journey Analytics. That means that I now have all my Adobe Analytics data in Platform, ready for Query Service to enrich it.
I think I will do another post on how I amp up my Analytics game by using both Query Service and Customer Journey Analytics together. But this post will give some examples on how to use Query Service in general and how it interacts with Experience Platform. It will be a quick tutorial and provide some use cases for the custom Adobe Defined Functions (ADFs). At the end, you might end up with long SQL queries like I do:
How Query Service works and interacts with Platform
So what actually is Query Service? It provides a SQL interface to interact with datasets stored in Platform. This interface can be used from the Platform UI or with a normal SQL client. On the surface level it utilizes a PSQL syntax for commands, offering the framework regarding syntax and a base level of functionality, like Joins or Unions. Below that surface, Adobe utilizes Spark SQL for some commands beyond normal PSQL to accommodate the big-data-like environment (which also means our queries will take a while to process). To provide some additional functionality there are some Adobe Defined Functions (documented on two pages, here and here) that offer some advanced functions most relevant to Adobe Analytics users.
All our datasets in Platform are represented as tables in Query Service. To start crunching our data, we need to find the table name for those datasets, which is shown on the dataset detail page in platform:
As any SQL table, our tables also have a table schema associated with them. And you guessed right, it strictly follows the dataset’s Schema defined in Platform. So our data will have the structure, column name and data types that are defined there. Once we write data back to a dataset, we also have to follow that Schema. I won’t go into a full SQL tutorial here but rather show you some of my favorite use cases for the Adobe Defined Functions mentioned above.
Recreating Visits using Sessionization
One of the surprises I encountered when starting with Platform is that the Adobe Analytics Connector (ADC) does not include the Visit Number dimension (which is one of my beloved favorites) although the Schema has a key for it. When I pointed this out to Adobe, they referred me to the Sessionization features of Query Service. As I will show, this is not equivalent at all but still very cool, especially for some more creative use cases.
Quoting from the documentation, the syntax for this function looks like this:
SESS_TIMEOUT(timestamp, timeout_in_seconds) OVER ([partition] [order] [frame])
We can see that it is a window function with quite a bunch of parameters to tune. It will return a Struct with four fields, timestamp_diff (containing the number of seconds since the start of the Visit), num (containing the Visit Number), is_new (containing a boolean value for the first hit of a Visit), and depth (containing the number of a hit in a Visit), giving us everything we need to recreate Visits from Adobe Analytics.
If we want a simple session for a given user, we could use a query like this, where our table contains a timestamp and user_id column:
sess_timeout(timestamp, (60 * 30)) OVER (PARTITION BY user_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) session
Let’s go through this. The parameters of the sess_timout function are pretty intuitive. They tell Query Service to define a session based on the timestamp of our rows and allow up to 30 minutes (60 seconds times 30) between two rows as the session limit. The next part is a bit more complicated, since we have to tell the sess_timeout function how we want to define a User (or Visitor). To achieve this, we need to partition our data by the user_id column (returning only the data for the same user_id that the current row has) and order it by timestamp in ascending order. The last part defines that we only want to look at the data before the current row, so we don’t end up looking in the future.
This will give us a result like this:
This shows two things: For three rows I assumed timestamps in Platform were in seconds, while they actually are in milliseconds. And our sessionization works nicely, giving us the four values we expect. Now we successfully have recreated Visits in Platform, neat! But there is a lot more we can do with this function by modifying how we window our data. For example, we could just add the page dimension like this:
sess_timeout(timestamp, (60 * 30)) OVER (PARTITION BY userid, page ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) pagesession
Instead of just partitioning our data by userid, I now threw the page column in there as well. That way we get a new session whenever the user moves to a new page. But why would we want this? This would not replace our normal session but provide some very interesting insights, like: What is the 3rd action our users take on a certain page? How long does it take them to find a certain popup on a page? Here are some examples on how to partition data in a more creative way:
- Partition by userid and Form name for form tracking: This would show in which order our users fill out our forms and how long each step takes. We would also have the number of the form “fillout-session” within a Visit where multiple forms are used.
- Partition by userid and Ecommerce product: What is the 2nd action our users do with a product in succession? Do they remove it from the cart immediately after adding it?
- Partition by userid and video name: What are our users doing with our video content? Is the second action a pause or closing the video? How long do they keep a video paused before resuming it?
All of this can be done by applying the concept of a session in a more creative way. Let me know which criteria comes to your mind!
Recreating entry- and exit Dimensions
Another thing that is missing from the ADC data is some of my beloved Props functionality. In the data we get, there is no entry- or exit dimension for any of our props, which is of course really sad. As a reminder: Those dimensions hold the first and last value a dimension received in a session, with entry page and exit page being a prime example.
Luckily, we can use another Adobe Defined Function for this. They have built some functions that are intended to do attribution that we can use for that use case. In particular, we are going to use the ATTRIBUTION_FIRST_TOUCH function, which looks like this:
ATTRIBUTION_FIRST_TOUCH(timestamp, [channel_name], column) OVER ([partition] [order] [frame])
We already know most of the parameters from before. What is new are the channel_name and column values, which tell Query Service which column we want to use for attribution. In our case, we want to put the “page” column in there and modify the frame from before like this:
attribution_first_touch(timestamp, '', page) OVER (PARTITION BY userid, session_nr ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value "Session_entry_page"
Let’s look at our partition a bit closer. As you can see, I included both the userid and the session_nr, so we only look for the entry page during the actual session. In addition to that, the frame is now defined as “BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING”, which means we look at the complete data for the whole session. By removing the “CURRENT ROW” limitation, we can get the first page of a session without the need to know it on the first hit of the session.
If we wanted to receive the last page of a session instead, we just need to use the ATTRIBUTION_LAST_TOUCH function instead:
attribution_last_touch(timestamp, '', page) OVER (PARTITION BY userid, session_nr ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value "Session_exit_page"
This will do the same as the previous function, but take the last found value instead of the first. But there again are some cool use cases beyond those! For example, consider this example:
attribution_first_touch(timestamp, '', campaign) OVER (PARTITION BY userid ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value "User_entry_campaign"
With this example, we remove the session limitation (in the PARTITION BY part) to get the first campaign value for a user, even before the user used it! That way we get entry- and exit dimensions, but on a user level, without session restrictions! As a Props fanboy I get really excited by this. What comes to your mind for this functionality?
Next and previous page dimensions
One of the things that is still missing from Adobe Analytics’s Analysis Workspace are next and previous page dimensions to allow for easy pathing analysis. There is a way to create those reports in Workspace, but can’t we have this as a real dimension? Sure, with Query Service! We can use the NEXT and PREVIOUS functions from Adobe for this.
Again, this is a window function that allows us to be quite flexible on what we want. The syntax looks like this:
NEXT(key, [shift, [ignoreNulls]]) OVER ([partition] [order] [frame])
or for the previous function:
PREVIOUS(key, [shift, [ignoreNulls]]) OVER ([partition] [order] [frame])
So what do we have here? With the key parameter, we set the column for which we want the next or previous value. If we want the next value from the page column, we just put that column there. We can even set how far we want to go into the past or future by modifying the shift parameter from the default value of 1. The partitioning can be modified as described above to look at a session, user, or anything else. For example, the next and 2nd next page can be created like this:
NEXT(page,1,true) OVER (PARTITION BY userid, session_nr ORDER BY timestamp ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING).value "next_page_session" NEXT(page,2,true) OVER (PARTITION BY userid, session_nr ORDER BY timestamp ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING).value "second_next_page_session"
That’s a very convenient way to get this very useful information. Again, we don’t need to limit ourselves to sessions, users, or the page dimension. For example, why not do this on a user level for campaigns or marketing channels? The possibilities are endless!
Writing data back to Platform
Once we have run all our fancy queries we will want to store the result in Platform. This is possible but a bit complicated, depending on your usecase. If you only want to store the result for later use, you can just click on a saved or logged query and select “Output dataset”:
This opens a dialog to name our dataset and runs the Query again, but stores the output instead of just displaying it. Since all datasets in Platform need a Schema, Query Service will create an Adhoc XDM Schema:
I find this to be a double edged sword. While it is very convenient to be able to store data on the fly, we can neither see nor edit the Schema. Tools like Customer Journey Analytics need a defined Schema to know which columns hold the timestamp and identity for the Person ID. There are two ways we can tackle this issue.
One way is to create a new dataset and insert our resulting data into that dataset. The official documentation describes the syntax like this:
INSERT INTO table_name select_query
Easy enough, right? Yes, but there is something we need to watch out for. We need to closely match the Schema associated with that dataset. I actually needed some support from Adobe to find out how Query Service expects the structure to be, since all our custom defined fields in a schema end up in a namespace unique to our company. So our Schema might look like this:
Because of this we need to put all our fields in the _mycompanyid namespace to match the Schema. We need to do this using a STRUCT-structure to our query:
INSERT INTO testdataset SELECT STRUCT(...) as _mycompanyid, monotonically_increasing_id() "_id", 'custom' "eventType", sessiondata.timestamp "timestamp" FROM
Within that STRUCT, we need to put our columns in exactly the right order with exactly the right type. This is also true for the second way to store data, which is by creating a new dataset with an existing schema like this:
CREATE TABLE enrichedstructtest WITH (schema='Enriched TS Schema Demo') AS (SELECT STRUCT(...) as _mycompanyid ...)
Now we have a new dataset called enrichedstructtest with the Schema “Enriched TS Schema Demo”. Figuring this out can take quite some time since Platform is not always super helpful with its error messages. While syntax errors return immediately, type errors are much harder to find, since they only show up like this:
I was quite puzzled when I received that “Failed to get results of query Reason: [Writing job aborted.]” error for the first time. But there is a (quite hidden) way to find the reason. You need to head to Monitoring -> Set the filter to “Only ingest failures” -> Click you last executed query. It will not have a dataset name, since the data could not be written:
Once you click the Failures link, you have to quickly take a screenshot of the following page, since it will autoclose itself within seconds (due to the missing dataset. Thanks, Adobe…) automatically. If you are fast enough, you end up with two contradicting statements:
On the left side, Adobe asks us to contact support. But on the right side, they actually tell us what went wrong. In this case I tried to write a boolean value into a string field (for demo purposes of course, I would never actually make this mistake…), which is not allowed. Once we fix this, our query goes through without problems, or you will get a new error message.
I hope you found this article helpful. You should be able to create some advanced fields with Query Service and write it back to an existing or new dataset. Also you now know how to find out what went wrong if datasets can’t be written back successfully.
In the next post, I will go a bit further and show how to build everything I’ve ever wanted from Adobe Analytics in Customer Journey Analytics based on queries from Query Service. Stay tuned!