Cool Approximate Count Distinct Use Cases – Adobe Analytics Tips
One of the things that really sets Adobe Analytics apart from other solutions is the ability to create sophisticated Calculated Metrics and Segments on the fly. You don’t need to be a highly trained Analyst or Data Scientist to create your very own set of Measures and Dimensions unique to your business question. The best thing for me personally is that we can create those metrics from the same interface where we do our day-to-day analysis and reporting. It doesn’t matter if we want to quickly create an average or build advanced time series analysis dashboards, it’s all right there at our fingertips.
Today I want to tell you about one of my personal favorite functions: Approximate Count Distinct. It lets us count how many different values of a dimension we tracked and use that number in both Calculated Metrics and Segments (making this function the closest thing we have to calculated metrics in segments). To my surprise, not many people have heard about it, let alone used it. That is why I wanted to write this post: to make the function more widely known and give some nice example use cases. But before we get to those, let’s start with some technical details.
Understanding counting functions in Adobe Analytics
To really grasp how this function works, we need to set it apart from the “normal” ways of counting in Analytics. The simplest form of counting does not care about which item we count. This is what we do with metrics like Page Views, where we just count the number of times our users loaded a page on our website. In this simple case, we do not care which page is being counted: 100 Page Views can mean opening 1 page 100 times or 100 pages 1 time each. A step further is the Reloads metric, which counts how often a page is tracked again without the Page Name changing from the last Page View. eVars have Instances, which count how often a value is set. In addition, we can configure an event to be counted only once per Visit, or just use the Visits or Unique Visitors metric to get something similar to Google Analytics’ Unique Events.
Within Calculated Metrics, we have two kinds of options for counting things: the “normal” (basic) functions and the Advanced functions.
Before we get to the Advanced function, I will quickly go over the two basic counting functions. They simply count how many rows your current Freeform Table returns, with the Count() function giving you the option to only count rows with a metric above 0. This is an important distinction from the Approximate Count Distinct function, which does not count rows but the unique values of a dimension within each row’s item. If you want to learn more about those two simple forms of counting, you may want to check out my previous post about those use cases.
Now on to the main course: How does the Approx. Count Distinct function work exactly? To explain, let’s look at this simple table of activity from a website user:
As we can see, this user has had three Page Views on two different pages. This second information is exactly what the Approx. Count Distinct function gives us: It counts how many different (or: distinct) values from a given dimension our users have used. In the Calculated Metrics Editor, we therefore have to decide on which dimension items we want to count. In this example, I count how many different pages were tracked:
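To make the difference between the two ways of counting concrete, here is a minimal Python sketch. The hit rows and page names are made up for illustration and only mirror the situation described above (three Page Views on two different pages); this is not how Analytics stores or processes data.

```python
# Hypothetical hit-level rows for one user; page names are invented.
hits = [
    {"visitor": "user-1", "page": "Home"},
    {"visitor": "user-1", "page": "Product Detail"},
    {"visitor": "user-1", "page": "Home"},
]

# Page Views: every row counts, no matter which page it was.
page_views = len(hits)

# Count Distinct: each page value counts only once.
distinct_pages = len({hit["page"] for hit in hits})

print(page_views)      # 3
print(distinct_pages)  # 2
```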
Now we can use this metric to see how many pages are in a given Site Section, like this:
This is a nice example which we will explore further below. But first we should ask ourselves: Why is this an approximate count? The reason for that is quite simple: Analytics wants to be fast. Not only fast, but reliably fast. That requirement makes it nearly impossible to go through (literally) every single row of data in years of data while keeping track of how many dimension items were already counted and incrementing the count if a new item is found. Because of this, Adobe uses a HyperLogLog method of counting, returning results which are at least 95% accurate in at least 95% of requests. One of the consequences is that we can add decimal places to our metric and actually get a result:
There is nothing wrong with this detail, but it is something to keep in mind. We will always get results within the boundaries mentioned above, although larger values likely mean more deviation in absolute terms. Another detail is that this value is evaluated with every request. While Adobe does a great job at keeping values constant over time, you might see some variation in historical data from one day to the next. This behavior is more familiar to Google Analytics users who are used to sampling, but it can also happen with Adobe Analytics. One great detail though: When using this functionality in segments, the count is actually always precise (thanks Jen Lasser for pointing that out!). Now that we have understood those details, let’s go into the actual use cases!
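For readers curious about what a HyperLogLog-style count looks like, here is a simplified textbook sketch in Python. It is an illustration only: the class, its parameters, and the choice of hash are my assumptions, and Adobe’s actual implementation is not described here. The idea is to hash every value, keep only a small, fixed number of registers, and estimate the distinct count from them instead of remembering every value seen.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch; NOT Adobe's implementation."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision            # number of registers (4096 here)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash of the value
        index = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the first 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"page-{i}")
    hll.add(f"page-{i}")   # seeing a value again never changes the registers

estimate = hll.estimate()
print(estimate)            # a float near 50,000, not an exact integer
```

The estimate is a floating point number, which fits the observation that the metric can show decimal places. The memory used stays the same no matter how many values are added.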
Example 1: Bringing fairness to performance analysis
For my first example, imagine we run the website for a company selling products online while also providing information about those products alongside some client care. Now we might ask ourselves how those different sections of our website are performing relative to each other, defined as the number of Page Views on each Site Section. With the Approx. Count Distinct function, we are able to bring a completely new view to this analysis by showing how many actual pages each section has and how many Page Views each page receives on average. For this example, we simply use the Page Views metric together with the Approximate Count Distinct of the Page dimension as described above. In Workspace, we can then simply divide Page Views by Distinct Pages:
This table clearly shows how the help pages receive a lot less traffic in total, making them seem less relevant, but only because there are far fewer of those pages compared to product pages! In fact, the relative performance of those help pages exceeds that of the care section, which performs better in total. We can do the same with marketing Campaigns, by counting how many individual tracking codes we tracked for each campaign. Consider this analysis:
We can see two things here: The Product Campaign is driving the majority of traffic to our website, but it is also the one with the most individual ads. In terms of relative performance, our Customer Care Campaign is far superior when it comes to Unique Visitors per Tracking Code, showing the quality of each individual ad. Maybe our Sales team can learn something from how our Care team creates their ads. Commence communication!
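The arithmetic behind this kind of “fair” comparison can be sketched in a few lines of Python. The rows below are invented, and the sketch uses an exact distinct count instead of Adobe’s approximation. It shows the Site Section case; the campaign case follows the same pattern with tracking codes in place of pages and Unique Visitors in place of Page Views.

```python
from collections import defaultdict

# Hypothetical Page View rows: (site section, page name).
page_view_rows = [
    ("Products", "Product A"), ("Products", "Product B"),
    ("Products", "Product C"), ("Products", "Product A"),
    ("Products", "Product B"), ("Products", "Product C"),
    ("Help", "FAQ"), ("Help", "FAQ"), ("Help", "FAQ"),
]

views = defaultdict(int)   # Page Views per section
pages = defaultdict(set)   # distinct pages per section
for section, page in page_view_rows:
    views[section] += 1
    pages[section].add(page)

# Page Views divided by distinct pages, per section.
views_per_page = {s: views[s] / len(pages[s]) for s in views}
print(views_per_page)  # {'Products': 2.0, 'Help': 3.0}
```

In this made-up data, the Help section has fewer Page Views in total but more Page Views per page.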
Example 2: Normalizing values over time
Another challenge in our daily lives is how to compare date ranges of different lengths. For example, how did we do last week compared to last month? To answer this question, we need some form of normalization by creating averages. I already did a post on how to get the old Unique Visitors per Day in Analysis Workspace, so I will keep this example rather short. Since we can throw literally every dimension into the Approximate Count Distinct function, we can just use the built-in time dimensions like Day:
Once we have this Days metric, we can use it in a table to normalize performance over date ranges of different length. Consider this table:
With this nice Page Views per Day metric, we can clearly see that our Page Views last week were actually a bit lower than last month’s average. Knowing this allows us to detect trends more quickly and respond to them immediately. Also, there is no more discussion about whether this month’s sales are lower because the month has one day less, since we can just account for that now.
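As a rough illustration of this normalization, the following Python sketch divides total Page Views by the number of distinct days in a date range. The function name and all numbers are hypothetical; it only demonstrates the arithmetic, not the Workspace setup.

```python
from datetime import date, timedelta


def page_views_per_day(daily_rows):
    """Total Page Views divided by the number of distinct days with data."""
    total = sum(views for _, views in daily_rows)
    distinct_days = len({day for day, _ in daily_rows})
    return total / distinct_days


# Hypothetical month: 23 days with 100 Page Views, then 7 days with 90.
start = date(2022, 6, 1)
month = [(start + timedelta(days=i), 100 if i < 23 else 90) for i in range(30)]
week = month[-7:]   # "last week" is simply the last 7 days of that month

print(page_views_per_day(week))             # 90.0
print(round(page_views_per_day(month), 2))  # 97.67
```

Counting distinct days rather than rows keeps the result correct when a table has several rows per day, for example one per page.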
Example 3: How long has that page been online?
Running an online business takes a whole lot of people. One of the situations we as the analysts encounter every once in a while is where someone approaches us, asking how long a certain page or teaser or marketing campaign has been online. While it is clear that asking the analyst such a thing should only be the last resort, those situations happen more often than we would like to admit. Luckily, our Days metric from the last example can help us here as well:
Similar to example two, we count the number of days we received data, but this time with the Page dimension instead of date ranges. This shows that the page from the first row has been online for 21 days, while the second one is relatively new. By normalizing the Page Views, we can see that the new page is performing much better than the old page and should definitely be kept online!
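The same idea, grouped by page instead of by date range, might look like this in Python. The pages, dates, and numbers are invented for illustration. Note that this counts days on which data was received, so a page that was online but got no traffic on a given day would not have that day counted.

```python
from collections import defaultdict

# Hypothetical rows: (page, day, Page Views on that day).
rows = [
    ("Old Page", "2022-06-01", 10),
    ("Old Page", "2022-06-02", 10),
    ("Old Page", "2022-06-03", 10),
    ("Old Page", "2022-06-04", 10),
    ("New Page", "2022-06-03", 30),
    ("New Page", "2022-06-04", 50),
]

days = defaultdict(set)    # distinct days with data, per page
views = defaultdict(int)   # total Page Views, per page
for page, day, page_views in rows:
    days[page].add(day)
    views[page] += page_views

# (days with data, Page Views per day) for each page
report = {p: (len(days[p]), views[p] / len(days[p])) for p in views}
print(report)  # {'Old Page': (4, 10.0), 'New Page': (2, 40.0)}
```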
Example 4: How much technical debt and complexity slow us down?
Another challenge when developing digital experiences comes with app development. Over time, there will be a lot of legacy versions of our app out there, sometimes even with legacy versions of our backend to maintain. When prioritizing which of those old versions to sunset, it can be important to know which type of app has the biggest long tail of old versions. We can make this visible by counting different app versions like this:
We can do the same thing with the Operating System dimension to count different OS versions as well. When used with the Operating System Types dimension, those metrics give us insights like this:
We clearly see an impressive longtail on Android app versions compared to iOS (first column), accounting for 76% of versions out there. On the other hand, we have to support way more operating system versions under iOS (second column) with 71 individual versions. This information can be used when planning feature development and testing efforts for our app.
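In code terms, these two metrics are distinct counts of two different dimensions, broken down by a third one. A small Python sketch with invented hits and version numbers:

```python
from collections import defaultdict

# Hypothetical hits: (operating system type, app version, OS version).
hits = [
    ("Android", "1.0", "Android 11"),
    ("Android", "1.1", "Android 11"),
    ("Android", "1.2", "Android 12"),
    ("iOS", "2.0", "iOS 15.4"),
    ("iOS", "2.0", "iOS 15.5"),
    ("iOS", "2.0", "iOS 16.0"),
]

app_versions = defaultdict(set)
os_versions = defaultdict(set)
for os_type, app_version, os_version in hits:
    app_versions[os_type].add(app_version)
    os_versions[os_type].add(os_version)

for os_type in app_versions:
    print(os_type, len(app_versions[os_type]), len(os_versions[os_type]))
# Android 3 2
# iOS 1 3
```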
Example 5: Segmenting our most engaged users
My favorite application of the Approximate Count Distinct function is using it together with Segments. A great detail in this context: In segments, the count is always precise, with no approximation like in metrics. The function is again quite well hidden, right at the bottom of the comparison operators for dimensions in the Segment Builder:
In this segment, I’m asking Analytics for Visits where the user has been on two or more individual pages. We can compare this to the Single Page Visits metric when used together with Visits:
As we can see, this segment nicely complements the Single Page Visits metric, showing us how many people who opened a page also moved to another page. We can again do this with any dimension (like product, video asset, teaser, etc.) and even on Visitor level. By creating multiple of those segments, we can put together a nice histogram of how many pages our users visit in a session and how that value changes along the user journey:
This gives us a nice view on how engaged our users really are depending on how often they came back. In this case, we can see the engagement starts to rise from visit 3 onward. We can track this over time and compare it between products or releases quite nicely.
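The logic of such a segment, and of the histogram built from several of them, can be sketched in Python as follows. The visit IDs and pages are made up, and the sketch only mirrors the idea of “Visits with two or more distinct pages”, not the Segment Builder itself.

```python
from collections import Counter, defaultdict

# Hypothetical hits: (visit id, page).
hits = [
    ("visit-1", "Home"),
    ("visit-1", "Home"),
    ("visit-2", "Home"),
    ("visit-2", "Product A"),
    ("visit-3", "Home"),
    ("visit-3", "Product A"),
    ("visit-3", "Cart"),
    ("visit-4", "Help"),
]

pages_per_visit = defaultdict(set)
for visit_id, page in hits:
    pages_per_visit[visit_id].add(page)

distinct_counts = {v: len(p) for v, p in pages_per_visit.items()}

# "Segment": visits where the distinct count of pages is 2 or more.
multi_page_visits = sorted(v for v, n in distinct_counts.items() if n >= 2)

# Histogram: how many visits saw exactly n distinct pages.
histogram = Counter(distinct_counts.values())

print(multi_page_visits)          # ['visit-2', 'visit-3']
print(sorted(histogram.items()))  # [(1, 2), (2, 1), (3, 1)]
```

Note that visit-1 has two Page Views but only one distinct page, so it stays out of the segment.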
I hope I could give you some inspiration and some useful examples of how the Approximate Count Distinct function can be used in Calculated Metrics and Segments. To my personal delight, it is also available in Adobe’s Customer Journey Analytics, with some very exciting applications as well. There are tons of other, more specific use cases out there and I would love to hear about yours! Have a great day!
Frequently asked questions
Super easy, just use the Count Distinct function in segments. You can create a few segments for clusters of users and their engagement.
German Analyst and Data Scientist working in and writing about (Web) Analytics and Online Marketing Tech.
2021 & 2022 Adobe Analytics Champion