Previously in this little series, we looked at how we can describe our trended data by using the statistical Mean and Standard Deviation. While this works quite well for data that doesn’t change much over time, it is rather limited when it comes to taking trends into account. With this post, we are doing something about that issue by using Linear Regression techniques.
At the end of this post, you will have an Analysis Workspace project like the one below, where we can judge trends in data and see changes over time:
Let’s get our hands dirty!
Limitations of Mean and Standard Deviation
Before we start, I want to explain the problem outlined above a bit better. Please consider the following graph I generated with the Workspace from the previous post and some demo data:
What we see is a clear trend in our data, since our daily Unique Visitors are going up over time. This leads to the huge relative traffic fluctuation. And even with this big fluctuation, a lot of our data is outside the expected range. The reason behind this is that our Mean is not changing over time but is calculated for the whole date range. This works fine if our metric doesn’t change over time, but not for this example. We need something else!
Linear Regression and Corridor of Expected Values
So, back to the drawing board. We start with our table with only Unique Visitors. But instead of the Mean, we are going to add an advanced Calculated Metric to get the simple result from a Linear Regression. We are using Unique Visitors as Y and our trusty Incrementor as X, so our Metric looks like this:
Drag that Metric into our table and Graph to see how well it works:
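For a quick cross-check outside of Workspace, the same kind of fit can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `linear_regression` and the demo numbers are made up, and the Incrementor is assumed to simply count the rows 1, 2, 3, ... in date order.

```python
# Hypothetical demo data: daily Unique Visitors, oldest day first.
unique_visitors = [100, 130, 110, 160, 150, 190, 170]


def linear_regression(y):
    """Least-squares fit of y against a 1..n row counter (the assumed Incrementor)."""
    n = len(y)
    x = list(range(1, n + 1))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared X deviations and sum of X/Y cross deviations.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = linear_regression(unique_visitors)

# One regression value per row, comparable to the Metric column in the table.
regression_line = [intercept + slope * x for x in range(1, len(unique_visitors) + 1)]
print(regression_line)
```

The list `regression_line` plays the role of the new column in the table: one estimated value per day.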
That’s more like it! We can see that our Regression Line fits our data much better. Now we need to recreate our corridor for expected values, but there is one small issue: because the “normal” Standard Deviation measures the deviation from the Mean, it would be way higher than what we want here. This means we have to calculate something similar ourselves, replacing the Mean with our Regression Line. The formula we have to use looks like this:
First, we need to take the deviation from the Regression and square it. This can be done with a Calculated Metric like this:
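Written out, one way to state this deviation is shown here. The symbols are assumed notation, not necessarily those of the original figure: y_i is the Unique Visitors value in row i, ŷ_i is the Linear Regression value for the same row, and n is the number of rows.

```latex
\sigma_{\text{reg}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

It has the same shape as the Standard Deviation, with the Mean swapped for the regression value of each row.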
The table confirms it works as expected:
Next, we need to sum those values with the Column Sum function and divide the result by the number of rows, which we get by applying the Column Maximum function to our Incrementor Metric. While we are at it, we can also take the Square Root of everything, giving us this handy Metric:
Intuitive, right? Let’s compare it to the normal Standard Deviation, so we drag both Metrics into our table:
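The three steps (square the deviation, sum and divide by the row count, take the square root) can be sketched in Python as follows. The function name `regression_deviation` and the demo data are hypothetical, and `statistics.pstdev` is only assumed to be a fair stand-in for the Standard Deviation that Workspace reports.

```python
import math
import statistics


def regression_deviation(y):
    """Root of the mean squared deviation of y from its own regression line."""
    n = len(y)
    x = list(range(1, n + 1))  # assumed Incrementor: 1..n
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Step 1: squared deviation from the regression value, per row.
    squared = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    # Steps 2 and 3: column sum divided by the row count, then the square root.
    return math.sqrt(sum(squared) / n)


# Hypothetical demo data with an upward trend.
unique_visitors = [100, 130, 110, 160, 150, 190, 170]
print(regression_deviation(unique_visitors))
print(statistics.pstdev(unique_visitors))  # deviation from the Mean, for comparison
```

For trending data the first number should come out smaller than the second, because a fitted line can never describe the data worse than a flat line at the Mean.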
Looks promising! We can see that the deviation from our Regression is much smaller than the deviation from the Mean. Just as we suspected! Now we create the same Metric as in the last Post for the upper and lower corridor of expected values, by adding and subtracting the Deviation from the Linear Regression Value. Our Graph and table look like this now:
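In code, the corridor is nothing more than the regression value plus and minus the deviation for every row. A minimal sketch, with a hypothetical function name and made-up input numbers:

```python
def expected_corridor(regression_values, deviation):
    """Return the upper and lower bound of the expected range for every row."""
    upper = [value + deviation for value in regression_values]
    lower = [value - deviation for value in regression_values]
    return upper, lower


# Hypothetical inputs: three regression values and a deviation of 20.
upper, lower = expected_corridor([100.0, 116.0, 132.0], 20.0)
print(upper)
print(lower)
```

Because the regression value rises with the trend, the whole corridor rises with it, unlike the flat corridor around the Mean.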
Now we could count the days above and below expectation, but we already did this in the last post. Instead, we will use the power of Linear Regression to give some more information about our estimation!
Advanced Quality Metrics for Linear Regressions
Because we are using Linear Regression, we can get a lot of “meta-information” about our estimation. There are two very interesting functions we haven’t used yet if we search for “linear regression” in the Metric Builder:
We are going to use the Slope function first. We create a Calculated Metric with Unique Visitors as Y and our Incrementor as X variable. When we drag it into our table, we can see what it does:
That’s very handy! This new Metric shows us how much our Unique Visitors change each day. In our example with demo data, we gain 16 Unique Visitors per day on average! But there is even more: Let me introduce you to my dear friend, the Coefficient of Determination.
Most people have heard of Correlation, which is a number from -1 to 1 that describes how strongly two variables are linearly related. A value of -1 or 1 means perfect correlation, where both values change in lockstep (in opposite directions for -1), whereas 0 means no linear correlation. This is what Analytics gives us out-of-the-box for a Linear Regression, as you can see in the screenshot of available functions above.
But there is something we can get very quickly that is even more impressive, called the Coefficient of Determination. It is what we get when we square the Correlation Coefficient. It measures how much of the variation in our data can be explained by our Regression, which is some awesome information. It quantitatively expresses the quality of our estimation. Let’s create it! The definition is very simple in the Metric Builder: we just need the Linear Regression Correlation Coefficient and the Power Operator. Use Percent as the format:
Once again we drag it into our Table to see the result for our example data:
Wonderful! Our current Regression explains 62% of the variation we see in our data. That is a lot!
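To see that squaring the Correlation Coefficient really yields the share of explained variation, here is a small Python sketch. The function name and the demo data are hypothetical, and the Incrementor is again assumed to be a plain 1..n counter.

```python
import math


def coefficient_of_determination(y):
    """Square of the correlation between y and a 1..n row counter."""
    n = len(y)
    x = list(range(1, n + 1))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Correlation Coefficient; undefined if y is constant (syy == 0).
    r = sxy / math.sqrt(sxx * syy)
    return r ** 2


# Hypothetical demo data with an upward trend.
unique_visitors = [100, 130, 110, 160, 150, 190, 170]
r_squared = coefficient_of_determination(unique_visitors)
print(f"{r_squared:.0%}")  # formatted as a percentage, like in Workspace
```

For a simple regression with one X variable, this value equals one minus the ratio of the squared deviations from the Regression to the squared deviations from the Mean, which is why it can be read as "explained variation".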
Putting it all together
We achieved a lot today. We created a Linear Regression Model for our data, a custom deviation Metric, upper and lower bounds for the expected values, and some quality metrics! If you feel accomplished: Awesome, you should!
Let’s once again drag all of our awesome metrics into a table and link it to a Graph and some Summary Numbers. We can even use our Mean from before:
If you are like me, building this has been a lot of fun! As always: Let me know about the awesome stuff you did based on this!