Correlation, Causation, Analytics and Pumpkins

Informing Predictive Analytics with Automated Analytics and Human Insight

I saw a tweet that made me laugh at Halloween…

It’s all too easy to infer causation from correlation: butterflies flapping their wings in Japan causing lightning storms in Nova Scotia… migration patterns of geese affecting the price of sausages… or an application volume driver causing CPU utilisation on the VMs in a web tier to rise with an ‘m value’ of 6.3.

(I had to bring it round to something relevant!)

Many tools claim to find the correlations in your data automatically, but how do you know if they are the right correlations?  How do you know you can trust the model that’s been encoded?  More often than not, this is where a different type of data source is required to inform the model: people.

The people inside the organisation know the application better than anyone else.  They can often explain those outliers: changes that took place, upgrades, patch releases…  You can detect from the data that a change occurred, and you can often see the impact of that change, but more often than not you can’t determine what caused it, and therefore what the appropriate response should be.

Relationships between the different components of an end user service are complex, and the behaviour of those components changes under different load conditions.  Not everything behaves in the same way, and this is when correlation and causation start to become confusing.

Here’s an example…

We’re looking at correlations between CPU utilisation and a business metric, ‘calls per minute’, in the chart below.  There are a number of outliers towards the top of the Y-axis, but a general trend can be seen.
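To make that ‘general trend’ concrete, here is a minimal sketch of fitting a straight line to CPU utilisation against calls per minute.  It assumes the ‘m value’ is simply the slope of a least-squares line, which may not be how any particular tool defines it.  The function name and the sample figures are made up for illustration and are not the data behind the chart.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = m*x + c. Returns (m, c, r), where r is Pearson's r."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    m = sxy / sxx                   # slope: CPU % added per extra call per minute
    c = my - m * mx                 # intercept: CPU % at zero call volume
    r = sxy / (sxx * syy) ** 0.5    # strength of the linear relationship
    return m, c, r

# Hypothetical samples, NOT the data from the chart.
calls_per_minute = [100, 200, 300, 400, 500, 600]
cpu_percent = [12.0, 19.5, 31.0, 38.5, 52.0, 59.0]

m, c, r = fit_line(calls_per_minute, cpu_percent)
print(f"m = {m:.4f}, c = {c:.2f}, r = {r:.3f}")
```

A high r here only says the two series move together.  It says nothing about why, which is the whole point of bringing people into the conversation.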


Let’s have a look at the time series in question.  The top time series is the volume metric and the bottom series is the CPU utilisation.  There are a few interesting things you can see from this.

You can see a clear daily and weekly pattern, with call volumes dropping over the weekend.  On the CPU chart, you can also see a daily and weekly pattern, but with a big spike in CPU at the weekend.  Most interesting, though, is that after that spike the daily peak on CPU is much lower than it was previously.  What changed?  Was this an upgrade?  Is it part of a regular cycle?  (We don’t have enough historic data to know.)  Was a configuration change made?  What impact does this have on our projections?

The chart below shows different correlations based on different samples of the data.  The green line shows us the correlation between CPU and volumes prior to the weekend.  The orange line shows the same but after the change, where CPU utilisation was lower.
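As a rough sketch of how those two lines could be produced, the samples can be split at the point of the change and a separate line fitted to each side.  Everything here is an assumption for illustration: the change day, the helper name and the numbers are invented, and the green and orange lines in the chart may have been derived differently.

```python
from statistics import mean

def slope_intercept(x, y):
    """Least-squares slope and intercept for y = m*x + c."""
    mx, my = mean(x), mean(y)
    m = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return m, my - m * mx

# Hypothetical (day, calls_per_minute, cpu_percent) samples, NOT the real data.
samples = [
    (1, 200, 22.0), (2, 400, 42.0), (3, 600, 62.0), (4, 500, 52.0),
    (8, 200, 14.0), (9, 400, 26.0), (10, 600, 38.0), (11, 500, 32.0),
]
CHANGE_DAY = 6  # assumed day of the weekend change

# Keep only (volume, cpu) pairs on each side of the change.
before = [(v, cpu) for day, v, cpu in samples if day < CHANGE_DAY]
after = [(v, cpu) for day, v, cpu in samples if day > CHANGE_DAY]

m_before, c_before = slope_intercept(*zip(*before))   # the 'green line'
m_after, c_after = slope_intercept(*zip(*after))      # the 'orange line'

print(f"before: m = {m_before:.3f}, c = {c_before:.2f}")
print(f"after:  m = {m_after:.3f}, c = {c_after:.2f}")
```

The code can tell you the slope changed.  Choosing which slope to trust still depends on knowing what happened over that weekend.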


Using this information, we can now have good and meaningful conversations with application architects and service managers that help us really build the starting model we need.

Do we use our initial findings and assume the weekend spike was part of a regular cycle for which we don’t have data?  What caused the CPU spike over the weekend?  Does it have anything at all to do with the CPU change the following week, or is that just a coincidence?  Do we need more data to train our model?  Do we use the ‘post change’ correlation, as indicated by the orange line in the chart above?

It turns out there was an upgrade over the weekend.  It wasn’t part of a regular cycle, and it wasn’t something that we would want in our modelled correlations.  It also had the benefit of bringing the CPU impact of call volumes down.  This means that we can pick the correlation shown by the orange line in the above chart as the starting point for modelling from here onwards.  From then on, we retrain our model on a regular basis to keep the growth function up to date and to detect any further anomalies.
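One simple way to picture that retraining loop is to refit the line on the most recent window of post-change samples and flag any new observation that lands too far from the fitted line.  This is only a sketch under assumed names, window size and tolerance.  It is not a description of how Sumerian CPaaS or any other product does it.

```python
from statistics import mean

def fit(x, y):
    """Least-squares slope and intercept for y = m*x + c."""
    mx, my = mean(x), mean(y)
    m = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return m, my - m * mx

def retrain(history, window):
    """Refit the model using only the most recent `window` samples."""
    x, y = zip(*history[-window:])
    return fit(x, y)

def is_anomaly(model, point, tolerance):
    """True if observed CPU differs from the model's prediction by more than tolerance."""
    m, c = model
    volume, cpu = point
    return abs(cpu - (m * volume + c)) > tolerance

# Hypothetical post-change (calls_per_minute, cpu_percent) history.
history = [(200, 14.2), (300, 19.8), (400, 26.1),
           (500, 31.9), (600, 38.0), (350, 23.0)]

model = retrain(history, window=6)   # window size is an assumption
TOLERANCE = 5.0                      # CPU percentage points, an assumption

for point in [(500, 32.5), (500, 55.0)]:
    label = "anomaly" if is_anomaly(model, point, TOLERANCE) else "ok"
    print(point, label)
```

An anomaly flagged this way is a prompt for another conversation with the service team, not an answer in itself.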

Service-led capacity management services such as Sumerian CPaaS provide the cloud compute, long-term data persistence, predictive modelling and data science capabilities needed to understand service behaviour as patterns of service use change over time and across business cycles.

Combining this capability with data science and the business knowledge of the service and infrastructure teams directly informs service risk and resilience outcomes, vital information for the business.

So, back to our pumpkin… blindly inferring causation from correlation is a scary thing and needs to be avoided.  But with a bit of data science and interaction with the right people, those correlations can really make a difference to your bottom line.

November 1st, 2017