Predictive analytics is somewhat mysterious. So, let’s shed some light on it.
(Note that I’m simplifying this quite a bit to be accessible.)
The first step in predictive analytics is to understand what you’re predicting. We’ll call this the Y variable.
In this instance, I'm curious about 'how many visits from Boston can I expect on a given day?' My Y will be 'Visits'.
Have some discipline. I see way too many analysts change the Y variable before their investigation is through.
The second step is to identify all the variables that might be associated with a variation in Y. These might include factors like paid media, search, new visits, returning visits – and date. Then there are paid campaigns, posting new content, social campaigns, traditional media spend, promotions, and so on. Day of the week is another key variable, along with statutory holidays, and extending out to other factors like weather and creativity.
The third step is to extract, transform, and load the data you CAN actually access. You can spend months fighting to build an absolutely complete model, or you CAN start putting together a story with the facts that are available. I chose action over inertia. You should too.
That date field is usually pretty painful to extract, transform, and load. There are functions in both Excel and SPSS that handle dates, with some difficulty. Devils abound in the details around 'which date, where in the world'. If your installation is set to Eastern Time, and most of your traffic comes from Australia, you'll be lagged by a day. You ought to adjust the figures using the appropriate offset.
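That offset adjustment can be sketched in a few lines of Python. The timestamp below is invented for illustration; the point is that a late-evening hit recorded in Eastern Time lands on the next calendar day in Australia, which is exactly the one-day lag described above.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Hypothetical hit, stamped in the installation's Eastern Time zone.
hit_eastern = datetime(2012, 3, 14, 22, 30, tzinfo=ZoneInfo("America/New_York"))

# Re-express it as the visitor's local time in Sydney.
hit_sydney = hit_eastern.astimezone(ZoneInfo("Australia/Sydney"))

print(hit_eastern.date())  # the day your installation records
print(hit_sydney.date())   # the day the visitor actually experienced
```

Bucketing daily visits by the visitor's local date, rather than the installation's, removes the lag.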
The figure below is what I could extract from Google Analytics in about an hour. (Collinearity abounds!)
The fourth step is to run the math against your model.
I use SPSS to run a regression. If you don't have SPSS, you can try open source programs like Octave or R. The reason for using software is that it's annoying to do by hand. I didn't have a copy of SPSS at my first research position, so I had to code out linear regression in Excel. I learned a lot, but it is not expedient!
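If you go the open source route, the same ordinary least squares fit takes a few lines. Here's a sketch in Python with numpy, run against a small synthetic dataset rather than the actual Boston export:

```python
import numpy as np

# Synthetic daily visit counts and a weekend dummy variable.
# Substitute your own exported data here.
visits       = np.array([5, 6, 4, 5, 6, 3, 3, 5, 4, 6, 2, 3, 5, 4], dtype=float)
istheweekend = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0], dtype=float)

# Design matrix: a column of ones (the constant) plus each X variable.
X = np.column_stack([np.ones_like(visits), istheweekend])

# Ordinary least squares: solve for the constant and the beta.
coefs, *_ = np.linalg.lstsq(X, visits, rcond=None)
constant, b1 = coefs
print(f"Visits = {constant:.3f} {b1:+.3f}(istheweekend)")
```

The two numbers that come out are the same constant and beta an SPSS regression table would report for this data.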
The figure below is the output from the software.
The way to read the table is Y = Constant + B1(X1) + B2(X2).
So, Visits = 4.888 – 1.872 (istheweekend).
If it’s the weekend, I can predict Visits = 4.888 – 1.872(1), which equals 3.016 – roughly 3 visits.
If it’s not the weekend, I can predict Visits = 4.888 – 1.872(0), which equals 4.888.
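Those two predictions amount to a tiny helper function. Here's one in Python with the fitted coefficients hard-coded; this is an illustration of reading the table, not SPSS output:

```python
# Plug the weekend flag (0 or 1) into the fitted equation.
def predicted_visits(istheweekend):
    return 4.888 - 1.872 * istheweekend

print(predicted_visits(1))  # weekend: about 3 visits
print(predicted_visits(0))  # weekday: about 4.9 visits
```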
Not bad for Boston traffic! And I understand the impact of a single variable on visits.
My dataset is incredibly spiky. So, what’s causing some of that spikiness? I went through all the dates that I posted new content, reran the math, and got the table below.
The model above is the better of the two. It explains 12.7% of the variance in the set.
The equation is: Visits = 4.496 – 1.76(istheweekend) + 2.482(newpost).
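Where does a variance-explained figure like that come from? R-squared is one minus the residual sum of squares over the total sum of squares. A sketch with synthetic numbers, not the Boston data:

```python
import numpy as np

# Invented observed values and model predictions, for illustration only.
y      = np.array([5.0, 6.0, 3.0, 7.0, 2.0, 6.0])
y_pred = np.array([5.2, 5.8, 3.5, 6.4, 2.9, 5.6])

ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```

Statistical packages report this for you, but computing it once by hand makes 'explains 12.7% of the variance' concrete.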
I can tell – according to this version of reality – that if I want the maximum bump from Boston, posting during the weekday is best. And I can tell the proportional impact of each variable.
Sometimes this answer is good enough. There are more advanced methods – like curvilinear regression, machine learning, and neural networks. There are ways to introduce more variables into the equation. But typically – this method is sufficient to get a first idea about the relationships among variables and their relative importance, rooted in fact, as opposed to gut bias.
The fifth step is to make decisions based on scenarios.
If you take this equation and plot it out, you can engage in a few what-ifs. Would writing more weekend friendly material result in a lower Beta? Would increasing the frequency of new posts drastically improve the performance of the website? If so, by how much? The size of the newpost beta, as compared to the total number of Boston visits per day, hints at that relative strength.
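Playing out those what-ifs is just plugging scenario values into the fitted equation. A sketch in Python, with the scenario mix invented for illustration:

```python
# Evaluate the two-variable equation across a grid of scenarios.
def scenario_visits(istheweekend, newpost):
    return 4.496 - 1.76 * istheweekend + 2.482 * newpost

scenarios = {
    "weekday, no post":  (0, 0),
    "weekday, new post": (0, 1),
    "weekend, no post":  (1, 0),
    "weekend, new post": (1, 1),
}
for name, (wk, post) in scenarios.items():
    print(f"{name}: {scenario_visits(wk, post):.3f}")
```

The gap between the weekday-with-post and weekday-without-post rows is the bump a new post is worth under this model.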
That’s the power of predictive analytics.