Commentary on the proposed telescreens
You may have read something about the Samsung 7500 and 8000 series televisions, the ones with a camera installed in them, over the past few days.
The tl;dr summary:
“For Samsung’s 7500 and 8000 series TVs, all you have to do is say “Hi, TV,” when you walk into a room for the TV to turn on and know who’s there.”
“Think of it: The tech means an advertiser or TV programmer could, for the first time, know which members of a Nielsen household are watching a show or an ad. Cisco has even developed a system meant to read facial expressions and determine whether you’re entertained or bored.”
“Many people in the living room are multitasking with other devices. “We’re paying for that,” said Rex Harris, innovations supervisor at SMGX, a unit of ad agency holding company Publicis Groupe. “If you’re looking at other screens, then you’re not paying attention. We would like to know if we’re getting accurate impressions.””
Commentary:
Alright – so – a simple innovation, the webcam, is jumping from the PC/DVR into a TV, and we get a few folks who come out and speculate what it could mean. It all ends up sounding like a 1984 telescreen idea, which, I’m 99% certain, is not what Samsung has/had in mind.
Broadcast isn’t digital.
Repeat: broadcast. isn’t. digital.
This has implications:
- There is enough inventory for targeted ads and offers in digital because the technology enables the creation of multiple ad treatments at scale. No such technology exists in the broadcast industry.
- People already effectively segment themselves by TV show preference.
- On-demand technologies like Netflix, and time-shifting technologies like streaming and DVRs, are already eroding the concentration of key market segments.
- Plot the S-curve adoption rate of the technologies driving market fragmentation against the adoption of new, Big-Brother enabled telescreens, and see which wins. (Hint: it’s time shifting and on-demand).
- You’re paying for junk impressions because we’re developing ad blindness, just like we’ve developed banner blindness.
No amount of surveillance is going to change that fact.
Jan
15
- Christopher Berry
- Analytics, Strategic Analytics
Find Hidden Patterns in Big Data – A Commentary on MINE, Reshef et al (2011)
You may have read something about ‘Detecting Novel Associations in Large Data Sets’, a paper by David N. Reshef et al. that appeared in Science, 334, 1518 (2011). You can check out the software here.
This is an initial commentary and an explanation about what it’s all about.
The Longer You Look, The More Likely Error will Find You
Take a very large dataset, say, all the customers of AT&T and their calling records 2001-2011, and divide it into two random but equal sets. Say you didn’t have any hypothesis at all. You just wanted to see what was related to what in that set. Say each customer record has 5000 features, including gender, date of birth, credit score, average call duration, most frequently dialed number, and so on. (Note to statisticians: Assume a Pearson R correlation matrix, skip next paragraph).
Assume, further, that you’re going to compare each feature against every other. So, you compare all the ages against all the dates of birth, then all the ages against the credit scores, and so on. The strength of the relationship between any two features is expressed by a single number. The higher that number is, the stronger the relationship between the two. For instance, we might find that credit score and age are tightly correlated – the older someone is, the more likely their credit score is to be positive.
You’re likely to find clearly incorrect relationships in such a large table, just by accident. You might find that in Dataset A, for instance, there’s a statistically significant relationship between being a Virgo and having a negative credit score. There might be a relationship between average call duration and being a Capricorn. You know that such a result doesn’t make sense. Why would zodiac sign (derived from date of birth) affect those things? The way that chance works in such large tables is that the longer you look for significant features, the more likely it is that you’ll find a relationship that doesn’t in fact hold in the real world.
In fact, most of those relationships would disappear in Dataset B. However, new, clearly untrue relationships would appear in Dataset B that don’t exist in Dataset A. When you’re dealing with thousands of features, the likelihood of such phenomena increases. And that’s even holding everything we know about probability to be true.
In sum, a big reason why you go into a dataset with a hypothesis is to reduce the risk of coming up with something that is wrong, and very unlikely to be repeatable in other datasets.
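To make the Dataset A versus Dataset B point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the features are pure random noise (so no real relationship exists anywhere), the table is far smaller than 5000 features, and the 0.28 threshold is only roughly where a Pearson R becomes "significant" at the 5% level for 50 rows.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def strong_pairs(features, threshold):
    """Return the set of feature index pairs whose |r| exceeds the threshold."""
    found = set()
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if abs(pearson_r(features[i], features[j])) > threshold:
                found.add((i, j))
    return found

def random_table(rng, n_features, n_rows):
    """A table of unrelated noise: each feature is an independent random column."""
    return [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

rng = random.Random(1)  # fixed seed so the run is repeatable
dataset_a = random_table(rng, n_features=40, n_rows=50)
dataset_b = random_table(rng, n_features=40, n_rows=50)

pairs_a = strong_pairs(dataset_a, threshold=0.28)
pairs_b = strong_pairs(dataset_b, threshold=0.28)

print("pairs compared:", 40 * 39 // 2)
print("'significant' in A:", len(pairs_a))
print("'significant' in B:", len(pairs_b))
print("'significant' in both:", len(pairs_a & pairs_b))
```

With 780 pairs compared, a few dozen will typically clear the bar in each dataset by chance alone, and very few of them will be the same pairs in both.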
Linear, Cubic, Exponential, Parabolic, Ellipse
Not all relationships are straight lines. Indeed, especially in certain types of logistic regression, we can get very amazing, very beautiful and complex shapes separating one case from another. Diaper usage plotted against age is a parabolic relationship. Think about it. You use a lot of them when you’re very young, and you go through a lot of them when you’re very old. You don’t need many of them in between. Linear regression wouldn’t perform very well in detecting that pattern.
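A quick sketch of why a straight-line measure misses this. The diaper curve below is made up (a parabola centred on age 50, chosen only for illustration), and `pearson_r` is a hand-rolled helper rather than a library call.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

ages = list(range(0, 101))
# Hypothetical usage: high at both ends of life, near zero in the middle.
diapers_per_week = [(age - 50) ** 2 / 250 for age in ages]

# Straight-line view: age against usage.
r_linear = pearson_r(ages, diapers_per_week)

# Curved view: squared distance from mid-life against usage.
distance_from_midlife = [(age - 50) ** 2 for age in ages]
r_curved = pearson_r(distance_from_midlife, diapers_per_week)

print("r using age:", round(r_linear, 6))
print("r using (age - 50)^2:", round(r_curved, 6))
```

The relationship is perfectly deterministic, yet the linear correlation comes out at essentially zero. It only shows up once you already know which curve to look for, which is exactly the gap a measure like MIC is meant to fill.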
Enter Reshef et al and MIC
MIC stands for Maximal Information Coefficient. Reshef et al invented a neat way of looking at relationships between variables that doesn’t rely solely on a key statistical test (Pearson R) to indicate that it’s there. The authors demonstrated how MIC manages to detect correlations between all these complex relationship types – Cubic, Exponential, Sinusoidal – and does it really well. They went further. They created a program that can mine very large datasets and suggest relationships to examine.
What’s the Problem?
Remember the idea that the longer you look, the more likely you are to find something false? The entire idea of hypothesis testing as the basis of quantitative analysis is an entrenched one. It’s an idea that causes resistance to advanced machine learning algorithms and pattern discovery. Reshef really did a great job in explaining the purpose of MIC. Reshef has merely stated that this is a hypothesis-informing machine. You can use the program and MIC to discover relationships that were once really quite hidden, or very, very difficult to discover without insanely expensive software. I think this is great.
The Opportunity
We’re generating huge amounts of data. The big-feature, big-data problem is increasingly common. This is a great tool to rapidly inform hypotheses – to become smarter before getting smarter. It’s a welcome advancement, and worthy of attention.
If you hear of MIC, just know that a MIC of 0.00 means that there is no correlation between two variables, and that a MIC of 1.00 indicates a perfect correlation between two variables. Be aware that MIC does not imply linearity between the variables; the relationship may be a much higher-order function. The next questions you should ask upon hearing a MIC score are ‘at what confidence level is it significant?’ and ‘what kind of relationship is it?’. Then deep dive.
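To give a feel for the 0-to-1 scale, here is a crude, grid-based stand-in for MIC. It is not the algorithm from the paper: the real method searches for the best placement of grid lines, while this sketch only tries equal-frequency grids. The function names are hypothetical, and the cap of n^0.6 grid cells is an assumption taken from the paper's suggested default.

```python
import math
import random
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 by rank (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def mutual_information(bins_x, bins_y):
    """Mutual information, in bits, between two lists of bin labels."""
    n = len(bins_x)
    joint = Counter(zip(bins_x, bins_y))
    count_x = Counter(bins_x)
    count_y = Counter(bins_y)
    return sum(
        (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )

def crude_mic(xs, ys):
    """Best normalised mutual information over small equal-frequency grids."""
    budget = len(xs) ** 0.6  # assumed cap on the number of grid cells
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = mutual_information(
                equal_frequency_bins(xs, nx), equal_frequency_bins(ys, ny)
            )
            best = max(best, mi / math.log2(min(nx, ny)))
    return best

rng = random.Random(0)
xs = list(range(101))
parabola = [(x - 50) ** 2 for x in xs]   # deterministic but non-linear
noise = [rng.random() for _ in xs]       # unrelated to xs

mic_parabola = crude_mic(xs, parabola)
mic_noise = crude_mic(xs, noise)
print("parabola:", round(mic_parabola, 2))
print("noise:", round(mic_noise, 2))
```

The parabola scores close to 1 even though its Pearson R against x is zero, while the noise scores low but not exactly 0. That non-zero floor for pure noise is why the significance question matters before any deep dive.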
I’m excited.
Dec
18
- Christopher Berry
- Complexity Analytics, Data Science
How to predict how many visits a website will receive on a given day
Predictive analytics is somewhat mysterious. So, let’s shed some light on it.
(Note that I’m simplifying this quite a bit to be accessible.)
The first step in predictive analytics is to understand what you’re predicting. We’ll call this the Y variable.
In this instance, ‘how many visits from Boston can I expect on a given day’. My Y will be ‘Visits’.
I’m curious about it.
Have some discipline. I see way too many analysts change the Y variable before their investigation is through.
The second step is to identify all the variables that might be associated with a variation in Y. These might include factors like paid media, search, new visits, returning visits – and date. Then there are paid campaigns, posting new content, social campaigns, traditional media spend, promotions, and so on. Day of the week is another key variable, along with statutory holidays, and extending out to other factors like weather and creativity.
The third step is to extract, transform, and load the data you CAN actually access. You can spend months fighting to build an absolute complete model, or, you CAN start putting together a story with the facts that are available. I chose action over inertia. You should too.
That date field is usually pretty bad to extract, transform, and load. There are functions in both Excel and SPSS that handle dates, with some difficulty. Devils abound in the details around ‘the date where in the world’. If your installation is set to Eastern Time, and most of your traffic comes from Australia, you’ll be one day lagged. You ought to adjust the figures using the appropriate offset.
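A small sketch of that one-day lag, using only Python's standard library. The timestamp is invented, and the fixed UTC offsets are a simplifying assumption (Eastern Standard Time at UTC-5, Sydney daylight time at UTC+11). A real adjustment would use proper time zone rules so that daylight saving changes are handled.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets; these ignore daylight saving transitions.
REPORTING_TZ = timezone(timedelta(hours=-5))   # analytics install on Eastern Standard Time
VISITOR_TZ = timezone(timedelta(hours=11))     # visitor in Sydney during daylight time

def visitor_local_date(reported, reporting_tz=REPORTING_TZ, visitor_tz=VISITOR_TZ):
    """Convert a naive timestamp recorded in the reporting zone to the visitor's calendar date."""
    return reported.replace(tzinfo=reporting_tz).astimezone(visitor_tz).date()

# Hypothetical visit logged at 8pm on a Tuesday, Eastern time.
reported = datetime(2011, 11, 15, 20, 0)
local_date = visitor_local_date(reported)

print("reported date:", reported.date().isoformat())
print("visitor's date:", local_date.isoformat())
```

The report files the visit under Tuesday, but for the visitor it was already Wednesday at noon. That matters as soon as day of the week becomes a variable in the model.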
The figure below is what I could extract from Google Analytics in about an hour. (Collinearity abounds!)
The fourth step is to run the math against your model.
I use SPSS to run a regression. If you don’t have SPSS, you can try using open source programs like Octave or R. The reason for using software is that it’s annoying to do by hand. I didn’t have a copy of SPSS at my first research position, so I had to code out linear regression in Excel. I learned a lot, but it is not expedient!
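For a sense of what the software is doing, here is a minimal sketch of a one-variable least-squares fit written out by hand. The seven days of visit counts are made up for illustration (they are not my Google Analytics data), and `fit_simple_ols` is a hypothetical helper name.

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variance in y that the fitted line explains.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Made-up week: 1 marks a weekend day, 0 a weekday.
is_weekend = [0, 0, 0, 0, 0, 1, 1]
visits = [5, 4, 6, 5, 4, 3, 3]

intercept, slope, r_squared = fit_simple_ols(is_weekend, visits)
print("constant:", round(intercept, 3))
print("B for istheweekend:", round(slope, 3))
print("R squared:", round(r_squared, 3))
```

With a single 0/1 variable, the constant is simply the weekday average and the B is the gap between the weekend average and the weekday average.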
The figure below is the output from the software.
The way to read the table is Y = Constant + B1(X1) + B2(X2).
So, Visits = 4.888 – 1.872 (istheweekend).
If it’s the weekend, I can predict Visits = 4.888 – 1.872(1). Which equals 3.016, or about 3 visits.
If it’s not the weekend, I can predict Visits = 4.888 – 1.872(0). Which equals 4.888.
Not bad for Boston traffic! And I understand the impact of a single variable on visits.
My dataset is incredibly spiky. So, what’s causing some of that spikiness? I went through all the dates on which I posted new content, reran the math, and got the table below.
The model above is the best. It explains 12.7% of the variance in the set.
The equation is: Visits = 4.496 – 1.76(istheweekend) + 2.482(newpost).
I can tell – according to this version of reality – that if I want the maximum bump from Boston, posting on a weekday is best. And I can tell the proportional impact of each variable.
Sometimes this answer is good enough. There are more advanced methods – like curvilinear regression, machine learning, and neural networks. There are ways to introduce more variables into the equation. But typically – this method is sufficient to get a first idea about the relationships among variables and their relative importance, rooted in fact, as opposed to gut bias.
The fifth step is to make decisions based on scenarios.
If you take this equation and plot it out, you can engage in a few what-ifs. Would writing more weekend-friendly material result in a lower Beta? Would increasing the frequency of new posts drastically improve the performance of the website? If so, by how much? The size of the newpost beta, as compared to the total number of Boston visits per day, hints at that relative strength.
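As a sketch, the four basic scenarios can be tabulated straight from the fitted equation. The coefficients are the ones reported above; the function name and layout are just for illustration.

```python
# Coefficients from the two-variable model above.
CONSTANT = 4.496
B_WEEKEND = -1.76
B_NEWPOST = 2.482

def predict_visits(is_weekend, new_post):
    """Predicted daily Boston visits; both inputs are 0 or 1."""
    return CONSTANT + B_WEEKEND * is_weekend + B_NEWPOST * new_post

scenarios = {
    "weekday, no new post": (0, 0),
    "weekday, new post": (0, 1),
    "weekend, no new post": (1, 0),
    "weekend, new post": (1, 1),
}

for label, (is_weekend, new_post) in scenarios.items():
    print(f"{label}: {predict_visits(is_weekend, new_post):.3f}")
```

A new post on a weekday comes out at about 7 predicted visits against roughly 4.5 without one, which is the kind of comparison the what-ifs are built on.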
That’s the power of predictive analytics.
Nov
16
- Christopher Berry
- Predictive Analytics
Siri and Search
Gary Morgenthaler had a few interesting statements to make:
“Therefore, when Siri was an independent company, its plan was to map these domains deeply and seamlessly to automate transactions for its users within them. For example, “Buy that Steve Jobs biography book and send it to my dad”; “Send a dozen yellow roses to my wife”; “Book me the usual table for 2 tonight at 8 p.m. at Giovanni’s”; and “Get me 2 box seats for the Giants game on Saturday.”
Then comes the question of what solves our biggest problems. Ultimately, Siri’s value is that of automation and removing “friction” on the Internet. Siri achieves this by: (1) understanding speech input in natural language form, (2) mapping user requests against its knowledge base (i.e., ontological domains) and (3) activating software “agents” to interact with Internet service providers to fulfill user requests.”
Source: TechCrunch
Let’s just forget Google for a minute and focus in on this combination of technologies.
- Understand.
- Map.
- Act.
That’s the general design pattern for a whole range of applications.
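A toy sketch of that three-step pattern follows. Nothing here reflects how Siri is actually built: the keyword matching stands in for real speech and language understanding, and every name (`understand`, `map_to_agent`, `act`, `AGENTS`) is hypothetical.

```python
import re

def understand(utterance):
    """Step 1: turn free text into a structured request (toy pattern matching only)."""
    match = re.match(
        r"send (?P<quantity>a dozen|\d+) (?P<item>.+) to (?P<recipient>.+)",
        utterance.lower(),
    )
    if match:
        return {"intent": "send_gift", **match.groupdict()}
    return {"intent": "unknown", "text": utterance}

def order_gift(request):
    """A stand-in 'agent'; a real one would call a merchant's service."""
    return f"Ordering {request['quantity']} {request['item']} for {request['recipient']}"

def give_up(request):
    return "Sorry, I can't help with that yet."

# Step 2: the 'knowledge base' is just a lookup from intent to agent.
AGENTS = {"send_gift": order_gift, "unknown": give_up}

def map_to_agent(request):
    return AGENTS.get(request["intent"], give_up)

def act(agent, request):
    """Step 3: hand the structured request to the chosen agent."""
    return agent(request)

def run(utterance):
    request = understand(utterance)
    return act(map_to_agent(request), request)

print(run("Send a dozen yellow roses to my wife"))
print(run("What is the meaning of life?"))
```

Anything the understand step cannot parse, or the map step has no agent for, falls straight through to the fallback.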
Certainly nothing new here.
They’ve solved a good problem. There are certain use cases for which Siri is a great solution.
He ignores the rest of the problem space. And that’s just fine. I don’t expect him to point out the subset of infinite use cases that Siri is woefully inadequate for.
Barriers, like a small keyboard, are soon to be resolved by virtual keypads and a range of next-generation hand gestures that are sensed, not tactilely received. I don’t see them as insurmountable.
Even Star Trek TNG made use of both voice and physical commands.
Siri is not a Google-search killer.
It is a nice complement.
Nov
09
- Christopher Berry
- Data Science, Design Thinking
Feb
12