Two Statements and Four Questions About Big Data

This series appeared on the Eyes on Analytics blog the week of May 14. It’s consolidated here in part because it was popular.

Consider the following two, distilled, points of view:

Statement 1:

“Big Data Analytics is going to change the way we do business. Sure, a lot of it will be routine “I’m okay!” status updates from sensors, but making sense of the key parts of it, like “help me, I’m failing”, will be extremely useful. Companies that were previously exempt from competing on analytics will be disrupted by new entrants who will compete better, either by being more effective or being more efficient. Big Data Analytics is already having a disruptive impact in marketing, where it never used to before, and is gaining huge traction in medicine. There is reason to believe that Big Data Analytics will cause better decision making in the organizations that chose to invest both in the physical infrastructure and in the cultural infrastructure that’s required to truly succeed.”

Statement 2:

“All the big industries that rely on data already have Big Data. Airlines, casinos, Internet arbitrage firms, logistics firms and especially finance already have all the data they need. Indeed, all of that data has made them dumber, not smarter. In fact, even in those sectors, there’s little evidence that executives use all of that data to make substantially better decisions, especially when it comes to big strategic decision making. Did anybody see the economy after 2008? This is all just a second wind of hype coming from the Business Intelligence industry, which has so far failed to make anybody smarter. Don’t buy the hype. The companies that have long competed on analytics, since the 1960′s, have nothing new to learn from this next wave of Big Data Analytics. Just ask the line managers what they need, they’ll tell you.”

Game of Trolls

It’s not fair to label those who hold statement 1 to be true as blind Gartner Hype Cycle finger clicking optimists out to make twelve points on the next deal.

And, it’s not fair to label those who hold statement 2 to be true as See-I-Told-You-So get off my lawn here we go again curmudgeons.

It is fair to say that some among us are trolling.

Let’s not play the troll game, at least, not for this week.

Let’s assume that both statements contain truth.

Four Questions:

  • Did anything really go wrong with Business Intelligence generally and Web Analytics specifically?
  • Where does the assumption that better data causes better decisions come from?
  • Is that assumption credible?
  • What questions really matter that would cause statement 1 to come true and mitigate the concerns expressed in statement 2?

Two statements, four questions. 

Which is really right, and under which circumstances?

Did anything go wrong in Business Intelligence generally and Web Analytics specifically?

The extreme skeptics towards Big Data Analytics argue that Business Intelligence failed. Others point to cases of success. Who’s right?

Today, we see:

  • Major decisions about pricing, like those recently at Netflix, continue to be made without any analytical support (or, the analytics were completely ignored).
  • Few people, to this day, really understand the numbers they’re looking at. (Few can explain the accurate definition of ‘time spent on site’, for instance).
  • Hundreds of thousands of firms continue to compete just fine without any analytics at all, without any consequences, and without the cost!

Specifically:

  • Business Intelligence systems are expensive and relatively hard to implement
  • The world moves very quickly, so, frequently, by the end of the third year of a major integration the infrastructure is out of date (Latency)
  • Most people experience data through Crystal Reports or SPSS viewers, which hardly inspire, and generally speaking, it’s a bad user experience

So, there’s good reason to say that BI failed, in the context of the expectations that were initially set.

The high expectations set for such BI systems frequently fail to materialize

And yet, BI is software engineering. And, failure is common in software engineering. Why would we expect 100% success in BI when the success rate in software engineering is so low? Why was this sold as a sure thing?

(Because people buy sure things.)

There have been failures. There have been expectations. There are plenty of scars to go around.

Moreover, everybody, including the people funding these projects, believed that a better dashboard would make them a better driver of a car.

The high expectations set for teams of people frequently fail to materialize

All too often, we expect that we only have to explain a concept once, and that an entire team of people will understand and retain that knowledge.

How many times have you explained the difference between a visit and a daily unique visitor? Or, what time spent on site really means?

It’s not that everybody is stupid or ignorant. Those traits tend to be normally distributed and they tend to cluster in areas of the economy where stupidity and ignorance thrive.

It’s certainly the case that learning takes effort and many people are lazy. It’s also the case that most people don’t spend all day working with this material. Is it any wonder that people forget? It’s not in their job description to remember anything specific.

We expected people to improve their numeracy. To be just as comfortable with a trend line as they are with a word processor. Most analytics professionals expected much better collective decision making.

We expected so much more of people.

What went right?

BI and web analytics made certain individuals a hell of a lot smarter. While the plural of anecdote isn’t evidence, I can say that it made Scott, a line manager I worked with in 1999, much smarter. He used analytics, in real time, to optimize the price of drink specials at the night club he managed. That’s right – contrary to the popular Strata laugh line – certain managers are really capable of making decisions in real time.

Centralized BI practices caused a massive reduction in the bullwhip effect in supply chain logistics. It also led to much more efficient use of warehousing space.

A lot went right. And it made a whole bunch of individuals a whole lot smarter.

What went wrong?

The promise of a new technology didn’t deliver all the benefits as expected. The possibility of failure wasn’t discussed, and the expected results – both in terms of a change in performance and decision making – weren’t fully realized.

Even those who are hyping Big Data Analytics, and those who are playing down Big Data Analytics, could agree on that.

Where does the assumption that better data causes better decisions come from?

I don’t know for sure.

But I can point to two possible sources:

  • The enlightenment and the scientific revolution
  • Robert McNamara and the Whiz Kid movement

The entire scientific method is predicated on data. A hypothesis is either accepted as truth or rejected as false based on the data. There is no other arbiter. Faith or strength of opinion has nothing to do with it. The data decides.

The assumption that greater knowledge causes greater outcomes flows from that fact. That is, if you’re honest about being wrong, everybody benefits. (Disturbingly, that trend may be reversing in The West, as negative findings have been disappearing from most disciplines.)

You might not be aware of it, but much of what we call Business Intelligence today really started taking off when Robert McNamara and the other whiz kids got back from the. Rule #6 from McNamara is “Get The Data”:

“Even if you don’t have the resources to access everything you need, start with what you have, even that data will show you where to go.”

The great grandfather of the discipline of Operations Research, the great trunk from which marketing science and information management branched off, is based on the assumption that data causes better decisions.

Is that assumption credible?

After all, that does seem to be at the root of those staunchly against Big Data Analytics.

Is the assumption that better data causes better decisions credible?

It depends.

If evidence to the contrary of an individual’s aspiration comes to light, and that individual refuses to update their expectations or aspiration, then even the most pristine, accurate, precise and real time data will fail to change their mind.

If evidence to the contrary of an individual’s aspiration comes to light, and that individual updates their expectations or aspirations accordingly, then it will be effective at changing their mind.

The key element that decides the effectiveness of data is the human.

Great data can cause great managers to make better decisions.

Great data doesn’t cure ignorance.

Maybe the broader commentary on the value of Big Data has more to do with optimism and pessimism about how our human systems change than it does with the ability of the technology to deliver.

Consider the following two, distilled, points of view:

Statement 1:

“Big Data Analytics is going to change the way we do business. Sure, a lot of it will be routine “I’m okay!” status updates from sensors, but making sense of the key parts of it, like “help me, I’m failing”, will be extremely useful. Companies that were previously exempt from competing on analytics will be disrupted by new entrants who will compete better, either by being more effective or being more efficient. Big Data Analytics is already having a disruptive impact in marketing, where it never used to before, and is gaining huge traction in medicine. There is reason to believe that Big Data Analytics will cause better decision making in the organizations that chose to invest both in the physical infrastructure and in the cultural infrastructure that’s required to truly succeed.”

Statement 2:

“All the big industries that rely on data already have Big Data. Airlines, casinos, Internet arbitrage firms, logistics firms and especially finance already have all the data they need. Indeed, all of that data has made them dumber, not smarter. In fact, even in those sectors, there’s little evidence that executives use all of that data to make substantially better decisions, especially when it comes to big strategic decision making. Did anybody see the economy after 2008? This is all just a second wind of hype coming from the Business Intelligence industry, which has so far failed to make anybody smarter. Don’t buy the hype. The companies that have long competed on analytics, since the 1960′s, have nothing new to learn from this next wave of Big Data Analytics. Just ask the line managers what they need, they’ll tell you.”

Four Questions

  • Did anything really go wrong with Business Intelligence generally and Web Analytics specifically?

It didn’t meet expectations. The technology failed often. The people failed often. Read part 2 for the expanded version.

  • Where does the assumption that better data causes better decisions come from?

It’s hard wired into the scientific method, and, more recently, into Operations Research. Read part 3 for the expanded version.

  • Is that assumption credible?

That depends on the people. Good evidence on good managers makes a difference. Good evidence on willfully ignorant managers is a waste. Read part 4 for the expanded version.

  • What questions really matter that would cause statement 1 to come true and mitigate the concerns expressed in statement 2?

It’s the attitude of the people using the data.

There are those that view Big Data Analytics as a tool for advancing their personal aspirations. For instance, “I really want to go to big fashion shows for free, so, I need to find evidence that a co-sponsorship with big fashion shows is really going to move our bottom line. Go find me that evidence and don’t come back with an answer to the contrary.”

There are those that view Big Data Analytics as a tool for advancing their personal aspirations. For instance, “I really want to increase gross revenue by 10%, so, I need to find a pathway and evidence to support that objective. I have a few questions about how the firm really makes money and from whom – go find me that evidence.”

Its greatest barrier isn’t really the technology. It’s the people.

What’s really different this time

This is the third effort in several to express this point of view. Here it is:

In 2000, to build a data warehouse to mine all the IRC and ICQ chat logs, you would be looking at a $30 million investment.

In 2012, to build a cloud to mine all the IRC and ICQ chat logs, you would be looking at a $300,000 investment.

The cloud, plus open source distributed computing technologies like Hadoop, plus the rise of a generation of data scientists who understand the power of decentralization and know how to use it, is what has changed. It has reduced the costs, increased the imagination, and is making possible a Cambrian explosion in startups.

There are big things happening on the technology side.

If North America is old enough to remember BI and really won’t change its attitude towards data and the way it makes decisions, if that’s what people who are in favor of Statement 2 are really saying, then that’s sad.

There’s a whole bunch of people in China, Brazil, Poland and India who are too young to remember.

Thanks for reading this five part series on Big Data Analytics. If you want to leave a comment, challenge an assertion, or raise a point, you can do that right below.

Testing Three Themes

Post frequency on the analytics focused blog, Eyes on Analytics, has increased to daily. In part, this is to solidify the understanding of the frequency-reach curve in blogging, and in part, it’s an attempt to understand where the broader market is at.

I’m testing three themes:

  • How to fight nature’s pesky way of inhibiting our ability to make clean causal statements.
  • The importance of imagination in identifying independent variables.
  • The role of evidence in decision making.

Simplification of a message is not pandering. However, many pandering statements are deliberate simplifications.

If your optimization objective is to gain followers:

  • Post often.
  • Post simply.
  • Post what people want to hear.

I’m choosing simplification while avoiding pandering.

Let’s see how that unfolds over the next 60 days.

 

Why don’t the campaign components add up?

Sometimes the components of a marketing channel will not add up to equal the total performance of the marketing channel. This is caused by any number of realities and limitations imposed in part by nature, and, in part, by you, the marketer.

Consider the following deliberately simple scenario:

March 2012 Impressions:

  • Total Digital Impressions Delivered: 100,000,000
  • Total Impressions with Chicken Creative: 25,000,000
  • Total Impressions with Beef Creative: 50,000,000
  • Total Impressions with Pork Creative: 75,000,000

Something doesn’t make sense. I’m telling you that 100,000,000 impressions were delivered in total, but the components of that figure, 25 million, 50 million, and 75 million, add up to 150 million.

That’s because creative can have multiple attributes. An ad may feature Chicken alone, Beef alone, or Pork alone. An ad may feature Beef with Pork. An ad may feature Chicken with Beef. An ad may feature Chicken with Pork. In a crazy twist, perhaps some creative features all three! (The madness!). Attributes can cause such complexity when it’s possible for a single thing to have multiple attributes.
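Here’s a minimal sketch of that overlap in Python. The split of impressions across individual creatives is hypothetical (the scenario only gives the three attribute totals and the overall total); I picked numbers that reproduce those totals, so treat it as an illustration and not as the actual campaign data.

```python
# Hypothetical impressions (in millions) per creative, keyed by the set of
# attributes each creative features. The splits are invented for illustration;
# only the resulting totals come from the scenario in the text.
impressions_by_creative = {
    frozenset({"Chicken"}): 10,
    frozenset({"Beef"}): 10,
    frozenset({"Pork"}): 35,
    frozenset({"Chicken", "Beef"}): 5,
    frozenset({"Chicken", "Pork"}): 5,
    frozenset({"Beef", "Pork"}): 30,
    frozenset({"Chicken", "Beef", "Pork"}): 5,
}


def total_impressions(by_creative):
    """Each creative's impressions are counted exactly once."""
    return sum(by_creative.values())


def attribute_totals(by_creative):
    """A creative's impressions are counted once per attribute it features."""
    totals = {}
    for attributes, millions in by_creative.items():
        for attribute in attributes:
            totals[attribute] = totals.get(attribute, 0) + millions
    return totals


total = total_impressions(impressions_by_creative)
per_attribute = attribute_totals(impressions_by_creative)

print("Total delivered (millions):", total)
for attribute in sorted(per_attribute):
    print(attribute, per_attribute[attribute])
print("Sum of attribute totals (millions):", sum(per_attribute.values()))
```

The channel total counts every impression once. The attribute totals count an impression once for every attribute on the creative, which is why they sum to more than the total.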

The next scenario demonstrates complications that arise because of instrumentation:

March 2012 Impressions:

  • Total Digital Impressions Delivered: 100,000,000
  • Total Impressions served to Males: 60,000,000
  • Total Impressions served to Females: 10,000,000
  • Total Impressions likely served to 35 to 50 year olds: 1,000,000

All people have attributes, but not all people have attributes that can be measured.

It might very well be that for the XBOX Live component, Microsoft can report with greater certainty, owing to profile information, that the content was served to more males. And, because that particular app was geared towards males, there’s greater certainty on that end. It also might be the case that another component ran on mommy blogger ad networks where the ad targeter was really ethical and wasn’t uniquely tracking everybody. So, the ‘missing 30 million impressions’ (100 million, less 60 million, less 10 million) aren’t missing. They just couldn’t be attributed.

The same goes for the age component. We may hypothesize because of Quantcast data that those impressions served on mommy blog networks were heavily 35 to 50 year old females, but, there’s nothing in the instrumentation itself that confirms that hypothesis.

Just because it may be measurable doesn’t guarantee that it will be measured.

Finally, consider the complexity imposed by time:

March 2012 Impressions:

  • Total Digital Impressions Delivered: 100,000,000
  • Total Impressions from Affiliate Program: 10,000,000
  • Total Impressions from the RayRayHayHay campaign: 8,000,000
  • Total Impressions from the A campaign: 1,000,000
  • Total Impressions from the Eh campaign: 1,000,000

Well, CLEARLY the A campaign and the Eh campaign failed – since the affiliates didn’t use those creative treatments much at all. What we don’t know is time.

  • Date the RayRayHayHay campaign creative was posted: January 5, 2012
  • Date the A campaign creative was posted: March 1, 2012
  • Date the Eh campaign creative was posted: March 28, 2012

That’s 1 million impressions served in 4 days (March 28 through March 31) for the Eh campaign. That’s 1 million impressions served in 31 days for the A campaign.
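A rough sketch of the same comparison as a daily rate. It assumes the posting date itself counts as a day of service and that every campaign ran through March 31; neither detail is stated in the scenario, and the function and variable names are mine.

```python
from datetime import date

MONTH_START = date(2012, 3, 1)
MONTH_END = date(2012, 3, 31)

# (date the creative was posted, March 2012 impressions) from the scenario.
campaigns = {
    "RayRayHayHay": (date(2012, 1, 5), 8_000_000),
    "A": (date(2012, 3, 1), 1_000_000),
    "Eh": (date(2012, 3, 28), 1_000_000),
}


def active_days_in_month(posted, start=MONTH_START, end=MONTH_END):
    """Days within the reporting month on which the creative could serve.

    Assumption: the posting date counts as a full day, and the creative
    stays live through the end of the month.
    """
    first_day = max(posted, start)
    return (end - first_day).days + 1


def impressions_per_day(campaign_data):
    return {
        name: impressions / active_days_in_month(posted)
        for name, (posted, impressions) in campaign_data.items()
    }


rates = impressions_per_day(campaigns)
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    days = active_days_in_month(campaigns[name][0])
    print(f"{name}: {days} active days, {rate:,.0f} impressions per day")
```

On a per-day basis the Eh campaign looks nothing like a failure. The monthly total hides that because the month is an arbitrary window.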

Such component analysis is made particularly tricky when we’re trying to do it using a monthly report or some other arbitrary unit of time.

In sum:

Channel performance analysis is not channel component analysis. These are two distinct types of analytics, aimed at answering two different classes of questions. For the reasons listed above, attribute overlap, instrumentation limitations, and time, the sum of the components may not add up to the total. This is not a devastating realization if you understand the differences and how to think of them.

There’s a general optimistic sense that drillability, the ability to drill into any metric and see its components, is possible in all contexts. It is possible in some contexts. It is not possible in all contexts. Privacy and technical disruption impose long run constraints on ever being able to achieve that.

It’s not likely to be perfect any time soon, and, in some cases, the components won’t ever add up.

***

(Note to fellow analysts: I chose impressions to keep it really simple. On-site and post-click analysis is required. Statistical analysis exists for a reason, so, even armed with impression and CTR data, you may analyze performance across multiple attributes. Moreover, you ought to be aware of the biases that exist in your data set – is it the case that males really did respond better, or, is it the case that the instrumentation is just better at identifying males?)

 

Who’s Downvoting You On Reddit?

So who keeps on downvoting you on Reddit? We’ll find out.

But first – three notes:

  • You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is.
  • To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here.
  • The file contains three columns – a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL) or even what was the nature of the content they were upvoting and downvoting.

So, who’s downvoting you on reddit?

To find out, I took that huge file and transformed it into another one – boiling it down into a single username, how many times that username voted (numberofvotes), and the average of all their votes (averagevote).

You can see below that _mike voted 26 times, and, if you take the average of all his votes, +1 for an upvote and -1 for a downvote, it turns out to be -.92. Basically, _mike didn’t like a lot of what he saw. In fact, _mike upvoted once (+1) and downvoted 25 times (-25). So (- 25) + (+1) is -24, and -24/26 is -.92.
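For anyone who wants to reproduce that boiling-down step, here’s a minimal sketch. It assumes each record is a (vote, userid, link) tuple with the vote already coded as +1 or -1; the real file’s encoding may differ, and the link names below are placeholders.

```python
from collections import defaultdict


def summarize_votes(rows):
    """Collapse (vote, userid, link) rows into one summary per username.

    Assumption: vote is +1 for an upvote and -1 for a downvote.
    """
    counts = defaultdict(int)
    sums = defaultdict(int)
    for vote, userid, _link in rows:
        counts[userid] += 1
        sums[userid] += int(vote)
    return {
        userid: {
            "numberofvotes": counts[userid],
            "averagevote": sums[userid] / counts[userid],
        }
        for userid in counts
    }


# Rebuild the _mike example: 1 upvote and 25 downvotes (placeholder links).
rows = [(1, "_mike", "link0")]
rows += [(-1, "_mike", f"link{i}") for i in range(1, 26)]

summary = summarize_votes(rows)
print(summary["_mike"]["numberofvotes"])          # 26
print(round(summary["_mike"]["averagevote"], 2))  # -0.92
```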

There are over 30,000 usernames here – and that’s a lot of data. It’s really important to visualize the data before you really get into any analysis. One way to do that is to run a histogram.

To read the histogram below, remember:

  • Frequency means ‘the number of usernames that fall into this category or range’.
  • Numberofvotes means ‘the number of times a username voted.’
  • Mean is another word for average.

There are three takeaways from the histogram above:

  • The average number of votes by a username was 234.
  • A large number of usernames didn’t vote very many times at all.
  • There are bumps at 1000 and 2000 votes. (If you’re interested as to why – see the Methodological notes. Incidentally – this is why you should always visualize your data.)

A histogram is built from a Frequency Table, which we’ll see below.

The way to read a frequency table is:

  • The ‘Valid’ column means ‘how many times a username voted’.
  • Frequency means ‘the number of usernames that fall into this category’.
  • Percent means ‘the percentage of all the usernames that those in this category represents’.

There are three takeaways from the Frequency Table above:

  • 4877 of the usernames only voted one time (It’s likely they submitted a single link and never returned).
  • Note how both the percentages and number of usernames in each category decrease.
  • 50.1% of all the usernames voted 20 times or less. (Look at the cumulative percent column and make sure that makes sense to you. We’re going to use this column later.)

You may have heard the term ‘long tail’ many times before. This is a demonstration of what that means. The bars on the histogram fall away to the right.

Recall that the average of all the votes a username made is called ‘averagevote’. If somebody was persistently downvoting links, they’d have a negative number. If they upvoted everything they saw, they’d have an averagevote of +1.

Read the histogram below.  The three takeaways are:

  • Negativity follows a reverse long tail. (It really happens – see how the figures fall away to the left)
  • On average, usernames upvoted what they saw (average 0.79).
  • There are bumps at 0 (related to a methodological note) and at -1.

By now, two of my good friends in London are screaming at the screen. Means are a horrible way to explain long tail distributions. You can see that now too. Means are giving us a pretty skewed view of the world.

The table below is a byproduct of our Frequency table. It’s aptly labeled ‘Statistics’, and compares these two variables, numberofvotes, and averagevote, side by side. I’ve thrown a yellow box around ‘percentiles’. Recall the cumulative column from the previous frequency table.

  • 22.8% of all usernames voted 2 times or less.
  • 40.8% voted 9 times or less.

The program I’m using is giving me ‘break points’ for those percentiles.

Two takeaways:

  • The median gives a better summary of what’s going on here – half of the usernames voted 20 times or less, and, another set of usernames always upvoted what they saw.
  • If I know that roughly 80% of all usernames posted 325 times or less, then I know that 20% of the usernames in my sample posted 325 times or more.
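To see why the mean misleads on a long tail, here’s a small sketch using Python’s statistics module. The ten vote counts are made up to have a long-tail shape; they are not drawn from the Reddit file. The cut points come from statistics.quantiles, whose interpolation may not match the program used for the tables in this post.

```python
import statistics

# Made-up, long-tailed sample of votes per username (not the Reddit data).
votes_per_user = [1, 1, 1, 2, 3, 5, 8, 20, 150, 2000]

mean_votes = statistics.mean(votes_per_user)
median_votes = statistics.median(votes_per_user)

# Four cut points that split the sample into five equal-sized groups.
cut_points = statistics.quantiles(votes_per_user, n=5)

# How many usernames fall below the mean?
below_mean = sum(1 for votes in votes_per_user if votes < mean_votes)

print("mean:", mean_votes)
print("median:", median_votes)
print("quintile cut points:", cut_points)
print(f"{below_mean} of {len(votes_per_user)} usernames are below the mean")
```

One heavy voter drags the mean far above what a typical username does. The median and the percentile cut points aren’t pulled around that way.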

We’re going to use those percentile cutoff points to inform a segmentation, next.

Segmentation

A segmentation is a grouping of records, usually people, into categories. There is no prescription for how to do this. If you talk to a modeller, they’ll tell you about their clustering algorithms. If you talk to a machine learning scientist, they’ll tell you about bump-hunting or unsupervised machine learning clustering. Those are all very good algorithms. I use them myself.

I’m going for simplicity here. I have these four percentile cut-off points that evenly cut people into five categories. And, for further simplicity, instead of referring to a group of people who posted between 9 and 48 times as ‘those who posted between 9 and 48 times’, I’m going to call them Average-Andy’s. And I’ll just keep on calling them that.

At this point, I don’t know if they’re male or female. (And we won’t in this thread). And it’s controversial to use alliteration. But it’s done.

So, mapping the percentiles against a segmentation, based on how many times a username voted, we have:

  • 1 time: One-Time-Oliver
  • 2 to 9 times: Vanity-Vanessa
  • 9 to 48 times: Average-Andy
  • 48 to 325 times: Frequent-Fred
  • More than 325 times: Power-Pauline
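The mapping can be written as a small function. Note that the ranges above share their endpoints (9 and 48 each appear in two segments), so this sketch assumes a boundary value belongs to the lower segment, in line with reading the percentiles as ‘voted N times or less’. That is my reading, not something the table confirms.

```python
def segment(numberofvotes):
    """Map a username's vote count to a segment name.

    Assumption: shared boundary values (9, 48, 325) fall in the lower
    segment, matching the 'N times or less' reading of the percentiles.
    """
    if numberofvotes < 1:
        raise ValueError("a username in this dataset has at least one vote")
    if numberofvotes == 1:
        return "One-Time-Oliver"
    if numberofvotes <= 9:
        return "Vanity-Vanessa"
    if numberofvotes <= 48:
        return "Average-Andy"
    if numberofvotes <= 325:
        return "Frequent-Fred"
    return "Power-Pauline"


for votes in (1, 2, 9, 26, 48, 49, 325, 326, 2000):
    print(votes, segment(votes))
```

With 26 votes, _mike from earlier lands in the Average-Andy segment.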

Take a look at the result below – a variable I’m calling ‘equalseg’ – short for ‘equal segmentation’.

Takeaways:

  • There are 4877 One-Time-Olivers, representing 15.5% of the usernames in the sample.
  • Vanity-Vanessa’s represent 23.9% of the usernames.
  • The last three segments are pretty equally divided – the first two are more lopsided.

Even though I aimed to have five groups of people with equal numbers in each, you can see the division between One-Time-Olivers and Vanity-Vanessa’s is off. This happens very often when segmenting a long tail into equal groups. And, while not ideal, it’s okay for our purposes.

Next, we’re going to examine each segment individually.

One-Time-Olivers

There are very efficient ways that statisticians quickly summarize and understand the relationship among variables. The aim here isn’t to be efficient – but to be clear. In that spirit, I give you the histogram below.

Takeaways:

  • All 4877 One-Time-Olivers voted exactly one time.

You should lol. It makes sense though, right? And, the segment name should make a lot more sense.

The histogram below summarizes how, on average, One-Time-Olivers voted – positive or negative. Since they only voted one time, it’s either an upvote, or a downvote. A +1 or -1 average.

 Takeaways:

  • One-Time-Oliver’s tend to upvote once, and are never heard from again.
  • In answering the question – “Who’s downvoting you on Reddit”, it isn’t One-Time-Olivers.

 

 Vanity Vanessa

Vanity accounts frequently enter Reddit, they flicker, and they go out. They get discouraged. They never really commit to the bit. That’s what happens to them. The histogram below takes on that familiar long-tail curve.

Takeaways:

  • There are a lot of Vanity-Vanessa’s, some 7,527 of them.
  • Most of them posted only 2, 3, or 4 times.

So, how did they vote?

The histogram below summarizes the story:

 

Takeaways:

  • Vanity-Vanessa’s upvoted nearly everything they saw, with very few exceptions.
  • Very few persistently downvoted everything they saw.
  • They’re not the ones downvoting you on Reddit.

 

Average-Andy’s

Recall that the average username votes 234 times, and yet, I still labeled Average-Andy, ranging between 9 and 48 votes, as average. That’s because the mean number of votes that Average-Andy’s cast is 22.25 – which is close to the median of 20 for the entire set.

This mixing and abstraction of median, mean, and segmentation isn’t something that I expect most people to consider or think about, but I can foresee some getting hung up on it. When you think about an equal segmentation though, it makes sense that the mean of your middle category should be close to the median of the entire set.

For everybody else – just know that you’re looking at the “average joe redditor” here.

Takeaways:

  • Average number of votes is 22.25, close to the median of 20 for the whole set.
  • Familiar long tail.

How do they vote?

Takeaways:

  • A majority of Average Andy’s liked everything they saw – they upvoted everything.
  • They downvote more often than Vanity-Vanessa’s or One-Time-Oliver’s, but not massively.
  • They aren’t downvoting heavily enough to say that these are the ones downvoting you on Reddit.

 

Frequent Fred

By now you’re pretty much a pro at reading these histograms. Frequent Fred’s vote frequently. Look at the histogram below.

 

Takeaways:

  • Classic long-tail continues.
  • Averaging 139.3 votes.
  • The unusual bump at the beginning of the series is just magnified by the scale from the previous vote frequency histogram. (It’s fine).

How do they vote?

Takeaways:

  • Far fewer of them are likely to upvote absolutely everything they see.
  • There’s significant flattening of the long tail – the average is .74.
  • More of them, on average, are disposed to downvoting.

Power Paulines

Power Paulines are the most difficult group to analyze, but the easiest to summarize and understand. Take a look at the histogram below.

Takeaways:

  • The long tail is holding – there’s significant clustering at 1000 and 2000.
  • The cause is related to rate limiting within the Reddit API.
  • The longest part of the long tail – those power users with thousands and thousands of votes, are all bundled and clustered together at 2000.
  • There are around 500 of such power users, representing some 1.5% of the total usernames.

So how do they vote?

 Takeaways:

  • The bump at 0 is caused by 1000 upvotes getting averaged out by 1000 downvotes.
  • 0′s aside, which are tugging on the mean, Pauline’s are on average more prone to downvoting.
  • Power Paulines are downvoting you on Reddit.

 

Putting a bow on it

The chart below summarizes the relationship between segment and their average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes – even if the mean is exaggerated in the Power Pauline segment.

To really hammer the point home about the origin of downvotes, take a look at the table below. It’s broken out by the segments you understand. It also contains two new variables – upvotes and downvotes. That is the total count of the number of upvotes and downvotes made by each segment.

Takeaways:

  • One-Time Olivers as a group were responsible for 175 of all the downvotes cast.
  • Vanity-Vanessa’s as a group were responsible for 1781 of all the downvotes cast.
  • Average-Andy’s as a group were responsible for 13,258 of all the downvotes cast.
  • Frequent-Freds as a group were responsible for 120,758 of all the downvotes cast.
  • Power-Paulines as a group were responsible for 1,672,368 of all the OBSERVED downvotes cast – but are probably responsible for a lot more in aggregate across all of Reddit. (This sample contains a bias, but bias doesn’t mean I can’t say anything at all about anything.)

Note the differences in order of magnitude between each group. 1781 is roughly 10 times greater than 175. And so on, a bit imperfectly, on the way up through the segments. There’s an order of magnitude difference here in terms of the amount of weight each group casts.
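The step-ups are easy to check with a few lines of arithmetic. The downvote counts are the ones in the takeaways above; the share calculation only covers the observed downvotes in this sample.

```python
# Observed downvotes per segment, taken from the takeaways above.
downvotes = {
    "One-Time-Oliver": 175,
    "Vanity-Vanessa": 1_781,
    "Average-Andy": 13_258,
    "Frequent-Fred": 120_758,
    "Power-Pauline": 1_672_368,
}

total_downvotes = sum(downvotes.values())

# Ratio of each segment's downvotes to the segment just below it.
segments = list(downvotes)
step_ratios = {
    segments[i]: downvotes[segments[i]] / downvotes[segments[i - 1]]
    for i in range(1, len(segments))
}

power_share = downvotes["Power-Pauline"] / total_downvotes

print("total observed downvotes:", total_downvotes)
for name, ratio in step_ratios.items():
    print(f"{name}: {ratio:.1f}x the previous segment")
print(f"Power-Pauline share of observed downvotes: {power_share:.1%}")
```

Each step is somewhere between roughly 7 and 14 times the one before it, and the Power-Pauline segment alone accounts for over 90% of the observed downvotes.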

The greatest power users of Reddit are the ones who are downvoting you – and it’s an exponential power.

 

But wait, there’s more.

Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that.

Takeaways:

  • Upvotes outnumber downvotes.
  • The interface of Reddit itself causes upvotes to accumulate.
  • Reddit itself is a cause of a bias – probably by design.

The histogram below is by links – the content getting upvoted or downvoted. There were just over 2 million links submitted. On average, each link received 3.62 upvotes. Given everything you know about long tails, think about just how deceptive that 3.62 mean figure is. Note how you can’t even see the bumps in the tail. And be in awe of the efficiency of the collective Reddit behavior that causes popular content to be disproportionately promoted while even ‘good’ or ‘average’ content gets relentlessly shifted to the left – all by a very small group of people.

Takeaways:

  • The long tail is long and powerful.
  • This small group of Power-Paulines is far more likely to downvote because of a much higher frequency of use.

I’m thanking Reddit for making so many API’s publicly exposed and enabling this sort of analysis and exploration. Thank you.

 

Portions of this post appeared on Eyes On Analytics the week of February 5, 2012.

Commentary on the proposed telescreens

You may have read something about the Samsung 7500 and 8000 series televisions, the ones with a camera installed in them, over the past few days.

The tl;dr summary:

“For Samsung’s 7500 and 8000 series TVs, all you have to do is say “Hi, TV,” when you walk into a room for the TV to turn on and know who’s there.”

“Think of it: The tech means an advertiser or TV programmer could, for the first time, know which members of a Nielsen household are watching a show or an ad. Cisco has even developed a system meant to read facial expressions and determine whether you’re entertained or bored.”

“Many people in the living room are multitasking with other devices. “We’re paying for that,” said Rex Harris, innovations supervisor at SMGX, a unit of ad agency holding company Publicis Groupe. “If you’re looking at other screens, then you’re not paying attention. We would like to know if we’re getting accurate impressions.”

Commentary:

Alright – so – a simple innovation, the webcam, is jumping from the PC/DVR into a TV, and we get a few folks who come out and speculate what it could mean. It all ends up sounding like a 1984 telescreen idea, which, I’m 99% certain, is not what Samsung has/had in mind.

Broadcast isn’t digital.

Repeat: broadcast. isn’t. digital.

This has implications:

  • There is enough inventory for targeted ads and offers in digital because the technology enables the creation of multiple ad treatments at scale. No such technology exists in the broadcast industry.
  • People already effectively segment themselves by TV show preference.
  • On Demand technologies like Netflix, and time shifting technologies like streaming and DVR’s, are already eroding the concentration of key market segments.
  • Plot the S-curve adoption rate of the technologies driving market fragmentation against the adoption of new, Big-Brother enabled telescreens, and see which wins. (Hint: it’s time shifting and on-demand).
  • You’re paying for junk impressions because we’re developing ad blindness, just like we’ve developed banner blindness.

No amount of surveillance is going to change that fact.