Who’s Downvoting You On Reddit?

So who keeps on downvoting you on Reddit? We’ll find out.

But first – three notes:

  • You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is.
  • To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here.
  • The file contains three columns – a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL) or even what was the nature of the content they were upvoting and downvoting.

So, who’s downvoting you on reddit?

To find out, I took that huge file and transformed it into another one – boiling it down into a single user name, how many times that username voted (numberofvotes), and the average of all their votes (averagevote).

You can see below that _mike voted 26 times, and, if you take the average of all his votes, +1 for an upvote and -1 for a downvote, it turns out to be -.92. Basically, _mike didn’t like a lot of what he saw. In fact, _mike upvoted once (+1) and downvoted 25 times (-25). So (- 25) + (+1) is -24, and -24/26 is -.92.
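The boiling-down step can be sketched in a few lines of Python. This is a minimal illustration, not the original processing script: the row layout (vote, userid, link) follows the three columns described in the notes, while the sample rows, the link names, and the `summarize` function are made up for the example.

```python
from collections import defaultdict

def summarize(rows):
    """Collapse (vote, userid, link) rows into per-user vote counts and average votes."""
    totals = defaultdict(lambda: [0, 0])  # userid -> [numberofvotes, sum of votes]
    for vote, userid, _link in rows:
        totals[userid][0] += 1
        totals[userid][1] += vote
    return {
        userid: {"numberofvotes": count, "averagevote": vote_sum / count}
        for userid, (count, vote_sum) in totals.items()
    }

# Hypothetical rows reproducing the _mike example: 1 upvote and 25 downvotes.
rows = [(1, "_mike", "link_0")] + [(-1, "_mike", f"link_{i}") for i in range(1, 26)]

summary = summarize(rows)
print(summary["_mike"]["numberofvotes"])          # 26
print(round(summary["_mike"]["averagevote"], 2))  # -0.92
```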

There are over 30,000 usernames here – and that’s a lot of data. It’s really important to visualize the data before you really get into any analysis. One way to do that is to run a histogram.

To read the histogram below, remember:

  • Frequency means ‘the number of usernames that fall into this category or range’.
  • Numberofvotes means ‘the number of times a username voted.’
  • Mean is another word for average.

There are three takeaways from the histogram above:

  • The average number of votes by a username was 234.
  • A large number of usernames didn’t vote very many times at all.
  • There are bumps at 1000 and 2000 votes. (If you’re interested as to why – see the Methodological notes. Incidentally – this is why you should always visualize your data.)

A histogram is built from a Frequency Table, which we’ll see below.

The way to read a frequency table is:

  • The ‘Valid’ column means ‘how many times a username voted’.
  • Frequency means ‘the number of usernames that fall into this category’.
  • Percent means ‘the percentage of all the usernames that those in this category represent’.

There are three takeaways from the Frequency Table above:

  • 4877 of the usernames only voted one time (It’s likely they submitted a single link and never returned).
  • Note how both the percentages and number of usernames in each category decrease.
  • 50.1% of all the usernames voted 20 times or less. (Look at the cumulative percent column and make sure that makes sense to you. We’re going to use this column later.)
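For readers who want to see how a frequency table and its cumulative percent column are put together, here is a small sketch. The ten vote counts are invented for illustration and are not taken from the Reddit dataset; `frequency_table` is a hypothetical helper, not the output of the statistics program used for this post.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0  # usernames counted so far, used for the cumulative column
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], 100 * counts[value] / total, 100 * running / total))
    return rows

# Hypothetical numberofvotes values for ten usernames (not the real data).
numberofvotes = [1, 1, 1, 2, 2, 3, 5, 8, 20, 40]

for value, freq, pct, cum_pct in frequency_table(numberofvotes):
    print(f"{value:>5} {freq:>3} {pct:6.1f} {cum_pct:6.1f}")
```

In this toy table, the cumulative percent reaches 50.0 at the row for 2 votes, which reads as "half of these usernames voted 2 times or less".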

You may have heard the term ‘long tail’ many times before. This is a demonstration of what that means. The bars on the histogram fall away to the right.

Recall that the average of all the votes a username made is called ‘averagevote’. If somebody was persistently downvoting links, they’d have a negative number. If they upvoted everything they saw, they’d have an averagevote of +1.

Read the histogram below.  The three takeaways are:

  • Negativity follows a reverse long tail. (It really happens – see how the figures fall away to the left.)
  • On average, usernames upvoted what they saw (average 0.79).
  • There are bumps at 0 (related to a methodological note) and at -1.

By now, two of my good friends in London are screaming at the screen. Means are a horrible way to explain long tail distributions. You can see that now too. Means are giving us a pretty skewed view of the world.
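A quick way to see why the mean misleads on a long tail is to compare it with the median on a small, skewed list. The numbers below are hypothetical and chosen only to have the same shape as the vote counts: many light voters and a couple of very heavy ones.

```python
import statistics

# Hypothetical long-tailed vote counts: many light voters, a couple of heavy ones.
numberofvotes = [1, 1, 2, 3, 5, 9, 20, 48, 325, 2000]

print(statistics.mean(numberofvotes))    # 241.4
print(statistics.median(numberofvotes))  # 7.0
```

Two heavy voters drag the mean far above what a typical username in this list actually did, while the median stays with the bulk of the list.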

The table below is a byproduct of our Frequency table. It’s aptly labeled ‘Statistics’, and compares these two variables, numberofvotes and averagevote, side by side. I’ve thrown a yellow box around ‘percentiles’. Recall the cumulative column from the previous frequency table.

  • 22.8% of all usernames voted 2 times or less.
  • 40.8% voted 9 times or less.

The program I’m using is giving me ‘break points’ for those percentiles.

Two takeaways:

  • The median gives a better summary of what’s going on here – half of the usernames voted 20 times or less, and, another set of usernames always upvoted what they saw.
  • If I know that roughly 80% of all usernames voted 325 times or less, then I know that roughly 20% of the usernames in my sample voted more than 325 times.
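The percentile break points can be sketched as follows. This assumes the simple nearest-rank definition of a percentile; the statistics program used for this post may interpolate differently, so its break points need not match this method exactly. The data and the `percentile_breakpoint` helper are hypothetical.

```python
def percentile_breakpoint(values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of the data at or below it."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]

# Hypothetical long-tailed vote counts (not the real data).
numberofvotes = [1, 1, 2, 3, 5, 9, 20, 48, 325, 2000]

cutoffs = [percentile_breakpoint(numberofvotes, p) for p in (20, 40, 60, 80)]
print(cutoffs)  # [1, 3, 9, 48]
```

Four cut-off points at the 20th, 40th, 60th, and 80th percentiles are what split a sample into five roughly equal groups.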

We’re going to use those percentile cutoff points to inform a segmentation, next.

Segmentation

A segmentation is a grouping of records, usually people, into categories. There is no prescription for how to do this. If you talk to a modeller, they’ll tell you about their clustering algorithms. If you talk to a machine learning scientist, they’ll tell you about bump-hunting or unsupervised machine learning clustering. Those are all very good algorithms. I use them myself.

I’m going for simplicity here. I have these four percentile cut-off points that evenly cut people into five categories. And, for further simplicity, instead of referring to a group of people who voted between 9 and 48 times as ‘those who voted between 9 and 48 times’, I’m going to call them Average-Andy’s. And I’ll just keep on calling them that.

At this point, I don’t know if they’re male or female. (And we won’t in this thread). And it’s controversial to use alliteration. But it’s done.

So, mapping the percentiles against a segmentation, based on how many times a username voted, we have:

  • 1 time: One-Time-Oliver
  • 2 to 9 times: Vanity-Vanessa
  • 9 to 48 times: Average-Andy
  • 48 to 325 times: Frequent-Fred
  • More than 325 times: Power-Pauline
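The mapping above can be written as a small function. One assumption is needed: the listed ranges share their end points (9 and 48 each appear in two ranges), so this sketch treats every upper bound as inclusive, in line with the ‘9 times or less’ wording of the percentiles. The `segment` function name is made up for the example.

```python
def segment(numberofvotes):
    """Map a username's vote count to a segment name (upper bounds assumed inclusive)."""
    if numberofvotes <= 1:
        return "One-Time-Oliver"
    if numberofvotes <= 9:
        return "Vanity-Vanessa"
    if numberofvotes <= 48:
        return "Average-Andy"
    if numberofvotes <= 325:
        return "Frequent-Fred"
    return "Power-Pauline"

for votes in (1, 9, 10, 48, 325, 326):
    print(votes, segment(votes))
```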

Take a look at the result below – a variable I’m calling ‘equalseg’ – short for ‘equal segmentation’.

Takeaways:

  • There are 4877 One-Time-Olivers, representing 15.5% of the usernames in the sample.
  • Vanity-Vanessa’s represent 23.9% of the usernames.
  • The last three segments are pretty equally divided – the first two are more lopsided.

Even though I aimed to have five groups of people with equal numbers in each, you can see the division between One-Time-Olivers and Vanity-Vanessa’s is off. This happens very often when segmenting a long tail into equal groups. And, while not ideal, it’s okay for our purposes.

Next, we’re going to examine each segment individually.

One-Time-Olivers

There are very efficient ways that statisticians quickly summarize and understand the relationship among variables. The aim here isn’t to be efficient – but to be clear. In that spirit, I give you the histogram below.

Takeaways:

  • All 4877 One-Time-Olivers voted exactly one time.

You should lol. It makes sense though, right? And, the segment name should make a lot more sense.

The histogram below summarizes how, on average, One-Time-Olivers voted – positive or negative. Since they only voted one time, it’s either an upvote, or a downvote. A +1 or -1 average.

 Takeaways:

  • One-Time-Oliver’s tend to upvote once, and are never heard from again.
  • In answering the question – “Who’s downvoting you on Reddit”, it isn’t One-Time-Olivers.

 

 Vanity Vanessa

Vanity accounts frequently enter Reddit, they flicker, and they go out. They get discouraged. They never really commit to the bit. That’s what happens to them. The histogram below takes on that familiar long-tail curve.

Takeaways:

  • There are a lot of Vanity-Vanessa’s, some 7,527 of them.
  • Most of them posted only 2, 3, or 4 times.

So, how did they vote?

The histogram below summarizes the story:

 

Takeaways:

  • Vanity-Vanessa’s upvoted nearly everything they saw, with very few exceptions.
  • Very few persistently downvoted everything they saw.
  • They’re not the ones downvoting you on Reddit.

 

Average-Andy’s

Recall that the average username votes 234 times, and yet I still labeled the segment ranging between 9 and 48 votes as Average-Andy. That’s because the mean number of votes that Average-Andy’s cast is 22.25 – which is close to the median of 20 for the entire set.

This mixing and abstraction of median, mean, and segmentation isn’t something that I expect most people to consider or think about, but I can foresee some getting hung up on it. When you think about an equal segmentation though, it makes sense that the mean of your middle category should be close to the median of the entire set.

For everybody else – just know that you’re looking at the “average joe redditor” here.

Takeaways:

  • Average number of votes is 22.25, close to the median of 20 for the whole set.
  • Familiar long tail.

How do they vote?

Takeaways:

  • A majority of Average-Andy’s liked everything they saw – they upvoted everything.
  • They downvote more often than Vanity-Vanessa’s or One-Time-Oliver’s, but not massively.
  • They don’t downvote heavily enough to say that these are the ones downvoting you on Reddit.

 

Frequent Fred

By now you’re pretty much a pro at reading these histograms. Frequent Fred’s vote frequently. Look at the histogram below.

 

Takeaways:

  • Classic long-tail continues.
  • Averaging 139.3 votes.
  • The unusual bump at the beginning of the series is just magnified by the scale from the previous vote frequency histogram. (It’s fine).

How do they vote?

Takeaways:

  • Far fewer of them are likely to upvote absolutely everything they see.
  • There’s significant flattening of the long tail – the average is .74.
  • More of them, on average, are disposed to downvoting.

Power Paulines

Power Paulines are the most difficult group to analyze, but the easiest to summarize and understand. Take a look at the histogram below.

Takeaways:

  • The long tail is holding – there’s significant clustering at 1000 and 2000.
  • The cause is related to rate limiting within the Reddit API.
  • The longest part of the long tail – those power users with thousands and thousands of votes, are all bundled and clustered together at 2000.
  • There are around 500 of such power users, representing some 1.5% of the total usernames.

So how do they vote?

 Takeaways:

  • The bump at 0 is caused by 1000 upvotes getting averaged out by 1000 downvotes.
  • 0’s aside, which are tugging on the mean, Paulines are on average more prone to downvoting.
  • Power Paulines are downvoting you on Reddit.

 

Putting a bow on it

The chart below summarizes the relationship between segment and their average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes – even if the mean is exaggerated in the Power Pauline segment.

To really hammer the point home about the origin of downvotes, take a look at the table below. It’s broken out by the segments you understand. It also contains two new variables – upvotes and downvotes. That is the total count of the number of upvotes and downvotes made by each segment.

Takeaways:

  • One-Time Olivers as a group were responsible for 175 of all the downvotes cast.
  • Vanity-Vanessa’s as a group were responsible for 1781 of all the downvotes cast.
  • Average-Andy’s as a group were responsible for 13,258 of all the downvotes cast.
  • Frequent-Freds as a group were responsible for 120,758 of all the downvotes cast.
  • Power-Paulines as a group were responsible for 1,672,368 of all the OBSERVED downvotes cast – but are probably responsible for a lot more in aggregate across all of Reddit. (This sample contains a bias, but bias doesn’t mean I can’t say anything at all about anything.)

Note the differences in order of magnitude between each group. 1781 is roughly 10 times greater than 175. And so on, a bit imperfectly, on the way up to Frequent-Fred’s. There’s roughly an order of magnitude difference in the amount of weight each successive group casts.
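The step-up between segments is easy to check from the downvote counts listed in the takeaways. The sketch below only does arithmetic on those five figures; the variable names are made up for the example.

```python
# Downvote totals per segment, copied from the takeaways above.
downvotes = {
    "One-Time-Oliver": 175,
    "Vanity-Vanessa": 1781,
    "Average-Andy": 13258,
    "Frequent-Fred": 120758,
    "Power-Pauline": 1672368,
}

names = list(downvotes)
ratios = [
    round(downvotes[names[i + 1]] / downvotes[names[i]], 1)
    for i in range(len(names) - 1)
]
total = sum(downvotes.values())
pauline_share = round(100 * downvotes["Power-Pauline"] / total, 1)

print(ratios)         # [10.2, 7.4, 9.1, 13.8]
print(total)          # 1808340
print(pauline_share)  # 92.5
```

Each segment casts somewhere between 7 and 14 times as many downvotes as the one before it, and Power-Paulines account for roughly 92.5% of the downvotes observed in this sample.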

The greatest power users of Reddit are the ones who are downvoting you – and it’s an exponential power.

 

But wait, there’s more.

Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that.

Takeaways:

  • Upvotes outnumber downvotes.
  • The interface of Reddit itself causes upvotes to accumulate.
  • Reddit itself is a cause of a bias – probably by design.

The histogram below is by links – the content getting upvoted or downvoted. There were just over 2 million links submitted. On average, each link received 3.62 votes. Given everything you know about long tails, think about just how deceptive that 3.62 mean figure is. Note how you can’t even see the bumps in the tail. And be in awe of the efficiency of the collective Reddit behavior that causes popular content to be disproportionately promoted while even ‘good’ or ‘average’ content gets relentlessly shifted to the left – all by a very small group of people.
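As a sanity check, the 3.62 figure can be reproduced from the dataset totals quoted in the opening notes: it matches total votes divided by total links. The snippet below is only that arithmetic; the variable names are made up.

```python
# Totals quoted in the opening notes about the dataset.
total_votes = 7_405_561
total_links = 2_046_401

votes_per_link = total_votes / total_links
print(round(votes_per_link, 2))  # 3.62
```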

Takeaways:

  • The long tail is long and powerful.
  • This small group of Power-Paulines is far more likely to downvote because of a much higher frequency of use.

I’m thanking Reddit for making so many API’s publicly exposed and enabling this sort of analysis and exploration. Thank you.

 

Portions of this post appeared on Eyes On Analytics the week of February 5, 2012.

Commentary on the proposed telescreens

You may have read something about the Samsung 7500 and 8000 series televisions, the ones with a camera installed in them, over the past few days.

The tl;dr summary:

“For Samsung’s 7500 and 8000 series TVs, all you have to do is say “Hi, TV,” when you walk into a room for the TV to turn on and know who’s there.”

“Think of it: The tech means an advertiser or TV programmer could, for the first time, know which members of a Nielsen household are watching a show or an ad. Cisco has even developed a system meant to read facial expressions and determine whether you’re entertained or bored.”

“Many people in the living room are multitasking with other devices. “We’re paying for that,” said Rex Harris, innovations supervisor at SMGX, a unit of ad agency holding company Publicis Groupe. “If you’re looking at other screens, then you’re not paying attention. We would like to know if we’re getting accurate impressions.””

Commentary:

Alright – so – a simple innovation, the webcam, is jumping from the PC/DVR into a TV, and we get a few folks who come out and speculate what it could mean. It all ends up sounding like a 1984 telescreen idea, which, I’m 99% certain, is not what Samsung has/had in mind.

Broadcast isn’t digital.

Repeat: broadcast. isn’t. digital.

This has implications:

  • There is enough inventory for targeted ads and offers in digital because the technology enables the creation of multiple ad treatments at scale. No such technology exists in the broadcast industry.
  • People already effectively segment themselves by TV show preference.
  • On Demand technologies like Netflix, and time shifting technologies like streaming and DVR’s, are already eroding the concentration of key market segments.
  • Plot the S-curve adoption rate of the technologies driving market fragmentation against the adoption of new, Big-Brother enabled telescreens, and see which wins. (Hint: it’s time shifting and on-demand).
  • You’re paying for junk impressions because we’re developing ad blindness, just like we’ve developed banner blindness.

No amount of surveillance is going to change that fact.

Find Hidden Patterns in Big Data – A Commentary on MINE, Reshef et al (2011)

You may have read something about ‘Detecting Novel Associations in Large Data Sets’, a paper appearing in Science, 334, 1518 (2011) by David N. Reshef et al. You can check out the software here.

This is an initial commentary and an explanation about what it’s all about.

The Longer You Look, The More Likely Error will Find You

Take a very large dataset, say, all the customers of AT&T and their calling records 2001-2011, and divide it into two random but equal sets. Say you didn’t have any hypothesis at all. You just wanted to see what was related to each other in that set. Say, each customer record has 5000 features, including gender, date of birth, credit score, average call durations, most frequently dialed number, and so on. (Note to statisticians: Assume a Pearson R correlation matrix, skip next paragraph).

Assume, further, that you’re going to compare each feature against every other one. So, you compare all the ages against all the dates of birth. And then all the ages against credit scores, and so on. And the strength of the relationship between two features is expressed by a single number. The higher that number is, the stronger the relationship between the two. For instance, we might find that credit score and age are tightly correlated – the older one is, the more likely their credit score is to be positive.

You’re likely to find clearly incorrect relationships in such a large table, just by accident. You might find that in Dataset A, for instance, there’s a statistically significant relationship between being a Virgo and having a negative credit score. There might be a relationship between average call duration and being a Capricorn. You know that such a result doesn’t make sense. Why would zodiac sign (derived from date of birth) affect those things? The way that chance works in such large tables is that the longer you look for significant features, the more likely it is that you’ll find a relationship that doesn’t in fact hold in the real world.

In fact, most of those relationships would disappear in Dataset B. However, new, clearly untrue relationships would appear in Dataset B that don’t exist in Dataset A. When you’re dealing with thousands of features, the likelihood of such phenomena increases. And that’s even holding everything we know about probability to be true.

In sum, a big reason why you go into a dataset with a hypothesis is to reduce the risk of coming up with something that is wrong, and very unlikely to be repeatable in other datasets.
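The effect is easy to reproduce with pure noise. The sketch below is an illustration under stated assumptions, not an analysis of any real customer data: it generates two independent datasets of random numbers (40 features, 200 rows each, both sizes arbitrary), compares every pair of features with a hand-written Pearson correlation, and flags pairs whose correlation exceeds 0.14 in absolute value, which is roughly the two-tailed 5% significance cut-off for 200 rows.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def flagged_pairs(rows, threshold):
    """Return the set of feature index pairs whose |r| exceeds the threshold."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    return {
        (i, j)
        for i in range(n_features)
        for j in range(i + 1, n_features)
        if abs(pearson_r(columns[i], columns[j])) > threshold
    }

N_FEATURES, N_ROWS = 40, 200  # arbitrary sizes for the illustration
THRESHOLD = 0.14              # approximate p < 0.05 cut-off for 200 rows (assumption)

rng = random.Random(2011)     # fixed seed so the run is repeatable

def noise_dataset():
    # Every feature is independent random noise, so no real relationship exists.
    return [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(N_ROWS)]

hits_a = flagged_pairs(noise_dataset(), THRESHOLD)
hits_b = flagged_pairs(noise_dataset(), THRESHOLD)

print("flagged in A:", len(hits_a))
print("flagged in B:", len(hits_b))
print("flagged in both:", len(hits_a & hits_b))
```

With 780 feature pairs and a 5% false positive rate, a few dozen ‘significant’ pairs are expected in each dataset even though none is real, and only a small fraction of them should show up in both.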

Linear, Cubic, Exponential, Parabolic, Ellipse

Not all relationships are straight lines. Indeed, especially in certain types of logistic regression, we can get very amazing, very beautiful and complex shapes separating one case from another. Diaper usage plotted against age is a parabolic relationship. Think about it. You use a lot of them when you’re young, and you go through a lot of them when you’re very old. You don’t need many of them in between. Linear regression wouldn’t perform very well in detecting that pattern.
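To make that concrete, here is a sketch with an invented, perfectly U-shaped diaper curve. The ages and usage numbers are hypothetical and chosen to be symmetric around age 40; the point is only that a relationship can be completely deterministic and still show a Pearson correlation of essentially zero.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical U-shaped usage: high at age 0 and age 80, zero at age 40.
ages = list(range(0, 81, 10))
diapers_per_week = [(age - 40) ** 2 / 40 for age in ages]

r = pearson_r(ages, diapers_per_week)
print(abs(r) < 1e-9)  # True: the linear correlation is essentially zero
```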

Enter Reshef et al and MIC

MIC stands for Maximal Information Coefficient. Reshef et al invented a neat way of looking at relationships between variables that doesn’t rely solely on a key statistical test (Pearson R) to indicate that it’s there. The authors demonstrated how MIC manages to detect correlations between all these complex relationship types – Cubic, Exponential, Sinusoidal – and does it really well. They went further. They created a program that can mine very large datasets and suggest relationships to examine.

What’s the Problem?

Remember that ‘the longer you look, the more likely you’ll find something false’ idea? The entire idea of hypothesis testing as the basis of quantitative analysis is an entrenched one. It’s an idea that causes resistance to advanced machine learning algorithms and pattern discovery. Reshef really did a great job in explaining the purpose of MIC. Reshef has merely stated that this is a hypothesis-informing machine. You can use the program and MIC to discover relationships that were once really quite hidden, or very, very difficult to discover without insanely expensive software. I think this is great.

The Opportunity

We’re generating huge amounts of data. The big-feature, big-data problem is increasingly common. This is a great tool to rapidly inform hypotheses – to become smarter before getting smarter. It’s a welcome advancement, and worthy of attention.

If you hear of MIC, just know that a MIC of 0.00 means that there is no correlation between two variables, and that a MIC of 1.00 indicates a perfect correlation between two variables. Be aware that MIC does not imply linearity between the variables; the relationship may be a much higher-order function. The next questions you should ask upon hearing a MIC score are ‘at what confidence interval is it significant?’ and ‘what kind of relationship is it?’. Then deep dive.

I’m excited.

How to predict how many visits a website will receive on a given day

Predictive analytics is somewhat mysterious. So, let’s shed some light on it.

(Note that I’m simplifying this quite a bit to be accessible.)

The first step in predictive analytics is to understand what you’re predicting. We’ll call this the Y variable.

In this instance, ‘how many visits from Boston can I expect on a given day’. My Y will be ‘Visits’.

I’m curious about it.

Have some discipline. I see way too many analysts change the Y variable before their investigation is through.

The second step is to identify all the variables that might be associated with a variation in Y. These might include factors like paid media, search, new visits, returning visits – and date. Then there are paid campaigns, posting new content, social campaigns, traditional media spend, promotions, and so on. Day of the week is another key variable, along with statutory holidays, and extending out to other factors like weather and creativity.

The third step is to extract, transform, and load the data you CAN actually access. You can spend months fighting to build an absolute complete model, or, you CAN start putting together a story with the facts that are available. I chose action over inertia. You should too.

That date field is usually pretty bad to extract, transform, and load. There are functions both in Excel and SPSS that handle dates with some difficulty. Devils abound in the details around ‘the date where in the world’. If your installation is set to Eastern Time, and most of your traffic comes from Australia, you’ll be one day lagged. You ought to adjust the figures using the appropriate offset.
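The one-day lag can be shown with a small sketch. It assumes fixed UTC offsets (UTC-5 for Eastern, UTC+11 for Sydney in February) to stay self-contained, so it ignores daylight saving changes; real data would need proper time zone rules. The timestamp and the `local_date` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are a simplifying assumption; they ignore daylight saving changes.
EASTERN = timezone(timedelta(hours=-5))
SYDNEY = timezone(timedelta(hours=11))

def local_date(timestamp, visitor_tz):
    """Return the calendar date of a visit as the visitor experienced it."""
    return timestamp.astimezone(visitor_tz).date()

# A hypothetical visit logged at 8 pm on February 5 by an Eastern Time installation.
visit = datetime(2012, 2, 5, 20, 0, tzinfo=EASTERN)

print(visit.date())               # 2012-02-05
print(local_date(visit, SYDNEY))  # 2012-02-06
```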

The figure below is what I could extract from Google Analytics in about an hour. (Collinearity abounds!)

The fourth step is to run the math against your model.

I use SPSS to run a regression. If you don’t have SPSS, you can try using open source programs like Octave or R. The reason for using software is because it’s annoying to do by hand. I didn’t enjoy a copy of SPSS at my first research position, so I had to code out linear regression in Excel. I learned a lot, but it is not expedient!
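If you are curious what the software is doing, a one-variable linear regression can be worked out in a few lines. This is a sketch of ordinary least squares on invented numbers, not the SPSS procedure and not the Boston data: the seven daily visit counts and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = constant + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    constant = mean_y - slope * mean_x
    return constant, slope

# Hypothetical week: 1 marks a weekend day, 0 a weekday.
istheweekend = [0, 0, 0, 0, 0, 1, 1]
visits = [5, 4, 6, 5, 5, 3, 3]

constant, slope = fit_line(istheweekend, visits)
print(round(constant, 3), round(slope, 3))  # 5.0 -2.0
```

With a single 0/1 variable, the constant is simply the weekday average and the slope is the difference between the weekend and weekday averages.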

The figure below is the output from the software.

The way to read the table is Y = Constant + B1(X1) + B2(X2).

So, Visits = 4.888 – 1.872 (istheweekend).

If it’s the weekend, I can predict Visits = 4.888 – 1.872 (1). Which equals 3.016, or about 3 visits.

If it’s not the weekend, I can predict Visits = 4.888 – 1.872(0). Which equals 4.888.
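The same arithmetic as a function, using the two coefficients from the output above. The function name `predict_visits` is made up for the example.

```python
def predict_visits(istheweekend):
    """Predicted Boston visits from the one-variable model: 4.888 - 1.872 * istheweekend."""
    return 4.888 - 1.872 * istheweekend

print(round(predict_visits(1), 3))  # 3.016 on a weekend day
print(round(predict_visits(0), 3))  # 4.888 on a weekday
```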

Not bad for Boston traffic! And I understand the impact of a single variable on visits.

My dataset is incredibly spiky. So, what’s causing some of that spikiness? I went through all the dates that I posted new content – reran the math, and got the table below.

The model above is the best. It explains 12.7% of the variance in the set.

The equation is: Visits = 4.496 – 1.76(istheweekend) + 2.482(newpost).

I can tell – according to this version of reality – that if I want the maximum bump from Boston, posting during the weekday is best. And I can tell the proportional impact of each variable.
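Plugging every combination of the two variables into the equation makes the comparison explicit. The coefficients come from the equation above; the function and variable names are made up for the example.

```python
from itertools import product

def predict_visits(istheweekend, newpost):
    """Predicted Boston visits from the two-variable model."""
    return 4.496 - 1.76 * istheweekend + 2.482 * newpost

# Every combination of weekend (0/1) and new post (0/1).
scenarios = {
    (weekend, post): predict_visits(weekend, post)
    for weekend, post in product((0, 1), repeat=2)
}

for (weekend, post), visits in sorted(scenarios.items(), key=lambda item: -item[1]):
    print(f"istheweekend={weekend} newpost={post} -> {visits:.3f}")

best = max(scenarios, key=scenarios.get)
print(best)  # (0, 1): a new post on a weekday
```

The four predictions are 6.978 (weekday, new post), 5.218 (weekend, new post), 4.496 (weekday, no post), and 2.736 (weekend, no post).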

Sometimes this answer is good enough. There are more advanced methods – like curvilinear regression, machine learning, and neural networks. There are ways to introduce more variables into the equation. But typically – this method is sufficient to get a first idea about the relationships among variables and their relative importance, rooted in fact, as opposed to gut bias.

The fifth step is to make decisions based on scenarios.

If you take this equation and plot it out, you can engage in a few what-if’s. Would writing more weekend friendly material result in a lower Beta? Would increasing the frequency of new posts drastically improve the performance of the website? If so, by how much? The size of the newpost beta, as compared to the total number of Boston visits per day hints at that relative strength.

That’s the power of predictive analytics.

Siri and Search

Gary Morgenthaler had a few interesting statements to make:

“Therefore, when Siri was an independent company, its plan was to map these domains deeply and seamlessly to automate transactions for its users within them. For example, “Buy that Steve Jobs biography book and send it to my dad”; “Send a dozen yellow roses to my wife”; “Book me the usual table for 2 tonight at 8 p.m. at Giovanni’s”; and “Get me 2 box seats for the Giants game on Saturday.”

Then comes the question of what solves our biggest problems. Ultimately, Siri’s value is that of automation and removing “friction” on the Internet. Siri achieves this by: (1) understanding speech input in natural language form, (2) mapping user requests against its knowledge base (i.e., ontological domains) and (3) activating software “agents” to interact with Internet service providers to fulfill user requests.”

Source: TechCrunch

Let’s just forget Google for a minute and focus in on this combination of technologies.

  • Understand.
  • Map.
  • Act.

That’s the general design pattern for a whole range of applications.

Certainly nothing new here.

They’ve solved a good problem. There are certain use cases for which Siri is a great solution.

He ignores the rest of the problem space. And that’s just fine. I don’t expect him to point out the subset of infinite use cases that Siri is woefully inadequate for.

Barriers, like a small keyboard, are soon to be resolved by virtual keypads and a range of next generation hand gestures that are sensed, not tactilely received. I don’t see them as insurmountable.

Even Star Trek TNG made use of both voice and physical commands.

Siri is not a Google-search killer.

It is a nice complement.