So who keeps on downvoting you on Reddit? We’ll find out.
But first – three notes:
- You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is.
- To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file containing 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here.
- The file contains three columns – a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL) or even what the nature of the content they were upvoting and downvoting was.
So, who’s downvoting you on reddit?
To find out, I took that huge file and transformed it into another one – boiling it down into a single username, how many times that username voted (numberofvotes), and the average of all their votes (averagevote).
You can see below that _mike voted 26 times, and, if you take the average of all his votes, +1 for an upvote and -1 for a downvote, it turns out to be -.92. Basically, _mike didn’t like a lot of what he saw. In fact, _mike upvoted once (+1) and downvoted 25 times (-25). So (- 25) + (+1) is -24, and -24/26 is -.92.
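The boiling-down step can be sketched in a few lines of Python. This is a minimal sketch, not the tool used for this post: it assumes each row has already been parsed into a (vote, userid, link) tuple with votes coded as +1 and -1, and the function name summarize_votes and the link IDs are made up for illustration.

```python
from collections import defaultdict

def summarize_votes(rows):
    """Collapse (vote, userid, link) rows into per-user numberofvotes and averagevote."""
    totals = defaultdict(lambda: [0, 0])  # userid -> [numberofvotes, sum of votes]
    for vote, userid, _link in rows:
        totals[userid][0] += 1
        totals[userid][1] += vote
    return {
        userid: {"numberofvotes": n, "averagevote": s / n}
        for userid, (n, s) in totals.items()
    }

# Hypothetical rows reproducing _mike: 1 upvote and 25 downvotes.
rows = [(1, "_mike", "link0")] + [(-1, "_mike", f"link{i}") for i in range(1, 26)]
summary = summarize_votes(rows)
print(summary["_mike"]["numberofvotes"])          # 26
print(round(summary["_mike"]["averagevote"], 2))  # -0.92
```

The same loop works on the full file, since it only keeps two running numbers per username.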
There are over 30,000 usernames here – and that’s a lot of data. It’s really important to visualize the data before you really get into any analysis. One way to do that is to run a histogram.
To read the histogram below, remember:
- Frequency means ‘the number of usernames that fall into this category or range’.
- Numberofvotes means ‘the number of times a username voted.’
- Mean is another word for average.
There are three takeaways from the histogram above:
- The average number of votes by a username was 234.
- A large number of usernames didn’t vote very many times at all.
- There are bumps at 1000 and 2000 votes. (If you’re interested in why – see the Methodological notes. Incidentally – this is why you should always visualize your data.)
A histogram is built from a Frequency Table, which we’ll see below.
The way to read a frequency table is:
- The ‘Valid’ column means ‘how many times a username voted’.
- Frequency means ‘the number of usernames that fall into this category’.
- Percent means ‘the percentage of all the usernames that those in this category represent’.
There are three takeaways from the Frequency Table above:
- 4877 of the usernames only voted one time (It’s likely they submitted a single link and never returned).
- Note how both the percentages and number of usernames in each category decrease.
- 50.1% of all the usernames voted 20 times or less. (Look at the cumulative percent column and make sure that makes sense to you. We’re going to use this column later.)
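For readers who want to build such a table themselves, here is a rough Python sketch of a frequency table with percent and cumulative percent columns. The numbers are made up for ten imaginary usernames, and frequency_table is a hypothetical helper, not the statistics package used for the tables in this post.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent), sorted by value."""
    counts = Counter(values)
    total = len(values)
    table, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        table.append((value, counts[value],
                      100 * counts[value] / total,
                      100 * running / total))
    return table

# Made-up numberofvotes for ten usernames (not the real data).
toy_votes = [1, 1, 1, 2, 2, 3, 5, 20, 300, 2000]
for value, freq, pct, cum in frequency_table(toy_votes):
    print(f"{value:>5} {freq:>3} {pct:6.1f} {cum:6.1f}")
```

The cumulative column is just a running total of the percent column, which is why it always ends at 100.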
You may have heard the term ‘long tail’ many times before. This is a demonstration of what that means. The bars on the histogram fall away to the right.
Recall that the average of all the votes a username made is called ‘averagevote’. If somebody was persistently downvoting links, they’d have a negative number. If they upvoted everything they saw, they’d have an averagevote of +1.
- Negativity follows a reverse long tail. (It really happens – see how the figures fall away to the left.)
- On average, usernames upvoted what they saw (average 0.79).
- There are bumps at 0 (related to a methodological note) and at -1.
By now, two of my good friends in London are screaming at the screen. Means are a horrible way to explain long tail distributions. You can see that now too. Means are giving us a pretty skewed view of the world.
The table below is a byproduct of our Frequency Table. It’s aptly labeled ‘Statistics’, and compares these two variables, numberofvotes and averagevote, side by side. I’ve thrown a yellow box around ‘percentiles’. Recall the cumulative percent column from the previous frequency table.
- 22.8% of all usernames voted 2 times or less.
- 40.8% voted 9 times or less.
The program I’m using is giving me ‘break points’ for those percentiles.
- The median gives a better summary of what’s going on here – half of the usernames voted 20 times or less, and, another set of usernames always upvoted what they saw.
- If I know that roughly 80% of all usernames voted 325 times or less, then I know that 20% of the usernames in my sample voted 325 times or more.
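The gap between mean and median, and the idea of percentile break points, can be tried out with Python's standard library. This is only a sketch on ten made-up vote counts; the exact break points depend on the interpolation method, and the program used for this post may compute them differently.

```python
import statistics

# Made-up numberofvotes with a long tail (not the real data).
toy_votes = [1, 1, 2, 3, 5, 9, 20, 48, 325, 2000]

mean_votes = statistics.mean(toy_votes)
median_votes = statistics.median(toy_votes)
# n=5 returns four cut points splitting the data into five equal-sized groups.
cutoffs = statistics.quantiles(toy_votes, n=5)

print(mean_votes, median_votes)  # the mean lands far above the median
print(cutoffs)
```

A couple of very heavy voters are enough to drag the mean far to the right while the median stays put.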
We’re going to use those percentile cutoff points to inform a segmentation, next.
A segmentation is a grouping of records, usually people, into categories. There is no prescription for how to do this. If you talk to a modeller, they’ll tell you about their clustering algorithms. If you talk to a machine learning scientist, they’ll tell you about bump-hunting or unsupervised machine learning clustering. Those are all very good algorithms. I use them myself.
I’m going for simplicity here. I have these four percentile cut-off points that evenly cut people into five categories. And, for further simplicity, instead of referring to a group of people who voted between 9 and 48 times as ‘those who voted between 9 and 48 times’, I’m going to call them Average-Andys. And I’ll just keep on calling them that.
At this point, I don’t know if they’re male or female. (And we won’t in this thread). And it’s controversial to use alliteration. But it’s done.
So, mapping the percentiles against a segmentation, based on how many times a username voted, we have:
- 1 time: One-Time-Oliver
- 2 to 9 times: Vanity-Vanessa
- 9 to 48 times: Average-Andy
- 48 to 325 times: Frequent-Fred
- More than 325 times: Power-Pauline
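As a rough sketch, the mapping above could be written as a small Python lookup. The list shares 9 and 48 between neighbouring segments, so this sketch assumes each upper bound is inclusive (9 votes counts as Vanity-Vanessa, 48 as Average-Andy); that choice is an assumption, and the function name simply borrows the ‘equalseg’ variable name.

```python
import bisect

# Upper bounds from the percentile break points. Treating each bound as
# inclusive is an assumption: the list above shares 9 and 48 between segments.
UPPER_BOUNDS = [1, 9, 48, 325]
SEGMENTS = ["One-Time-Oliver", "Vanity-Vanessa", "Average-Andy",
            "Frequent-Fred", "Power-Pauline"]

def equalseg(numberofvotes):
    """Map a vote count to its segment name."""
    return SEGMENTS[bisect.bisect_left(UPPER_BOUNDS, numberofvotes)]

print(equalseg(1))     # One-Time-Oliver
print(equalseg(26))    # Average-Andy (this is where _mike lands)
print(equalseg(2000))  # Power-Pauline
```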
Take a look at the result below – a variable I’m calling ‘equalseg’ – short for ‘equal segmentation’.
- There are 4877 One-Time-Olivers, representing 15.5% of the usernames in the sample.
- Vanity-Vanessas represent 23.9% of the usernames.
- The last three segments are pretty equally divided – the first two are more lopsided.
Even though I aimed to have five groups of people with equal numbers in each, you can see the division between One-Time-Olivers and Vanity-Vanessas is off. This happens very often when segmenting a long tail into equal groups. And, while not ideal, it’s okay for our purposes.
Next, we’re going to examine each segment individually.
There are very efficient ways that statisticians quickly summarize and understand the relationship among variables. The aim here isn’t to be efficient – but to be clear. In that spirit, I give you the histogram below.
- All 4877 One-Time-Olivers voted exactly one time.
You should lol. It makes sense though, right? And, the segment name should make a lot more sense.
The histogram below summarizes how, on average, One-Time-Olivers voted – positive or negative. Since they only voted one time, it’s either an upvote, or a downvote. A +1 or -1 average.
- One-Time-Olivers tend to upvote once, and are never heard from again.
- In answering the question – “Who’s downvoting you on Reddit”, it isn’t One-Time-Olivers.
Vanity accounts frequently enter Reddit, they flicker, and they go out. They get discouraged. They never really commit to the bit. That’s what happens to them. The histogram below takes on that familiar long-tail curve.
- There are a lot of Vanity-Vanessas, some 7,527 of them.
- Most of them voted only 2, 3, or 4 times.
So, how did they vote?
The histogram below summarizes the story:
- Vanity-Vanessas upvoted nearly everything they saw, with very few exceptions.
- Very few persistently downvoted everything they saw.
- They’re not the ones downvoting you on Reddit.
Recall that the average username votes 234 times, and yet I still labeled the segment ranging between 9 and 48 votes as Average-Andy. That’s because the mean number of votes that Average-Andys cast is 22.25 – which is close to the median of 20 for the entire set.
This mixing and abstraction of median, mean, and segmentation isn’t something that I expect most people to consider or think about, but I can foresee some getting hung up on it. When you think about an equal segmentation though, it makes sense that the mean of your middle category should be close to the median of the entire set.
For everybody else – just know that you’re looking at the “average joe redditor” here.
- Average number of votes is 22.25, close to the median of 20 for the whole set.
- Familiar long tail.
How do they vote?
- A majority of Average-Andys liked everything they saw – they upvoted everything.
- They downvote more often than Vanity-Vanessas or One-Time-Olivers, but not massively.
- They don’t downvote heavily enough to say that these are the ones downvoting you on Reddit.
By now you’re pretty much a pro at reading these histograms. Frequent-Freds vote frequently. Look at the histogram below.
- Classic long-tail continues.
- Averaging 139.3 votes.
- The unusual bump at the beginning of the series is just magnified by the scale from the previous vote frequency histogram. (It’s fine).
How do they vote?
- Far fewer of them are likely to upvote absolutely everything they see.
- There’s significant flattening of the long tail – the average is .74.
- More of them, on average, are disposed to downvoting.
Power Paulines are the most difficult group to analyze, but the easiest to summarize and understand. Take a look at the histogram below.
- The long tail is holding – there’s significant clustering at 1000 and 2000.
- The cause is related to rate limiting within the Reddit API.
- The longest part of the long tail – those power users with thousands and thousands of votes, are all bundled and clustered together at 2000.
- There are around 500 such power users, representing some 1.5% of the total usernames.
So how do they vote?
- The bump at 0 is caused by 1000 upvotes getting averaged out by 1000 downvotes.
- 0’s aside, which are tugging on the mean, Power-Paulines are on average more prone to downvoting.
- Power Paulines are downvoting you on Reddit.
Putting a bow on it
The chart below summarizes the relationship between each segment and its average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes – even if the mean is exaggerated in the Power-Pauline segment.
To really hammer the point home about the origin of downvotes, take a look at the table below. It’s broken out by the segments you understand. It also contains two new variables – upvotes and downvotes. That is the total count of the number of upvotes and downvotes made by each segment.
- One-Time-Olivers as a group were responsible for 175 of all the downvotes cast.
- Vanity-Vanessas as a group were responsible for 1781 of all the downvotes cast.
- Average-Andys as a group were responsible for 13,258 of all the downvotes cast.
- Frequent-Freds as a group were responsible for 120,758 of all the downvotes cast.
- Power-Paulines as a group were responsible for 1,672,368 of all the OBSERVED downvotes cast – but are probably responsible for a lot more in aggregate across all of Reddit. (This sample contains a bias, but bias doesn’t mean I can’t say anything at all about anything.)
Note the differences in order of magnitude between each group. 1781 is roughly 10 times greater than 175. And so it goes, a bit imperfectly, on the way up to the Frequent-Freds. There’s an order of magnitude difference here in terms of the amount of weight each group casts.
The greatest power users of Reddit are the ones who are downvoting you – and it’s an exponential power.
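To check the order-of-magnitude claim, the downvote totals from the list above can be compared step by step. This is a quick sketch in Python; the only inputs are the five figures already quoted, and the variable names are mine.

```python
# Downvote totals per segment, copied from the list above.
downvotes = {
    "One-Time-Oliver": 175,
    "Vanity-Vanessa": 1781,
    "Average-Andy": 13258,
    "Frequent-Fred": 120758,
    "Power-Pauline": 1672368,
}

names = list(downvotes)
# Ratio of each segment's downvotes to the segment just below it.
ratios = {
    f"{lo} -> {hi}": downvotes[hi] / downvotes[lo]
    for lo, hi in zip(names, names[1:])
}
for step, ratio in ratios.items():
    print(f"{step}: {ratio:.1f}x")

total = sum(downvotes.values())
pauline_share = 100 * downvotes["Power-Pauline"] / total
print(f"Power-Pauline share of observed downvotes: {pauline_share:.1f}%")
```

The steps come out at roughly 10x, 7x, 9x and 14x, and the Power-Pauline segment accounts for about 92.5% of the observed downvotes.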
But wait, there’s more.
Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that.
- Upvotes outnumber downvotes.
- The interface of Reddit itself causes upvotes to accumulate.
- Reddit itself is a cause of a bias – probably by design.
The histogram below is by links – the content getting upvoted or downvoted. There were just over 2 million links submitted. On average, each link received 3.62 votes (7,405,561 votes spread over 2,046,401 links). Given everything you know about long tails, think about just how deceptive that 3.62 mean figure is. Note how you can’t even see the bumps in the tail. And be in awe of the efficiency of the collective Reddit behavior that causes popular content to be disproportionately promoted while even ‘good’ or ‘average’ content gets relentlessly shifted to the left – all by a very small group of people.
- The long tail is long and powerful.
- This small group of Power-Paulines is far more likely to downvote because of a much higher frequency of use.
I’m thanking Reddit for making so many APIs publicly exposed and enabling this sort of analysis and exploration. Thank you.
Portions of this post appeared on Eyes On Analytics the week of February 5, 2012.