This is the first in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.
So who keeps on downvoting you on Reddit? We’ll find out.
But first – three notes:
- You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is.
- To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file containing 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here.
- The file contains three columns – a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL), or even about the nature of the content they were upvoting and downvoting.
So, who’s downvoting you on Reddit?
To find out, I took that huge file and transformed it into another one, boiling it down to one row per username: the username, how many times that username voted (numberofvotes), and the average of all their votes.
You can see below that _mike voted 26 times, and, if you take the average of all his votes, counting +1 for an upvote and -1 for a downvote, it turns out to be -.92. Basically, _mike didn’t like a lot of what he saw. In fact, _mike upvoted once (+1) and downvoted 25 times (-25). So (-25) + (+1) is -24, and -24/26 is -.92.
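If you want to try that boiling-down step yourself, here is a minimal sketch in plain Python. It is not the tool I used, just one way to do it. It assumes each row is a (vote, userid, link) tuple with the vote stored as +1 or -1. The name meanvote and the link ids are made up for illustration.

```python
from collections import defaultdict

def summarize_votes(rows):
    """Collapse (vote, userid, link) rows into one summary per userid."""
    # userid -> [number of votes, sum of votes]
    totals = defaultdict(lambda: [0, 0])
    for vote, userid, _link in rows:
        totals[userid][0] += 1
        totals[userid][1] += vote
    return {
        userid: {"numberofvotes": n, "meanvote": s / n}
        for userid, (n, s) in totals.items()
    }

# Toy rows mirroring the _mike example: 1 upvote and 25 downvotes.
# The link ids are placeholders, not real data.
rows = [(1, "_mike", "link_0")]
rows += [(-1, "_mike", f"link_{i}") for i in range(1, 26)]

summary = summarize_votes(rows)
mike = summary["_mike"]
print(mike["numberofvotes"], round(mike["meanvote"], 2))  # prints: 26 -0.92
```

The real file has over 7 million rows, so you would read it line by line from the gzip file rather than build a list in memory, but the counting logic stays the same.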
There are over 30,000 usernames here – and that’s a lot of data. It’s important to visualize the data before you get into any analysis. One way to do that is to run a histogram.
To read the histogram below, remember:
- Frequency means ‘the number of usernames that fall into this category or range’.
- Numberofvotes means ‘the number of times a username voted.’
- Mean is another word for average.
There are three takeaways from the histogram above:
- The average number of votes by a username was 234.
- A large number of usernames didn’t vote very many times at all.
- There are bumps at 1000 and 2000 votes. (If you’re interested as to why – see the Methodological notes. Incidentally – this is why you should always visualize your data.)
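A histogram is just counting how many usernames land in each range of numberofvotes. Here is a rough sketch of that idea in plain Python. The bin width of 100 and the sample numbers are my own assumptions for illustration, not the real data or the settings behind the chart.

```python
from collections import Counter

BIN_WIDTH = 100  # assumed bin size, chosen only for this example

def histogram(vote_counts, bin_width=BIN_WIDTH):
    """Count how many usernames fall into each bin of numberofvotes."""
    bins = Counter((n // bin_width) * bin_width for n in vote_counts)
    return dict(sorted(bins.items()))

# Made-up numberofvotes values, one per username (not the real dataset).
vote_counts = [1, 1, 3, 20, 45, 150, 260, 1000, 1000, 2000]

hist = histogram(vote_counts)
mean_votes = sum(vote_counts) / len(vote_counts)

# Text version of the chart: one row per bin, one '#' per username.
for start, frequency in hist.items():
    end = start + BIN_WIDTH - 1
    print(f"{start:>5}-{end:<5} {'#' * frequency} ({frequency})")
print("mean:", mean_votes)
```

Even in this toy version you can see the shape: most usernames pile up in the first bin, and a few heavy voters sit far out to the right and pull the mean up.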
A histogram is built from a Frequency Table, which we’ll see below.
The way to read a frequency table is:
- The ‘Valid’ column means ‘how many times a username voted’.
- Frequency means ‘the number of usernames that fall into this category’.
- Percent means ‘the percentage of all the usernames that those in this category represent’.
There are three takeaways from the Frequency Table above:
- 4,877 of the usernames voted only once (it’s likely they submitted a single link and never returned).
- Note how both the percentages and number of usernames in each category decrease.
- 50.1% of all the usernames voted 20 times or fewer. (Look at the cumulative percent column and make sure that makes sense to you. We’re going to use this column later.)
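If the cumulative percent column is the part that doesn’t click, it may help to build a tiny frequency table by hand. This is a minimal sketch in plain Python with made-up numbers. The function names are hypothetical and the output will not match the real table.

```python
from collections import Counter

def frequency_table(vote_counts):
    """Return rows of (valid, frequency, percent, cumulative percent)."""
    counts = Counter(vote_counts)
    total = len(vote_counts)
    table, running = [], 0
    for value in sorted(counts):
        running += counts[value]  # usernames seen so far, including this row
        table.append((
            value,
            counts[value],
            100 * counts[value] / total,
            100 * running / total,
        ))
    return table

def first_value_reaching(table, target_percent):
    """First 'valid' value whose cumulative percent reaches the target."""
    for value, _freq, _pct, cum_pct in table:
        if cum_pct >= target_percent:
            return value
    return None

# Made-up numberofvotes values, one per username (not the real dataset).
vote_counts = [1, 1, 1, 1, 2, 2, 2, 3, 3, 20]
table = frequency_table(vote_counts)

print(f"{'Valid':>5} {'Frequency':>9} {'Percent':>7} {'Cumulative':>10}")
for value, freq, pct, cum_pct in table:
    print(f"{value:>5} {freq:>9} {pct:>7.1f} {cum_pct:>10.1f}")

halfway = first_value_reaching(table, 50)
```

Each cumulative percent is just the running total of the percent column, so reading down it tells you what share of usernames voted that many times or fewer. In this toy data, halfway comes out as 2, meaning at least half of the usernames voted twice or fewer.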
You may have heard the term ‘long tail’ many times before. This is a demonstration of what that means: the bars on the histogram fall away to the right.
Tomorrow we’ll look at the distribution of votes and do a segmentation.
I’m Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca