This is the second in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week.

Recall that the average of all the votes a username made is called ‘averagevote’. If somebody was persistently downvoting links, they’d have a negative number. If they upvoted everything they saw, they’d have an averagevote of +1.
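For anyone who wants to reproduce the idea, here is a minimal Python sketch of how an averagevote could be computed. It assumes each vote is recorded as +1 (upvote) or -1 (downvote) next to a username. The function name and the sample usernames are made up for illustration and are not from the Reddit data.

```python
from collections import defaultdict

def average_votes(votes):
    """Return {username: mean vote}, where each vote is +1 or -1."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for username, vote in votes:
        totals[username] += vote
        counts[username] += 1
    return {user: totals[user] / counts[user] for user in counts}

# Hypothetical sample: an always-upvoter, an always-downvoter, and a mixed voter.
sample = [
    ("alice", 1), ("alice", 1),
    ("bob", -1), ("bob", -1),
    ("carol", 1), ("carol", -1),
]
result = average_votes(sample)
```

In this sketch the persistent downvoter lands at -1, the persistent upvoter at +1, and the mixed voter at 0.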

Read the histogram below.   
The three takeaways are:

  • Negativity follows a reverse long tail. (It really happens – see how the figures fall away to the left.)
  • On average, usernames upvoted what they saw (average 0.79).
  • There are bumps at 0 (related to a methodological note) and at -1.

By now, two of my good friends in London are screaming at the screen. Means are a horrible way to explain long tail distributions. You can see that now too. Means are giving us a pretty skewed view of the world.
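To make the point about means concrete, here is a small Python sketch with invented vote counts (not the Reddit sample). One heavy voter is enough to drag the mean far away from what a typical username did, while the median barely notices.

```python
import statistics

# Hypothetical vote counts: most usernames vote a handful of times, one votes a lot.
counts = [1, 1, 2, 3, 5, 8, 20, 5000]

mean_votes = statistics.mean(counts)      # pulled upward by the single heavy voter
median_votes = statistics.median(counts)  # the middle of the sorted counts
```

Here the mean comes out at 630, even though seven of the eight usernames voted 20 times or less. The median is 4.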

The table below is a byproduct of our Frequency table. It’s aptly labeled ‘Statistics’, and compares these two variables, numberofvotes and averagevote, side by side. I’ve thrown a yellow box around ‘percentiles’. Recall the cumulative column from the previous frequency table.

  • 22.8% of all usernames voted 2 times or less.
  • 40.8% voted 9 times or less.

The program I’m using is giving me ‘break points’ for those percentiles.

Two takeaways:

  • The median gives a better summary of what’s going on here – half of the usernames voted 20 times or less, and another set of usernames always upvoted what they saw.
  • If I know that roughly 80% of all usernames voted 325 times or less, then I know that roughly 20% of the usernames in my sample voted more than 325 times.
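The ‘break points’ can be sketched in a few lines of Python. This uses the nearest-rank rule, which is an assumption: statistics packages differ in how they interpolate percentiles, so the program used here may give slightly different cutoffs. The function name and the vote counts are hypothetical.

```python
def percentile_cutoffs(values, percents=(20, 40, 60, 80)):
    """Nearest-rank percentiles: the smallest value with at least p% of records at or below it."""
    ordered = sorted(values)
    n = len(ordered)
    cutoffs = []
    for p in percents:
        rank = (p * n + 99) // 100      # integer ceiling of p * n / 100
        cutoffs.append(ordered[max(rank, 1) - 1])
    return cutoffs

# Hypothetical vote counts for ten usernames.
counts = [1, 1, 2, 2, 3, 5, 9, 20, 48, 325]
cutoffs = percentile_cutoffs(counts)

# The complement rule from the second takeaway: whoever is not at or below
# the 80th percentile cutoff must be above it.
share_above = sum(c > cutoffs[-1] for c in counts) / len(counts)
```

With these invented counts the four cutoffs are 1, 2, 5 and 20, and two of the ten usernames (20%) sit above the last one.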

We’re going to use those percentile cutoff points to inform a segmentation, next.

Segmentation

A segmentation is a grouping of records, usually people, into categories. There is no prescription for how to do this. If you talk to a modeller, they’ll tell you about their clustering algorithms. If you talk to a machine learning scientist, they’ll tell you about bump-hunting or unsupervised machine learning clustering. Those are all very good algorithms. I use them myself.

I’m going for simplicity here. I have these four percentile cut-off points that cut people into five categories of roughly equal size. And, for further simplicity, instead of referring to a group of people who voted between 9 and 48 times as ‘those who voted between 9 and 48 times’, I’m going to call them Average-Andys. And I’ll just keep on calling them that.

At this point, I don’t know if they’re male or female. (And we won’t in this thread). And it’s controversial to use alliteration. But it’s done.

So, mapping the percentiles against a segmentation, based on how many times a username voted, we have:

  • 1 time: One-Time-Oliver
  • 2 to 9 times: Vanity-Vanessa
  • 9 to 48 times: Average-Andy
  • 48 to 325 times: Frequent-Fred
  • More than 325 times: Power-Pauline
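The mapping above could be expressed as a small lookup in Python. One assumption to flag: the list shares boundary values between neighbouring segments (9, 48), so this sketch sends each boundary value to the lower segment, in line with the ‘9 times or less’ and ‘325 times or less’ wording earlier. The constant names are hypothetical.

```python
from bisect import bisect_left

# Upper bound (inclusive, by assumption) of each segment except the last.
CUTOFFS = [1, 9, 48, 325]
SEGMENTS = [
    "One-Time-Oliver",
    "Vanity-Vanessa",
    "Average-Andy",
    "Frequent-Fred",
    "Power-Pauline",
]

def equalseg(numberofvotes):
    """Return the segment name for a username's vote count (expects 1 or more)."""
    return SEGMENTS[bisect_left(CUTOFFS, numberofvotes)]

labels = [equalseg(n) for n in (1, 2, 9, 10, 48, 49, 325, 326)]
```

A username with 9 votes is a Vanity-Vanessa here and one with 10 votes is an Average-Andy. Moving the boundaries the other way only takes changing `bisect_left` to `bisect_right`.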

Take a look at the result below – a variable I’m calling ‘equalseg’ – short for ‘equal segmentation’.

Takeaways:

  • There are 4877 One-Time-Olivers, representing 15.5% of the usernames in the sample.
  • Vanity-Vanessas represent 23.9% of the usernames.
  • The last three segments are pretty equally divided – the first two are more lopsided.

Even though I aimed to have five groups of people with equal numbers in each, you can see the division between One-Time-Olivers and Vanity-Vanessas is off. This happens very often when segmenting a long tail into equal groups. And, while not ideal, it’s okay for our purposes.
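One likely reason for the lopsidedness is that a cutoff can only fall between distinct vote counts, and a long tail piles many usernames onto the same small counts. The figures above fit that pattern: 15.5% of usernames voted once and 22.8% voted 2 times or less, so no whole-number cutoff lands on exactly 20%. A minimal Python sketch with made-up counts (not the Reddit data) shows the same effect:

```python
from collections import Counter

def cumulative_shares(values):
    """Map each distinct value to the share of records at or below it."""
    total = len(values)
    running = 0
    shares = {}
    for value, count in sorted(Counter(values).items()):
        running += count
        shares[value] = running / total
    return shares

# Hypothetical vote counts: three of ten usernames voted exactly once.
votes = [1, 1, 1, 2, 2, 3, 5, 9, 48, 325]
shares = cumulative_shares(votes)
```

In this toy sample, cutting at 1 vote already captures 30% of usernames and cutting at 2 captures 50%. A clean 20% group is not available, because the usernames tied at 1 vote cannot be split.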

Tomorrow, we’ll examine each segment individually and look at its voting characteristics.

 ***

I’m Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca