So who keeps on downvoting you on Reddit? We’ll find out. But first – three notes: You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is. To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here. The file contains three columns – a vote, a userid, and a link. Only people who had their privacy settings set to open had that data read by an API. There is no meta-data about who these people are in real life (IRL)[…]
Author: Christopher Berry
This is the fifth in a series of five posts about Reddit and Analytics. Previously – we covered the nature of the dataset, read histograms, generated segments, and understood that the most frequent users of Reddit are the ones who are doing the most downvoting by an astounding margin. But wait, there’s more. Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that. Takeaways: Upvotes outnumber downvotes. The interface of Reddit itself causes upvotes to accumulate. Reddit itself is a cause of a bias – probably by design. The histogram below is by links – the content getting upvoted or downvoted. There were just[…]
This is the fourth in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. Previously – we covered the nature of the dataset, read histograms, generated segments, and examined them. Putting a bow on it: The chart below summarizes the relationship between the segments and their average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes – even if the mean is exaggerated in the Power Pauline segment. To really hammer the point home about the origin of downvotes, take a look at the table below. It’s broken out by the segments you understand. It also contains two new variables –[…]
This is the third in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. Previously – we covered the nature of the dataset, read histograms, and generated our segments. Now we’re going to examine each segment individually. One-Time-Olivers: There are very efficient ways that statisticians quickly summarize and understand the relationship among variables. The aim here isn’t to be efficient – but to be clear. In that spirit, I give you the histogram below. Takeaways: All 4877 One-Time-Olivers voted exactly one time. You should lol. It makes sense though, right? And, the segment name should make a lot more sense. The histogram below summarizes how, on average, One-Time-Olivers voted –[…]
This is the second in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. Recall that the average of all the votes a username made is called ‘averagevote’. If somebody was persistently downvoting links, they’d have a negative number. If they upvoted everything they saw, they’d have an averagevote of +1. Read the histogram below. The three takeaways are: Negativity follows a reverse long tail. (It really happens – see how the figures fall away to the left.) On average, usernames upvoted what they saw (average 0.79). There are bumps at 0 (related to a methodological note) and at -1. By now, two of my good friends in London[…]
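To make the definition of ‘averagevote’ concrete, here is a minimal sketch of how it could be computed. It assumes each vote is coded as +1 (upvote) or -1 (downvote), which is what the +1 and -1 endpoints in the excerpt suggest. The function name, the row layout, and the sample usernames are illustrative and not taken from the original analysis.

```python
from collections import defaultdict

def average_votes(rows):
    """Return {userid: mean of that user's votes}.

    Assumes each row is (vote, userid, link) with vote coded +1 or -1.
    """
    totals = defaultdict(int)   # sum of votes per user
    counts = defaultdict(int)   # number of votes per user
    for vote, userid, _link in rows:
        totals[userid] += vote
        counts[userid] += 1
    return {user: totals[user] / counts[user] for user in totals}

# Made-up sample rows, for illustration only.
sample = [
    (1, "alice", "link_a"),
    (1, "alice", "link_b"),
    (-1, "bob", "link_a"),
    (1, "carol", "link_a"),
    (-1, "carol", "link_b"),
]
result = average_votes(sample)
```

In this toy sample, a user who only upvotes lands at +1, a user who only downvotes lands at -1, and an even split lands at 0, which matches the three reference points mentioned in the excerpt.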
This is the first in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. So who keeps on downvoting you on Reddit? We’ll find out. But first – three notes: You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is. To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here. The file contains three columns – a vote, a userid, and a link. Only people who had[…]
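A file with three columns (a vote, a userid, and a link) can be summarized with a short script. The sketch below is one way to do it, under assumptions the excerpt does not confirm: that the columns are tab-separated, appear in that order, and that votes are coded +1 and -1. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

def tally_votes(text_stream, delimiter="\t"):
    """Count votes by value, plus distinct users and links.

    Assumes rows of (vote, userid, link); the delimiter is a guess.
    """
    tally = Counter()
    users = set()
    links = set()
    for vote, userid, link in csv.reader(text_stream, delimiter=delimiter):
        tally[int(vote)] += 1
        users.add(userid)
        links.add(link)
    return tally, len(users), len(links)

# Made-up in-memory sample standing in for the real file.
sample = io.StringIO("1\tu1\tl1\n-1\tu1\tl2\n1\tu2\tl1\n")
tally, n_users, n_links = tally_votes(sample)
print(tally[1], tally[-1], n_users, n_links)
```

For the compressed file itself, the same function could be given a stream opened with `gzip.open(path, "rt", newline="")`, where `path` is wherever the download was saved.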
Kurt wrote an excellent post about building a data science team. It’s worth reading. To expand on his points: The first 90 days provide fuel for the subsequent 180. The 180 days after are far muddier, because what was scaling in very unsophisticated interfaces requires a lot more work to become elegant solutions. Data scientists should evangelize evidence and do what they can to develop interfaces that democratize the data. The math is a means to the end. Own reflections: I’m extremely thankful for my years of experience with Information Architects and Designers – as now – when I go into a room and they’re not around, I actively think about that end state. I’m glad I’ve[…]
A fellow data scientist and I were debating how to answer a very specific question that is asked all the time by others. How would we answer it? I grabbed a piece of paper and drew a histogram. A histogram: Plots a single variable along the X-axis. Plots the occurrence, or frequency, of that variable’s values along the Y-axis. Is used by statisticians and analysts to understand the frequency distribution of a given variable. I said: “This is how I would want to see the data. This is how I answer the question today. This is what I would want to compare.” Then paused. Reflected. And added, “I am not the end user.” The end user isn’t a statistician, marketing[…]
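The definition of a histogram above can be sketched in a few lines: group the values of one variable into bins (the X-axis) and count how many values fall into each bin (the Y-axis). This is a minimal illustration with made-up numbers and a hypothetical function name, not the chart drawn in the post.

```python
from collections import Counter

def histogram(values, bin_width):
    """Map each bin's lower edge to the number of values in that bin."""
    counts = Counter()
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))

# Made-up values of a single variable.
values = [1, 2, 2, 3, 7, 8, 12]
bins = histogram(values, bin_width=5)

# Text rendering: one row per bin, one '#' per observation.
for lower_edge, frequency in bins.items():
    print(f"{lower_edge:>3} | {'#' * frequency}")
```

A plotting library would draw the same counts as bars; the text version is used here only to keep the sketch free of dependencies.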
“Don’t Make Me Think” by Steve Krug is one of my favourite books. I strongly recommend it to web analysts and data scientists. In that spirit – here are a few of my favourite interfaces: pinterest.com rdio.com imgur.com Commonalities: Real choices about what to put in and leave out were made – in other words – they are designed. They were not assembled. Not every surface is crammed with stuff. Just because nature abhors a vacuum doesn’t mean you need to cram something into every pixel. It’s obvious what everything does. Simple can be functional. What are your nominations? *** I’m Christopher Berry. I tweet about analytics @cjpberry. I write at christopherberry.ca
What’s the Return On Investment on Marketing? Depends on how soon you want your return. Time is frequently a neglected variable. Recall that marketing had a schism right around 1920: One man went on to found the branding agency, and found salvation through broadcast radio, and later, TV. One man founded the first direct advertising agency, and continued to find salvation through direct response and cataloging. The schism only really came to a head when digital forced it to. Implications: Evidence for a direct causal inference between marketing treatment and marketing conversion is greatest at the point of sale / point of conversion. Any evidence of causality is severely diluted at the branding / awareness level[…]