Consider: Only 97% of analysts use Excel at some point in their careers. Only 3% of web analysts have used R. A whopping 0.3% of web analysts have downloaded PANDAS since Monday. Now consider: A whopping 97% of analysts use Excel at some point in their careers. A whopping 3% have used R. Only 0.3% of web analysts have downloaded PANDAS since Monday. Leading words shape perception. Perception shapes both what is asked and the biases within what is asked next. For instance: Who the hell are the 3% who haven’t used Excel? Why is Excel such a dominant tool at 97%? And next: Why aren’t way more web analysts using R? Wow, what do those 3% of web[…]
Scott Hanselman wrote an excellent piece on app geo-location data. If there’s a Nobel Prize for writing blog titles, he would win it. The piece is entitled: It’s 2012 and your kids have an iPhone – Do you know where they are? I do. Admiration aside, yes, you’re living through one of the greatest rises of applied Geographic Information Systems (GIS), ever. It’s bigger than the launching of the first weather satellite. Or Landsat. This time, it’s millions of people equipped with sensors. And they’re doing the sensing. Many apps use geo-location data as part of what they do, with varying utility for the user: There are traffic congestion apps that rely on applied GIS – to crowdsource intelligence[…]
This is the fifth in a series of five posts about Reddit and Analytics. Previously – we covered the nature of the dataset, read histograms, generated segments, and understood that the most frequent users of Reddit are the ones who are doing the most downvoting by an astounding margin. But wait, there’s more. Recall, however, that there were over 7 million votes cast. 1.8 million were downvotes, and 5.5 million were upvotes. Read the statistics table below to verify that. Takeaways: Upvotes outnumber downvotes. The interface of Reddit itself causes upvotes to accumulate. Reddit itself is a cause of a bias – probably by design. The histogram below is by links – the content getting upvoted or downvoted. There were just[…]
This is the fourth in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. Previously – we covered the nature of the dataset, read histograms, generated segments, and examined them. Putting a bow on it: The chart below summarizes the relationship between the segments and their average vote. You can see a clear negative direction. The more one uses Reddit, the more one downvotes – even if the mean is exaggerated in the Power Pauline segment. To really hammer the point home about the origin of downvotes, take a look at the table below. It’s broken out by the segments you understand. It also contains two new variables –[…]
This is the third in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. Previously – we covered the nature of the dataset, read histograms, and generated our segments. Now we’re going to examine each segment individually. One-Time-Olivers: Statisticians have very efficient ways to quickly summarize and understand the relationship among variables. The aim here isn’t to be efficient – but to be clear. In that spirit, I give you the histogram below. Takeaways: All 4877 One-Time-Olivers voted exactly one time. You should lol. It makes sense though, right? And, the segment name should make a lot more sense. The histogram below summarizes how, on average, One-Time-Olivers voted –[…]
This is the second in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. Recall that the average of all the votes a username made is called ‘averagevote’. If somebody was persistently downvoting links, they’d have a negative number. If they upvoted everything they saw, they’d have an averagevote of +1. Read the histogram below. The three takeaways are: Negativity follows a reverse long tail. (It really happens – see how the figures fall away to the left.) On average, usernames upvoted what they saw (average 0.79). There are bumps at 0 (related to a methodological note) and at -1. By now, two of my good friends in London[…]
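For readers who want to see ‘averagevote’ as arithmetic, here is a minimal sketch. It assumes votes are coded as +1 (upvote) and -1 (downvote), and that each row follows the vote, userid, link layout mentioned in the first post of the series. The function name, the usernames, and the sample rows are all hypothetical; none of them come from the real dataset.

```python
from collections import defaultdict


def average_votes(rows):
    """Return {userid: mean vote} for an iterable of (vote, userid, link) rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for vote, userid, _link in rows:
        totals[userid] += vote
        counts[userid] += 1
    return {user: totals[user] / counts[user] for user in totals}


# Hypothetical toy rows (assumption: +1 is an upvote, -1 is a downvote).
sample_rows = [
    (1, "alice", "link_a"),
    (1, "alice", "link_b"),
    (-1, "bob", "link_a"),
    (-1, "bob", "link_c"),
    (1, "carol", "link_a"),
    (-1, "carol", "link_b"),
]

averagevote = average_votes(sample_rows)
for user, avg in sorted(averagevote.items()):
    print(user, avg)
```

In this toy data, a user who upvotes everything lands at +1, a persistent downvoter lands at -1, and an even split lands at 0, which is the scale the histogram is read on.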
This is the first in a series of five posts about Reddit and Analytics. The complete thread will be posted at the end of the week. So who keeps on downvoting you on Reddit? We’ll find out. But first – three notes: You may be familiar with Reddit. If you’re not – you can read this explanation about what Reddit is. To answer that question, I downloaded a dataset that was built in early 2011 or very late 2010. The dataset is a 29MB gzip-compressed file and contains 7,405,561 votes from 31,927 users over 2,046,401 links. You can read about the methodology here. The file contains three columns – a vote, a userid, and a link. Only people who had[…]
Kurt wrote an excellent post about building a data science team. It’s worth reading. To expand on his points: The first 90 days provide fuel for the subsequent 180. The 180 days after are far muddier, because what was scaling in very unsophisticated interfaces requires a lot more work to become elegant solutions. Data scientists should evangelize evidence and do what they can to develop interfaces that democratize the data. The math is a means to the end. Own reflections: I’m extremely thankful for my years of experience with Information Architects and Designers – as now – when I go into a room and they’re not around, I actively think about that end state. I’m glad I’ve[…]
A fellow data scientist and I were debating how to answer a very specific question that is asked all the time by others. How would we answer it? I grabbed a piece of paper and drew a histogram. A histogram: Plots a single variable along the X-axis. Plots the occurrence, or frequency, of that variable along the Y-axis. Is used by statisticians and analysts to understand the frequency distribution of a given variable. I said: “This is how I would want to see the data. This is how I answer the question today. This is what I would want to compare.” Then I paused. Reflected. And added, “I am not the end user.” The end user isn’t a statistician, marketing[…]
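The definition of a histogram above can be sketched in a few lines. This is only an illustration: the function name, the bin width, and the values are hypothetical, and a plotting library would normally do this binning for you.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = Counter()
    for value in values:
        # Each value belongs to the bin that starts at the nearest lower multiple.
        bin_start = (value // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical values for a single variable.
values = [1, 2, 2, 3, 7, 8, 12]
width = 5
hist = histogram(values, width)

# X-axis: the bins. Y-axis: how often a value fell into each bin.
for bin_start, frequency in hist.items():
    print(f"{bin_start:>3} to {bin_start + width - 1:<3} {'#' * frequency}")
```

The bin width is the one real choice here: too wide and the shape disappears, too narrow and every bar has a height of one.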
“Don’t Make Me Think” by Steve Krug is one of my favourite books. I strongly recommend it to web analysts and data scientists. In that spirit – here are a few of my favourite interfaces: pinterest.com, rdio.com, imgur.com. Commonalities: Real choices about what to put in and leave out were made – in other words – they are designed. They were not assembled. Not every surface is crammed with stuff. Just because nature abhors a vacuum doesn’t mean you need to cram something into every pixel. It’s obvious what everything does. Simple can be functional. What are your nominations? *** I’m Christopher Berry. I tweet about analytics @cjpberry. I write at christopherberry.ca