Is hexagon binning really better than scatterplots? It depends. Why this topic, why now? You may have missed Chris Stucchio’s excellent post entitled “Don’t use Scatterplots” on Saturday. It’s causing quite the ruckus. Naturally. Chris used a provocative title and backed it up with a logical foundation. He showed how hexagon binning generates a more accurate view of reality. What’s the difference? This is a scatterplot: it has an X axis and a Y axis (and sometimes a Z), and for every case in the data set, a symbol is used to denote where it is. Chris’ main point is that dots overlap when multiple cases share the same point. Some other commenters have stated that this can be adjusted for[…]
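To make the overlap point concrete, here is a minimal sketch in Python. It is not Chris's code. The function names, the sample points, and the hexagon size are all made up for illustration, and the cell assignment uses axial-coordinate rounding on pointy-top hexagons, which is just one common way to do it. The idea is that a scatterplot can only draw one visible mark per distinct position, while a hexagon bin keeps the count.

```python
import math
from collections import Counter


def hex_cell(x, y, size=1.0):
    """Return axial (q, r) coordinates of the pointy-top hexagon containing (x, y).

    `size` is the distance from a hexagon's center to one of its corners.
    """
    # Convert the point to fractional axial coordinates.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size

    # Round in cube coordinates, where the three components must sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)

    # Recompute the component with the largest rounding error from the other two.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)


def hex_bin(points, size=1.0):
    """Count how many points fall into each hexagon."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Hypothetical data: two cases share the exact same point, one sits nearby.
points = [(0.0, 0.0), (0.0, 0.0), (0.1, 0.1), (10.0, 10.0)]

distinct_positions = len(set(points))  # marks a scatterplot can actually show
bins = hex_bin(points)                 # counts a hexagon-binned plot would shade by

print(distinct_positions, dict(bins))
```

The four cases collapse to three distinct dots on a scatterplot, and two of those dots nearly touch. The binned version reports three cases in one hexagon and one in another, so the density survives.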

Excel continues to be THE major tool in analytics. It shouldn’t be. Excel: does not scale beyond a single computer, and frequently fails to load very large data sets; does not contain separate model, controller, and viewer modules unless completely forced; and allows human beings to make too many big mistakes, introducing too much error. On the other hand, Excel: is easy to use; creates pretty charts that, with effort, can be dragged into PowerPoint presentations; is shareable; is cheap; and is fast (compared to building something accurate or scalable). There are broader problems with Excel spreadsheets, namely that they: are prone to complexity creep; engender disrespect (after all, it’s just pizza and spreadsheets, derp); and are generally not easily importable into statistical software for analysis[…]

Wolfram announced SystemModeler yesterday. You can read the post here. You can see the costs here. Pros: it’s integrated right into the predictive stack; it looks a hell of a lot prettier than all the other system modelers out there on the market; it’s priced right for students. Cons: 99% of the potential users won’t use it because they have to take a course; it’s priced well outside of the innovator-technologist range; it looks complicated. I predict a whole sequence of ‘game changer’ posts is about to overtake us all. It isn’t a game changer. But that’ll make for some pretty good title-muppeting. The product is a great extension for a pretty good stack. It looks cool. It may be of interest to many[…]

Can I ask what you think of #measure, which is the main analytics channel on Twitter? Are you happy with what that channel has become? *** I’m Christopher Berry. I’m taking refuge over at #msure. I tweet about analytics @cjpberry. I write at christopherberry.ca

On occasion, I use data generated from surveys to experience some empathy with groups of people whom I don’t frequently encounter and interview about their realities. A beautiful, publicly available data set is the Pew Internet and American Life Project’s August 2011 Apps and Adult SNS Climate data set. You can access the data set here. Thank you, Pew. You’ll see the tables as I see them. I’m using column percentages. Note that there’s a confidence interval on either side of those percentages. If you don’t understand what a confidence interval is, don’t tweet about the figures you’re seeing. Don’t quote them. We’ll get to that in a bit. (In general, you don’t use tables to communicate with general audiences. Since[…]
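For readers who want a feel for what that interval around a percentage means, here is a rough sketch in Python. It uses the textbook normal-approximation (Wald) interval for a proportion and assumes a simple random sample. Pew's published figures rely on weighting, so the real margins will differ. The function name and the counts below are invented for illustration and are not taken from the data set.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Approximate confidence interval for a proportion (Wald interval).

    z=1.96 corresponds to roughly 95% confidence under the normal approximation.
    Assumes a simple random sample; it ignores survey weights and design effects.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical cell: 50 of 100 respondents in a column gave a particular answer.
low, high = proportion_ci(50, 100)
print(f"{low:.1%} to {high:.1%}")
```

With only 100 respondents in a column, a reported 50% could plausibly sit anywhere within about ten points either way, which is why small subgroup percentages deserve caution before anybody quotes them.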

It’s awesome to watch Pinterest grow. A post at High Scalability reveals just how much they’ve grown. TL;DR: 80 million objects stored in S3 with 410 terabytes of user data, 10x what they had in August; EC2 instances have grown by 3x; 150 EC2 instances in the web tier; 90 instances for in-memory caching, which removes database load. And a few notes about technology that caused a smile: written in Python and Django; Hadoop-based Elastic Map Reduce is used for data analysis and costs only a few hundred dollars a month. One of the fastest growing sites in history. Cites AWS for making it possible to handle 18 million visitors in March, a 50% increase from the previous month, with[…]

It’s the Victoria Day long weekend in most parts of Canada, and, whereas our colleagues in Moncton and Halifax will be working, many analytics practitioners in Toronto and Vancouver will be playing German board games and drinking beer. One of the most popular of the German board games is “The Settlers of Catan”. If you know the game, skip ahead to the next bold title. If you don’t know the game, read on for a painless summary. Here’s the TL;DR Summary (Too Long; Didn’t Read): The objective of the game is to win 10 victory points before anybody else does. (Keeps everybody in) You earn two victory points for every city you build, 2 for having the longest road, 2[…]

Consider the following two distilled points of view: Statement 1: “Big Data Analytics is going to change the way we do business. Sure, a lot of it will be routine ‘I’m okay!’ status updates from sensors, but making sense of the key parts of it, like ‘help me, I’m failing’, will be extremely useful. Companies that were previously exempt from competing on analytics will be disrupted by new entrants who will compete better, either by being more effective or being more efficient. Big Data Analytics is already having a disruptive impact in marketing, where it never did before, and is gaining huge traction in medicine. There is reason to believe that Big Data Analytics will cause better decision making[…]

Is the assumption that better data causes better decisions credible? It depends. If evidence contrary to an individual’s aspiration comes to light, and that individual refuses to update their expectations or aspirations, then even the most pristine, accurate, precise and real-time data will fail to change their mind. If evidence contrary to an individual’s aspiration comes to light, and that individual updates their expectations or aspirations accordingly, then the data will be effective at changing their mind. The key element that decides the effectiveness of data is the human. Great data can cause great managers to make better decisions. Great data doesn’t cure ignorance. Maybe the broader commentary on the value of Big Data has more[…]

Where does the assumption that better data causes better decisions come from? I don’t know for sure. But I can point to two possible sources: the Enlightenment and the scientific revolution, and Robert McNamara and the Whiz Kid movement. The entire scientific method is predicated on data. A hypothesis is either accepted as truth or rejected as false based on the data. There is no other arbiter. Faith or strength of opinion has nothing to do with it. The data decides. The assumption that greater knowledge causes greater outcomes flows from that fact. That is, if you’re honest about being wrong, everybody benefits. (Disturbingly, that trend may be reversing in the West, as negative findings have been disappearing from most disciplines.)[…]