Some very good progress on what a Data Scientist is, and isn’t. @neilraden and @teddy777 have contributed; here is where it started, and where we’re at now. Some people say: The definition of a scientist is somebody who does original research and publishes in peer-reviewed journals. Most people who call themselves data scientists aren’t actually scientists. Data scientists should be stratified depending on the sophistication of the tools they use. A few points to make: Science is a learning algorithm. If you’re executing the algorithm, then you’re doing science. If you execute the algorithm frequently, then you’re a scientist. Science is what you do. Most people aren’t scientists because they don’t actually use the scientific method. Consider[…]

Sometimes the components of a marketing channel will not add up to the total performance of the marketing channel. This is caused by any number of realities and limitations imposed in part by nature and in part by you, the marketer. Consider the following deliberately simple scenario: March 2012 Impressions: Total Digital Impressions Delivered: 100,000,000; Total Impressions with Chicken Creative: 25,000,000; Total Impressions with Beef Creative: 50,000,000; Total Impressions with Pork Creative: 75,000,000. Something doesn’t make sense. I’m telling you that 100,000,000 impressions were delivered in total, but the components of that figure (25 million, 50 million, and 75 million) don’t actually add up to it. That’s because creative can have multiple attributes. An ad may feature Chicken alone, Beef alone,[…]
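
The arithmetic behind that scenario can be sketched in a few lines of Python. This is only an illustration: the breakdown by creative combination below is invented (the excerpt does not give one) and was chosen so that it reproduces the 100, 25, 50, and 75 million figures quoted above. The names `impressions_by_creative` and `by_attribute` are hypothetical.

```python
from collections import Counter

# Hypothetical breakdown: each ad's attribute set -> impressions delivered.
# These splits are assumptions chosen to match the totals in the scenario.
impressions_by_creative = {
    frozenset({"chicken"}): 5_000_000,
    frozenset({"beef"}): 15_000_000,
    frozenset({"pork"}): 35_000_000,
    frozenset({"chicken", "beef"}): 5_000_000,
    frozenset({"chicken", "pork"}): 10_000_000,
    frozenset({"beef", "pork"}): 25_000_000,
    frozenset({"chicken", "beef", "pork"}): 5_000_000,
}

# Each impression is counted exactly once in the channel total.
total_impressions = sum(impressions_by_creative.values())

# An impression is counted once per attribute its creative carries,
# so multi-attribute ads are counted more than once across attributes.
by_attribute = Counter()
for attributes, impressions in impressions_by_creative.items():
    for attribute in attributes:
        by_attribute[attribute] += impressions

attribute_sum = sum(by_attribute.values())
overcount = attribute_sum - total_impressions

print(f"Total delivered:    {total_impressions:>12,}")
for attribute in ("chicken", "beef", "pork"):
    print(f"With {attribute:<8} ad:   {by_attribute[attribute]:>12,}")
print(f"Sum of attributes:  {attribute_sum:>12,}")
print(f"Counted repeatedly: {overcount:>12,}")
```

Under these assumed splits, the per-attribute figures sum to 150,000,000 against 100,000,000 delivered. The 50,000,000 gap is exactly the impressions counted more than once because their creative carried more than one attribute.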

Some people who rely on their gut argue that data driven decision making causes analysis paralysis. Some people who cause analysis paralysis have very good reasons for appearing to be paralyzed. I can think of three classifications of inquiry that correspond to three levels of information sufficiency. Specifically: Convenient reasoners, those who know what they know, and are looking for evidence to support their case, have enough information when they feel like they have enough compelling evidence for their case, and no more. Those who have a hypothesis will be temporarily satisfied with a firm accept/reject. The very next inquiry will either be based on convenient reasoning, or, another hypothesis. Those who don’t know what they don’t know and have[…]

The 2011 Canadian Election Study is available for download. You can get the file here. (If you don’t have SPSS, you can load it into R using the SPSS import functions.) I invite you to explore it. What is it? An entire generation of Canadian market researchers and pollsters grew up on the Canadian Election Study (CES). And there are a lot of them! There were federal elections in Canada in 1997, 2000, 2004, 2006, 2008, and 2011. That’s 6 elections in 14 years! It generated an incredible amount of publicly available data about political attitudes in Canada. Research uses aside, the CES is used to teach Canadians about electoral behavior. It is among the most studied data sets in[…]

Sucharita Mulpuru is among my favorite people at Forrester. She’s pragmatic, technical, and, in my view, a brilliant forecaster – three key skills and traits to be an effective strategist. Last month she wrote “Why Facebook Is Still A Tough Sell For Retailers”. She was called out as a hater. I don’t think that was fair. The crystallizing quote comes from an interview two weeks later – and you may have seen this article in Bloomberg on February 22. TL;DR: “There was a lot of anticipation that Facebook would turn into a new destination, a store, a place where people would shop,” Mulpuru said in a telephone interview. “But it was like trying to sell stuff to people while they’re[…]

Web analytics uses clickstream data. It’s data that is: Generally Anonymous, Generally Aggregated, and Heavily Abstracted. Most commercial web analytics software abstracts away the raw data with fairly usable interfaces. You’ll be hard pressed to find many people these days who know how to work with server log data. Yet, it’s still possible to segment a population of browsers based on the characteristics of the browser, computer, and reverse geographic lookup. That is to say, I can query, through the software, for IE browsers in Toronto originating from Reddit and for, say, Chrome browsers in New York originating from search. And then I can compare the two segments. If it’s an eCommerce site, I may even be able to[…]
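
To make the segmentation idea concrete, here is a minimal Python sketch. It does not reflect any particular web analytics product: the session records, the field names (`browser`, `city`, `source`, `pageviews`), and the helpers `segment` and `mean_pageviews` are all hypothetical, and real tools expose this through their own interfaces.

```python
# Hypothetical, anonymous session records of the kind a tool might expose.
sessions = [
    {"browser": "IE", "city": "Toronto", "source": "reddit", "pageviews": 3},
    {"browser": "IE", "city": "Toronto", "source": "reddit", "pageviews": 5},
    {"browser": "IE", "city": "Toronto", "source": "search", "pageviews": 2},
    {"browser": "Chrome", "city": "New York", "source": "search", "pageviews": 6},
    {"browser": "Chrome", "city": "New York", "source": "search", "pageviews": 8},
    {"browser": "Chrome", "city": "Toronto", "source": "reddit", "pageviews": 4},
]


def segment(rows, **criteria):
    """Return the rows whose fields match every key=value criterion."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


def mean_pageviews(rows):
    """Average pageviews per session; 0.0 for an empty segment."""
    if not rows:
        return 0.0
    return sum(row["pageviews"] for row in rows) / len(rows)


# Define the two segments described in the text.
ie_toronto_reddit = segment(sessions, browser="IE", city="Toronto", source="reddit")
chrome_ny_search = segment(sessions, browser="Chrome", city="New York", source="search")

# Compare the segments on one (assumed) metric.
difference = mean_pageviews(chrome_ny_search) - mean_pageviews(ie_toronto_reddit)

print("IE / Toronto / Reddit:     ", mean_pageviews(ie_toronto_reddit))
print("Chrome / New York / search:", mean_pageviews(chrome_ny_search))
print("Difference in mean pageviews:", difference)
```

The point is only that each segment is a filter on browser, geography, and referrer, and the comparison is then made on aggregates rather than on individuals.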

I’m a reader of Theory of Reddit. This thread, entitled “Who’s manipulating Reddit and how? Who’s buying votes.” is very interesting. Farshad signed up for a program. He upvoted a link. He’s eligible to get paid 8 cents. Read on. This is all unverified. I’m inclined to believe that Farshad is genuine, owing to the fact that the account has 2 years of tenure and has reddit gold. And so, if it’s a troll, this would be a pretty damn esoteric one. It stands to reason that somebody is executing the experiment. To the author’s credit, he proposes a few ways that a machine could detect paid upvoting, including comment quality and concurrency among recently created accounts. It[…]

Metaphor: A putter is PowerPoint / Excel. A 3 Iron is SPSS or R. A 1 Wood is Python or Octave. Here we go: The putter is used when you’re on the green and trying to get it into the hole. Lots of nuance on the slope. Lots of finesse with arms and angles. High variability. Little repeatability. It doesn’t scale, won’t scale, and doesn’t need to. The 3 Iron has a bit more power. It’s really great for getting onto the green from certain approaches. The wood has a lot of power. It’s really great for driving right down the range. It’s essential on the par 5s. Knowing which tool to use when is important, right? Unless you’re playing mini putt,[…]

A few highlights from eMetrics SF 2012: The Web Analytics Association is now the Digital Analytics Association. Several shareable end-to-end B2B tracking case studies are finally out, including Intuit and Symantec. Special call-out to Michael Parker, who spoke clearly and persuasively during his keynote. Prezi makes for some pretty engaging presentations. Digital Analytics is now a young adult. Advice for when you’re back in the office: You’re standing in front of a golf ball. Every millimeter on where you strike that ball makes a big difference months out. Tee up, and make it a really great swing. Take thirty minutes to integrate your notes into your work checklist / burn list. If you were inspired[…]

There’s a good discussion going on within data science circles. First, a brief background: It has been observed that complex models generated by Machine Learning are frequently more predictive (and developed faster) than the alternative approach of Domain Experts (Subject Matter Experts) generating a model and deploying it in the field. This observation forms a major fault line in science more generally. What happened: An Oxford-Style debate was held at Strata, pitting Machine Learning against Domain Expertise. Machine Learning won. If the problem to be solved is well framed, the machine trumps domain. However, the problem has to be well framed. What it means: Alistair Croll crystallizes the implication in this piece. And rightly eggs us on to go forth[…]