I’m taking it easy on the Twitter account – so here are the nuggets: On Web Mining: The web is an infinite series of edge cases. The robots.txt is not a terms-of-use document. Scraping should be done ethically: respect the robots.txt, respect the site’s rate limits, and be transparent about who you are and why you’re taking data from them. I was already knowledgeable about the practical difficulties of mining the web. Coming out of this session, I have a much greater appreciation for just how tough it is. The fact that there are so many edge cases makes writing defensive code extremely difficult. Predictive Models: The complexity of the model needs only to be proportional to the[…]
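The scraping etiquette above (respect the robots.txt, respect rate limits, say who you are) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration under assumptions: the robots.txt content, the user agent string, and the helper names `build_parser` and `polite_plan` are all hypothetical, and nothing is fetched over the network.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical user agent: it states who we are and how to reach us.
USER_AGENT = "example-research-bot (contact: owner@example.com)"


def build_parser(robots_txt):
    """Build a robots.txt parser from text instead of a live URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) without fetching anything."""
    delay = parser.crawl_delay(USER_AGENT)
    if delay is None:
        # No Crawl-delay given: fall back to our own conservative pause.
        delay = default_delay
    allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
    return allowed, float(delay)


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/blog/post.html",
    "https://example.com/private/report.html",
]
allowed, delay = polite_plan(parser, candidates)
print(allowed, delay)
```

A real crawler would sleep for `delay` seconds between requests and send `USER_AGENT` in its request headers. Those parts are left out here because they depend on the HTTP client in use.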

Open data and surveillance are related. Surveillance means to watch over. We usually think of surveillance as not being consensual, but it can be. For instance: The surveillance society is most obvious in Great Britain, with its CCTV network. There have been rows in Canada between citizens and governments. Toronto’s transit authority, the TTC, installed cameras in streetcars. Toronto citizens then turned their camera phones on TTC workers – sleeping and engaging in other bad behavior. There have been recent troubles in Ottawa about warrantless Internet wiretapping. Publicly available data about the minister responsible was published in direct response. Surveillance is traditionally thought of as a government policy instrument used to protect various groups, including itself, from threats. Surveillance is[…]

Last night, some 400 others and I attended the Big Data Camp, held the night before the #strataconf kicks off. I saw: The Bay Area’s version of Matt Milan run an unconference with 400 participants, and manage it very, very well. A group of engineers argue that the definition of real time varied by the application, and that the issue of time-data reconciliation across cluster nodes, the universal clock problem, could be avoided by building systems that avoid imposing a single definition of time altogether. Justin from BigML use one of his beautiful predictive models. I heard: Meaningful jargon. Words meant what I knew them to mean. Repeated distinctions between transactional data and sensor data. (For some reason?) No[…]

Sam Ladner, one of the best minds in real, actual, consumer ethnographic research, and author of one of the most influential theses on how we think of time in digital agencies, wrote an excellent comment on a previous post, “Business Intelligence is not Data Science”. Sam wrote: “Curious and curiouser! You have taken up the sword “of the customer,” just as many other disciplines have claimed to do in the past (e.g., design, marketing, market research, “social” business). I find it quite interesting that everyone is clamouring all over each other to claim that no REALLY they CARE about the “customer” and those other disciplines do not. What is going on here? Why don’t we just say that capitalism has[…]

I’m doing a presentation on the Data Science of Marketing Analytics at the Strata Conference. It’ll be on Wednesday, February 29th, at 4:50pm in Mission City B4. The content has been optimized for a 4:50pm time slot. I’m telling a real story about a real data science project. It’s the newest presentation out there. I know, I just invented it.

It’s very easy to be cynical about new ideas, especially when they’ve been previously hyped and previously failed. Ideas fail. Statistically, failure is the norm. I’ve been asking myself the question: “What’s different today that might make yesterday’s fad become sustainable?” There are three broad analytical areas that are ripe for re-discovery and a fresh round of hype: Splimes. Augmented Reality – GIS. Website Morphing. Reasons for skepticism: I don’t want my refrigerator to tweet when it’s empty. I don’t want to give brands yet another channel to spam me with coupons. I find the Internet hard enough to use; I don’t need my favourite sites changing all the time. What’s different now: I want to make things that are[…]

Business Intelligence is not Data Science. There are a lot of ‘yeah but’ statements emanating from some in the BI community. TL;DR summary: Yeah but, it’s all about driving business insights from the data! Yeah but, Data Science still uses all the same BI tools we use! Yeah but, Data Science is really just what BI was years ago! A perspective: No. BI is about using asymmetrical information advantage to extract surplus from customers. Data Science is discovering Pareto optima between the customer and the business. No. Data Science is not religious about toolsets. No. Data Scientists have seen what went wrong with BI. Meeting the same fate would be a failure. What I stand for as a Data Scientist:[…]

Have you ever heard anybody use the sentence: “The problem with that model is that it overfits the data.” Ever wonder what that means? The purpose of science is to use knowledge to make good predictions about the future. To do so, you use theories which inform models. Models are deliberate simplifications of the world which make explicit statements about the direction of the arrow of causality, and are judged to be useful only if the assumptions are actually good. A good model makes accurate predictions about the future. That supposes that the assumptions which underpin the model are the best proxies for how nature actually works. [Data scientists: If you have a problem with what I wrote here, leave[…]
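A tiny sketch of what overfitting looks like in practice, using only the standard library. Everything here is made up for illustration: the underlying trend (y = 2x), the alternating +1/-1 noise, and the two toy models (a straight-line fit and a "memorizer" that repeats the nearest training answer). The memorizer matches the training data perfectly, noise included, and then predicts worse on new points.

```python
def true_trend(x):
    """The assumed underlying relationship we are trying to learn."""
    return 2.0 * x


# Training data: the trend plus deterministic "noise" of +1 or -1.
train_x = [float(i) for i in range(10)]
train_y = [true_trend(x) + (1.0 if i % 2 == 0 else -1.0)
           for i, x in enumerate(train_x)]

# New points the models have never seen, taken from the trend itself.
test_x = [i + 0.25 for i in range(10)]
test_y = [true_trend(x) for x in test_x]


def fit_line(xs, ys):
    """Simple model: ordinary least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Overly flexible model: repeat the answer of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

print("line:      train", mse(line, train_x, train_y), "test", mse(line, test_x, test_y))
print("memorizer: train", mse(memorizer, train_x, train_y), "test", mse(memorizer, test_x, test_y))
```

The memorizer's training error is zero because it has learned the noise along with the signal. That is the trap: the model that looks best on the data you already have is not necessarily the one that predicts the future best.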

Yesterday I concluded that “Existing theoretical frameworks assume too much, and demand too much cognition by the end user.” The opposite of asking you to think about linear regression or support vector machines is Netflix. Netflix uses a machine algorithm to suggest movies that you might like. They do this using a few sources: When you first sign up, Netflix asks a few questions about you. They have a prior viewing history of all their subscribers before you, who also answered a few questions about themselves. You tell them what you like by watching various movies and shows. You tell them more by rating them on a five-star rating system. By comparing your tastes to other people like[…]
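The general idea of comparing your tastes to other people's can be sketched as a toy user-based recommender. To be clear, this is not Netflix's actual algorithm: the viewers, titles, star ratings, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math

# Hypothetical five-star ratings; names and titles are made up.
ratings = {
    "ana": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "ben": {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "cat": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1, "Eraserhead": 2},
}


def similarity(a, b):
    """Cosine similarity over the titles both viewers have rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dot = sum(ratings[a][t] * ratings[b][t] for t in shared)
    norm_a = math.sqrt(sum(ratings[a][t] ** 2 for t in shared))
    norm_b = math.sqrt(sum(ratings[b][t] ** 2 for t in shared))
    return dot / (norm_a * norm_b)


def recommend(user):
    """Rank unseen titles by the similarity-weighted average of others' ratings."""
    weighted_sum = {}
    weight_total = {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for title, stars in ratings[other].items():
            if title in ratings[user]:
                continue  # skip titles the user has already rated
            weighted_sum[title] = weighted_sum.get(title, 0.0) + sim * stars
            weight_total[title] = weight_total.get(title, 0.0) + sim
    scores = {t: weighted_sum[t] / weight_total[t]
              for t in weighted_sum if weight_total[t] > 0}
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("ana"))
```

Here "ana" rates films much like "ben" does and unlike "cat", so ben's enthusiasm for a title ana has not seen counts for more than cat's dislike of it.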

James March explains that making a decision involves understanding alternatives, forming expectations about what’s likely to happen, thinking about your preferences in terms of your wants, fears, hopes, dreams in relation to those expectations, and then making a choice. That explanation really resonates. So we’re going to use it here. There’s an assumption that choice amongst alternatives is cut and dried. It isn’t. Choice is a form of knowledge – specifically: There are choices that you know you know. There are choices that you know you don’t know about. And there are choices you don’t know you don’t know. Choices themselves aren’t even really binary. There’s significant ambiguity as to what a choice really means. How many times have you heard[…]