I’m working on another 5-part blog post series on “How individuals decide”. I’ve hit a snag. And it’s a bad one. It has to do with triggers of search. There are reasons why people ask for evidence the way they do, and for their subsequent reactions to follow-up questions. For instance: Questioner: “I need to know how, of how many people visited the Vegan Microsite, who also saw my tweet about Chicken two months later.” Alright. So, there’s obviously a reason why the questioner is asking the question. And it’s a pretty strange one from the outset. What do they mean by ‘saw’? Two months from when? Cause and effect appear to be really messed up from the way I[…]

First, a thread of thought. Second, a brief exhortation. Summary: Chris Broadfoot showed some pretty amazing visualizations he had created using some open data. Mark Hahnel showed Figshare, which aims to help academics make their data open and available. He’s a big part of the open data movement. Flip Kromer, CTO of Infochimps, built the core technology that made sharing that data set possible, earlier on in the day. (And he kicked my ass in a German boardgame). Why I’m optimistic: What I see in Chris’ work is the opportunity for the public and decision makers to make very well informed decisions about transportation policy. Relevant. What I see in Mark’s work is the opportunity for others to, with greater ease, replicate[…]

eMetrics San Francisco is this week, and #measure can expect the usual volume of hashtags and quotes. For those of us at home or in the office, the flow can be pretty annoying. That torrent creates a fairly warped view of what’s really going on. eMetrics is far more than the witty one-liners delivered in a BIG way in REAL TIME. There’s a lot of substantive material. A few questions to ask yourself: What is the definition of Big Data? What is the definition of Real Time? Can either help me win? Analysts aren’t alone in feeling like there’s too much data coming at them. Is more really better? More data might not be the right answer 80% of[…]

You may have recently clicked a link leading to this paper by Robert Ghrist on Barcodes. You may have also read a previous post about MINE. And finally, this month I talked about histograms and proceeded to subject you to the importance of seeing the data, again and again and again. TL;DR: Seeing the data helps analysts understand the data. Showing the data alone isn’t explaining the data. The first question, in response to seeing a line on a chart, is “why?” Sure, if the line is going up, I caused that. If it’s going down, that’s the weather’s fault. Fine. Those are great, convenient guesses. It’s much harder to assert that a relationship between two things really exists.[…]

There are really big problems in education, health, and energy that could benefit from advanced machine learning techniques made possible by the suite of so-called big data technologies. Why is it possible to solve them now? Why aren’t they solved yet? It’s because technologies for distributed storage and processing, pioneered and open-sourced by companies like Google, are available. Distributed computing systems, like but not restricted to ‘the cloud’, have brought down the cost of such operations. Finally, enough people have spent enough time trying, failing, and succeeding, to be able to use those technologies successfully. In other words, there are very good physical technologies and social technologies that are now in place. Ontario has huge, centralized repositories of very[…]

I’m taking it easy on the Twitter account, so here are the nuggets. On Web Mining: The web is an infinite series of edge cases. The robots.txt file is not a terms-of-use document. Scraping should be done ethically: respect the robots.txt, respect the site’s rate limits, and be transparent about who you are and why you’re taking data from them. I was certainly knowledgeable about the practical difficulties of mining the web. I have a much greater appreciation for just how tough mining the web is coming out of this session. The fact that there are so many edge cases makes writing defensive code extremely difficult. Predictive Models: The complexity of the model needs only to be proportional to the[…]
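The scraping etiquette from the Web Mining nuggets (respect the robots.txt, respect rate limits, say who you are) can be sketched roughly like this. It’s a minimal sketch, not anything shown in the session: it uses Python’s standard `urllib.robotparser`, and the user agent string, the sample robots.txt, the `example.com` URLs, and the one-second fallback delay are all made-up assumptions. It also parses a robots.txt you already have as text, so nothing here touches the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a real crawler would name itself and give a contact.
USER_AGENT = "example-research-bot/0.1 (+mailto:you@example.com)"

# Made-up robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots.txt lets our user agent fetch."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def polite_delay(parser: RobotFileParser, default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if it
    declares one, otherwise an assumed conservative default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS)
    candidates = [
        "https://example.com/blog/post",
        "https://example.com/private/report",
    ]
    for url in allowed_urls(parser, candidates):
        print(f"would fetch {url}, then wait {polite_delay(parser)}s")
```

A real crawler would send `USER_AGENT` in its request headers and sleep for `polite_delay(parser)` between fetches. Even then, passing a robots.txt check only says what the site asks of crawlers; it is not permission under the site’s terms of use.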

Open data and surveillance are related. Surveillance means to watch over. We usually think of surveillance as not being consensual, but it can be. For instance: The surveillance society is most obvious in Great Britain, with its CCTV network. There have been rows in Canada between citizens and governments. Toronto’s transit authority, the TTC, installed cameras in streetcars. Toronto citizens then turned their camera phones on TTC workers – sleeping and engaging in other bad behavior. There have been recent troubles in Ottawa about warrantless Internet wiretapping. Publicly available information about the minister responsible was published in direct response. Surveillance is traditionally thought of as a government policy instrument used to protect various groups, including itself, from threats. Surveillance is[…]

Last night, some 400 others and I attended the Big Data Camp, held the night before the #strataconf kicks off. I saw: The Bay Area’s version of Matt Milan run an unconference with 400 participants, and manage it very, very well. A group of engineers argue that the definition of real time varies by the application, and that the issue of time-data reconciliation across the nodes of a cluster, the universal clock problem, could be avoided by building systems that avoid imposing a single definition of time altogether. Justin from BigML use one of his beautiful predictive models. I heard: Meaningful jargon. Words meant what I knew them to mean. Repeated distinctions between transactional data and sensor data. (For some reason?) No[…]

Sam Ladner, one of the best minds in real, actual, consumer ethnographic research, and author of one of the most influential theses on how we think of time in digital agencies, wrote an excellent comment on a previous post, “Business Intelligence is not Data Science”. Sam wrote: “Curious and curiouser! You have taken up the sword “of the customer,” just as many other disciplines have claimed to do in the past (e.g., design, marketing, market research, “social” business). I find it quite interesting that everyone is clamouring all over each other to claim that no REALLY they CARE about the “customer” and those other disciplines do not. What is going on here? Why don’t we just say that capitalism has[…]

I’m doing a presentation on the Data Science of Marketing Analytics at the Strata Conference. It’ll be on Wednesday, February 29th, at 4:50pm in Mission City B4. The content has been optimized for a 4:50pm time slot. I’m telling a real story about a real data science project. It’s the newest presentation out there. I know, because I just invented it.