Big Data Week Toronto; Big Data long in n and m

Posted on April 21, 2013 by Christopher Berry

This is Big Data Week in Toronto. On Monday, I’ll be delivering a case study on the business value of data, using a rather small but beautifully complex dataset.

Big Data has now just become a marketing term. Those who have put in the effort, and read the three or four HBR articles on the subject, know more than 80% of the population. If you’ve read up on some of the applications involved, you’re ahead of 95%. If you read this, you’re ahead of 99.99% of the population. So, there’s an incentive to read on.

What is Big Data?

A good definition of Big Data is anything that is generally too big to fit in the memory of a single computer. There’s one set of skills involved in working from your desktop, in an environment with maybe a few terabytes of hard disk storage and a few gigabytes of random access memory (RAM), and there’s another set of skills involved in working with a cluster of computers. Neither is better or worse. They’re just different skills. Ego complicates matters quite a bit.

There’s considerable ego involved in stacking up the different sizes of data. Amazon has far more data than Netflix. Google and Facebook are giants in their own right. Walmart, AT&T, and several levels of government have datasets that fall into the Big Data category. And it generally follows a power law from there. Where you work colors what you see as Big Data. If you’re in the top 1%, you don’t consider 99% of the firms out there to have Big Data. If you’re in the top 5%, you don’t consider 95% of the firms to have Big Data. If you’re in the top 20%, you don’t consider 80% of the firms out there to have Big Data. And so on. Even the people with small data see big opportunity in it.

n and m’s

Data can be big in two dimensions. There are dozens of other dimensions, a few of which I believe are salient to decision making. When it comes to the analysis of data, the two dimensions that matter most are n and m.

Big-n datasets are those that contain loads of observations. There are several billion people on Earth. A table that large could be processed on a single laptop. It wouldn’t be pleasant, but it could be processed. If you had the first name of every single person on Earth, you’d have a list ~6.98 billion in length. Technically, I’d say that n is 6.98 billion.

There are about 3.15569e7 seconds in a year. So, a single sensor that returns a ‘1’ every second, indicating that it’s turned on, will generate a table that is 3.15569e7 rows long over one year.
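The arithmetic behind that figure can be sketched in a few lines of Python. This is only a back-of-the-envelope check: the year length of 365.2422 days, the one-byte-per-reading storage cost, and the fleet of 1,000 sensors are all assumptions made for illustration, not figures from a real deployment.

```python
# Rough row count for one sensor emitting one reading per second for a year.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
DAYS_PER_YEAR = 365.2422         # assumption: mean tropical year

rows_per_year = SECONDS_PER_DAY * DAYS_PER_YEAR

# Assumption for illustration: each reading is stored as a single byte.
BYTES_PER_READING = 1
megabytes_per_year = rows_per_year * BYTES_PER_READING / 1e6

# Hypothetical fleet size, to show how quickly n grows.
SENSOR_COUNT = 1_000
fleet_rows_per_year = rows_per_year * SENSOR_COUNT

print(f"rows per sensor per year: {rows_per_year:.5e}")
print(f"approx. size at 1 byte per reading: {megabytes_per_year:.1f} MB")
print(f"rows for {SENSOR_COUNT} sensors: {fleet_rows_per_year:.3e}")
```

One sensor is small by itself. Multiply it by a fleet and n gets large in a hurry.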

Data can grow to very large n, very quickly.

The number of columns, or features describing each observation, in that dataset is called m. So, if you had the first name and date of birth of every person on Earth, you’d have a dataset that’s of size 2 in m, and of size ~6.98 billion in n.
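To get a feel for what an n of ~6.98 billion and an m of 2 means on a single machine, here is a rough size estimate. The byte widths per column and the 16 GB laptop are assumptions picked for illustration. Real storage depends heavily on encoding and file format.

```python
# Back-of-the-envelope size of an n x m table of first names and birth dates.
N_ROWS = 6_980_000_000   # n: one row per person, the ~6.98 billion figure above
M_COLUMNS = 2            # m: first name, date of birth

# Assumptions for illustration only.
BYTES_PER_NAME = 8       # short, fixed-width first name
BYTES_PER_DATE = 4       # date packed into a 32-bit integer
LAPTOP_RAM_GB = 16       # hypothetical laptop

cells = N_ROWS * M_COLUMNS
total_bytes = N_ROWS * (BYTES_PER_NAME + BYTES_PER_DATE)
total_gb = total_bytes / 1e9
fits_in_ram = total_gb <= LAPTOP_RAM_GB

print(f"cells: {cells:,}")
print(f"approx. size: {total_gb:.2f} GB")
print(f"fits in {LAPTOP_RAM_GB} GB of RAM: {fits_in_ram}")
```

Under these assumptions the table would not fit in memory but would fit on a laptop's disk, so it would have to be streamed through in chunks: possible, but not pleasant.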

One of the amazing things about m is just how responsive it can be to imagination. What can I do with an m of 2, after all? I can create a whole bunch more m out of it. I can separate out the birth year. I can match that birth year with the year of the zodiac. I can match the birthdate to the Zodiac. I can run some language identification against the name, and estimate what language it’s from. A lot can be done with just two columns of data. In fact, m can expand to infinity.

It might not be useful for most problems. But, it’s possible.
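Here is a minimal sketch of that kind of expansion, using only the Python standard library. The record, the column names, and the `expand` helper are all made up for illustration. The zodiac lookups are approximations: sign boundaries shift by a day in some years, and the animal year here ignores the lunar new year. Language identification is left out because it would need an external library.

```python
from datetime import date

# Approximate Western zodiac: (month, day) on which each sign ends.
SIGN_END_DATES = [
    (1, 19, "Capricorn"), (2, 18, "Aquarius"), (3, 20, "Pisces"),
    (4, 19, "Aries"), (5, 20, "Taurus"), (6, 20, "Gemini"),
    (7, 22, "Cancer"), (8, 22, "Leo"), (9, 22, "Virgo"),
    (10, 22, "Libra"), (11, 21, "Scorpio"), (12, 21, "Sagittarius"),
]

ZODIAC_ANIMALS = ["Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
                  "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"]


def zodiac_sign(born: date) -> str:
    """Return the first sign whose end date is on or after the birthday."""
    for month, day, sign in SIGN_END_DATES:
        if (born.month, born.day) <= (month, day):
            return sign
    return "Capricorn"  # Dec 22 to Dec 31 wraps around to Capricorn


def zodiac_animal(born: date) -> str:
    """Simplification: uses the calendar year, not the lunar new year."""
    return ZODIAC_ANIMALS[(born.year - 4) % 12]


def expand(row: dict) -> dict:
    """Derive extra columns from first_name and date_of_birth."""
    born = row["date_of_birth"]
    return {
        **row,
        "birth_year": born.year,
        "birth_month": born.month,
        "zodiac_sign": zodiac_sign(born),
        "zodiac_animal": zodiac_animal(born),
        "name_length": len(row["first_name"]),
    }


row = {"first_name": "Ada", "date_of_birth": date(1984, 4, 21)}  # made-up record
wide = expand(row)
print(len(row), "columns in,", len(wide), "columns out")
print(wide)
```

Each derived column is a function of the original two, so no new information has been collected. Whether any of these columns is useful depends entirely on the question being asked.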

Strength in Unity; Augmenting m

Some firms have very large n. They have several sets of very long n and fairly small m. Sometimes, melding some nxm sets with other nxm sets doesn’t work. Sometimes it takes a lot of effort. And sometimes, when it does work, amazing things happen.

Big, nasty data that is high in m, but perhaps shallow in n, could still be considered to be Big Data. It just depends on where you stand.

Data that is meaningfully big in m can be, depending on who’s asking the question and what the question is, tremendously powerful.

Combining and recombining nxm datasets can be tremendously processor-intensive, yet have just as much potential as big-n datasets.
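As a toy illustration of augmenting m by combining two nxm sets, here is a sketch of an inner join on a shared key, written with plain Python dictionaries. The two tables, the `customer_id` key, and the `join_on_key` helper are hypothetical. The sketch also assumes each key appears at most once in the right-hand table.

```python
def join_on_key(left, right, key):
    """Inner-join two lists of dict rows on a shared key column.

    Assumes each key value appears at most once in `right`.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:   # rows with no partner are dropped
            joined.append({**row, **match})
    return joined


# Two made-up datasets, each small in m.
purchases = [
    {"customer_id": 1, "total_spend": 120.0},
    {"customer_id": 2, "total_spend": 35.5},
    {"customer_id": 3, "total_spend": 980.0},
]
support = [
    {"customer_id": 1, "tickets_opened": 0},
    {"customer_id": 3, "tickets_opened": 4},
    {"customer_id": 4, "tickets_opened": 1},
]

combined = join_on_key(purchases, support, "customer_id")
print(f"n: {len(purchases)} -> {len(combined)}")
print(f"m: {len(purchases[0])} -> {len(combined[0])}")
```

The combined table is wider in m but shorter in n, because only keys present in both tables survive. That trade-off, plus the work of finding a reliable shared key in the first place, is a big part of why combining datasets takes effort.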

Big data may just be a marketing term; but we are realizing the potential

There’s still so much actual potential left in the activation of data, regardless of the hype. There is real value there.

Instead of turning off in response to the hype, maybe a better course of action is to share more stories about it.

If you’re in Toronto and coming out, have a great time.