Deriving structure from subjective, unstructured data through straight content analysis with crowdsourcing may suffer from unintended methodological bias, owing to variation in competency (or even polarity) within crowds.

Question: Which late night comedian monologue is funnier and why?

• Assume a corpus of 5,000 timestamped audio* quotes drawn from episodes of Craig Ferguson, Conan, Jay Leno, Jimmy Kimmel, and Letterman. (Assumption: the quotes are drawn randomly from the most recent year, and a random selection is representative of the total performance.)
• Assume the dependent variable is ‘funny’.
• Assume the independent variables are show, day, time-into-show, duration-of-clip, sight-gag, and pause duration.

Machines are excellent at analyzing audio files, and all of those independent variables can be coded without methodological bias. The machines work.

What of the concept ‘funny’?

A machine can understand ideas in physics pretty well, but it doesn’t know funny. That concept has to be coded by humans, and maybe those human codes can be used to supervise a machine later.

So, we program Mechanical Turk tasks and pay ~30,000 humans to each listen to a subset of ~25 audio quotes and rate what they hear on a funniness scale from 1 to 11.

Take the average score of the 1,000 quotes per comedian and you’ll find out who’s funnier.
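That averaging step can be sketched in a few lines. The ratings below are entirely made up for illustration; in the real design each comedian would have ~1,000 quote-level average scores rather than five.

```python
from statistics import mean

# Hypothetical crowd ratings (1-11 scale), keyed by comedian.
# Real data would hold ~1,000 quote-level scores per show.
ratings = {
    "Ferguson":  [7, 9, 6, 8, 7],
    "Conan":     [8, 6, 7, 7, 8],
    "Leno":      [5, 6, 4, 6, 5],
    "Kimmel":    [6, 7, 6, 8, 6],
    "Letterman": [7, 8, 7, 6, 8],
}

# Average the quote scores per comedian, then rank high to low.
averages = {show: mean(scores) for show, scores in ratings.items()}
ranking = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

for show, avg in ranking:
    print(f"{show}: {avg:.2f}")
```

With these invented numbers, whichever comedian tops `ranking` is “funnier” by the crowd’s lights — which is exactly the claim the rest of the post questions.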

To find out why, take the independent variables, run statistical tests, frick around with R for half an hour to get alpha to run, check for problems (is Craig penalized for airing disproportionately on Friday, a day when he’s really not funny?), and explain part of the reason why some monologues are funnier than others.
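The “why” step is, at bottom, a regression of the funny score on the coded variables. Here is a minimal ordinary-least-squares sketch in Python rather than R, on simulated data with made-up effect sizes (a pause bonus, a Friday penalty); the variable names and effects are assumptions, not findings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical quote-level rows

# Hypothetical coded independent variables per quote.
friday = rng.integers(0, 2, n)           # aired on a Friday?
time_into_show = rng.uniform(0, 30, n)   # minutes into the show
pause = rng.uniform(0, 3, n)             # pause duration in seconds

# Simulated 'funny' score: pauses help, Fridays hurt (invented effects).
funny = 6 + 0.8 * pause - 1.0 * friday + rng.normal(0, 0.5, n)

# Design matrix with an intercept column; fit OLS by least squares.
X = np.column_stack([np.ones(n), friday, time_into_show, pause])
coef, *_ = np.linalg.lstsq(X, funny, rcond=None)

for name, b in zip(["intercept", "friday", "time_into_show", "pause"], coef):
    print(f"{name:>15}: {b:+.2f}")
```

If the fitted Friday coefficient comes out negative, that is the “Craig is penalized for airing on Fridays” problem showing up in the estimates.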

Problem?

Is popular opinion of what’s funny right?

If we were to augment the dataset by asking the age/sex/location of those doing the scoring, we may discover that different jokes are funny to different subsets of people. In fact, we may discover that raters in non-Anglo-American countries find different comedians far funnier.
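Checking for that is a simple group-and-average over demographic cells. The records below are invented, with region standing in for the full age/sex/location augmentation:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rater-level records: (comedian, rater_region, score).
records = [
    ("Ferguson", "US", 8), ("Ferguson", "UK", 9),
    ("Leno", "US", 7),     ("Leno", "UK", 4),
    ("Conan", "US", 7),    ("Conan", "UK", 6),
    ("Ferguson", "US", 7), ("Leno", "UK", 5),
]

# Group scores by (comedian, region) and average each cell.
cells = defaultdict(list)
for show, region, score in records:
    cells[(show, region)].append(score)

for key in sorted(cells):
    print(key, round(mean(cells[key]), 2))
```

When the cell averages diverge sharply by region, a single pooled “who’s funnier” number is papering over real disagreement.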

We might also decide to set up an expert panel of comedians to do some scoring, and then run a check on the comparative competence of the crowd and its various segments. We may find that some segments would be judged incompetent. (Is that fair? Or is it that some people simply have different senses of humor?)
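One crude version of that competence check is correlating a segment’s per-quote averages against the expert panel’s. The scores below are made up; the point is only the mechanic:

```python
import numpy as np

# Hypothetical per-quote average scores from the crowd vs. an expert panel.
crowd  = np.array([7.1, 6.4, 8.0, 5.2, 6.8, 7.5, 4.9, 6.0])
expert = np.array([6.5, 6.9, 7.8, 4.1, 7.0, 7.9, 5.5, 6.2])

# Pearson correlation as a rough agreement measure: a low value for a
# rater segment could mean incompetence -- or a different sense of humor.
r = np.corrcoef(crowd, expert)[0, 1]
print(f"crowd vs expert correlation: {r:.2f}")
```

The number itself can’t distinguish “incompetent” from “differently humored” — that interpretive choice is exactly where the fingerprints get left.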

What do you think?

Is popular opinion the best proxy?

So what?

The cost of using crowds of people to systematically structure the unstructured is all the dirty fingerprints we leave behind as people.

We like to think that if there are a hundred fingerprints, the randomness of people just blends away so that it’s a non-issue. Let’s be skeptical of that. Let’s be aware that sometimes people have been eating Cheetos, and they’re leaving very dirty fingerprints on the data as they structure it.

If you’re aware of potential bias, you can take steps to manage it.

(*Note: It would be cool to program a machine to recognize facial expressions and body language in video and code those too as explanations. That said, I don’t know just how flexible surveillance software is in terms of recoding for performances. The reason for excluding Jon Stewart and Colbert is that their opening monologues make extensive use of sight gags – the other five do not rely on them as much. Recognition, including comedian identification, is a problem. It’s a big digression that I chose to leave out of the main body. But, to address several excellent data scientists, computer vision has a major contribution to make.)

***

I’m Christopher Berry.