It is possible to derive quantitative measures of subjective concepts from volumes of unstructured data.

  • Some in political science use content analysis to quantify media or message bias.
  • Some in media studies or public relations use a variant of content analysis to measure bias.
  • Some in data science use a crowdsourced variant of content analysis to extract more features from unstructured data.

Let’s have some fun.

In their now-staple paper “Unskilled and Unaware of It: How Difficulties in Recognizing One’s Own Incompetence Lead to Inflated Self-Assessments” (Kruger and Dunning, Journal of Personality and Social Psychology, 1999, 77(6), pp. 1121-1134), the authors use a type of content analysis to quantify what is funny. You should absolutely read it.

“We created a 30-item questionnaire made up of jokes we felt were of varying comedic value. Jokes were taken from Woody Allen (1975), Al Franken (1992), and a book of “really silly” pet jokes by Jeff Rovin (1996). To assess joke quality, we contacted several professional comedians…and asked them to rate each joke on a scale ranging from 1 (not at all funny) to 11 (very funny). Eight comedians responded to our request… . Although the ratings provided by the eight comedians were moderately reliable (alpha = .72), an analysis of interrater correlations found that one (and only one) comedian’s ratings failed to correlate positively with the others (mean r = -0.09). We thus excluded this comedian’s ratings in our calculation of the humor value of each joke, yielding a final alpha of .76. Expert ratings revealed that jokes ranged from the not so funny (e.g. “Question: What is big as a man, but weighs nothing? Answer: His shadow” Mean expert rating = 1.3) to the very funny (e.g. “If a kid asks where rain comes from, I think a cute thing to tell him is “God is crying”. And if he asks why God is crying, another cute thing to tell him is ‘probably because of something you did’”. Mean expert rating = 9.6.)” (p. 1123)

In other words:

  • Get some unstructured data to structure.
  • Get a group of people / experts to rate the data along some dimension.
  • Check for agreement.
  • Throw out the outliers.
  • Average their scores.
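
The five steps above can be sketched in Python. This is a minimal sketch, not Kruger and Dunning's actual procedure: the ratings are made up, the function names are mine, and I am assuming that "alpha" means Cronbach's alpha with the raters treated as the items of the scale, which the quoted passage does not spell out.

```python
from statistics import mean, variance

# Made-up ratings for illustration: 4 hypothetical raters, 5 jokes, 1-11 scale.
ratings = {
    "A": [2, 4, 6, 8, 10],
    "B": [1, 5, 6, 9, 10],
    "C": [3, 4, 7, 7, 11],
    "D": [9, 6, 7, 3, 2],   # a contrarian rater
}

def pearson(x, y):
    """Pearson correlation of two equal-length lists (undefined if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(ratings):
    """Cronbach's alpha, treating each rater as one 'item' of the scale (an assumption)."""
    k = len(ratings)
    rater_vars = [variance(scores) for scores in ratings.values()]
    totals = [sum(col) for col in zip(*ratings.values())]  # summed rating per joke
    return k / (k - 1) * (1 - sum(rater_vars) / variance(totals))

def mean_interrater_r(ratings):
    """For each rater, the mean correlation with every other rater."""
    return {
        name: mean(pearson(scores, other)
                   for other_name, other in ratings.items() if other_name != name)
        for name, scores in ratings.items()
    }

# Check for agreement.
alpha_before = cronbach_alpha(ratings)
mean_r = mean_interrater_r(ratings)

# Throw out the outliers: raters who fail to correlate positively with the rest.
kept = {name: scores for name, scores in ratings.items() if mean_r[name] > 0}
alpha_after = cronbach_alpha(kept)

# Average the remaining scores, joke by joke.
scores = [mean(col) for col in zip(*kept.values())]

print("alpha before:", round(alpha_before, 2))
print("kept raters:", list(kept))
print("alpha after:", round(alpha_after, 2))
print("joke scores:", [round(s, 2) for s in scores])
```

With these made-up numbers, rater D is the only one whose mean correlation with the others is negative, so D is dropped and alpha rises. That is the same pattern the paper reports, at toy scale.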

Agreement amongst observers is taken as being the best proxy for truth.

I’ll expand on that tomorrow.

Minutiae: Alpha is a reliability coefficient, a measure of agreement amongst coders. You can read about it here. The R package is irr.


I’m Christopher Berry.
I tweet about analytics @cjpberry