Error happens on both sides in sentiment analysis / opinion mining.

Assume you wanted to understand, quantitatively, whether the books published by Penguin are better received by customers than those published by, say, Doubleday. One way to crack that problem would be to go to and mine the opinions. It stands to reason that books published by one company would receive a higher weighted average star rating, wouldn’t they? But what if that were inadequate, and you wanted to understand the general mood and tone of what was actually being written, presumably so you could learn and adjust? What then?

Assume the reviews have already been extracted into a well-formatted file. In other words, assume a dataset.

Consider two machines, m and n, that operate on that dataset. They use two different algorithms for summarizing the data into easily digestible, actionable dimensions. (What are those? That’s a subject for another post.)

Machine m is a rather naive algorithm. It counts keyword group frequencies and then buckets them. It’s also the more scalable of the two, chewing through the dataset in just a few seconds. It doesn’t use k-nearest neighbors or any other Machine Learning (ML) algorithm. As a result, Machine m would generate a lot of false positives: it would assume that the sentence “I recommend this book for idiots and morons” is actually a good thing. After all, it contains the keyword groups “I recommend” and “recommend”.
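A minimal sketch of what a Machine m style pass might look like. The keyword list, the two bucket names, and the function name machine_m are all hypothetical, made up here for illustration rather than taken from any real system.

```python
import re
from collections import Counter

# Hypothetical keyword groups; a real list would be far longer.
POSITIVE_GROUPS = ["i recommend", "recommend", "loved", "great"]

def machine_m(review: str) -> str:
    """Count keyword-group hits and bucket the review. No context, no ML."""
    text = review.lower()
    hits = Counter()
    for group in POSITIVE_GROUPS:
        # Word-boundary match so "great" does not fire on "greater".
        hits[group] = len(re.findall(r"\b" + re.escape(group) + r"\b", text))
    return "positive" if sum(hits.values()) > 0 else "neutral"

# The false positive from the text: the insult is bucketed as praise.
print(machine_m("I recommend this book for idiots and morons"))  # prints "positive"
```

Nothing here looks at what follows the keyword, which is why it is fast and also why it is fooled.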

Machine n is more thorough at the cost of scalability. It’ll actually break out sentences, tokenize words, and classify everything. In the attempt to rule out false positives, it would begin to generate false negatives. While it would certainly rule out “I recommend this book for idiots and morons”, it might also rule out “I recommend this book for the Queen West hipster”. (The latter being a totally reasonable recommendation.)
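A rough sketch of how a Machine n style pass could overcorrect. The hand-written rule below is only a stand-in for whatever classifier Machine n would really use; the rule, the labels, and the function name machine_n are assumptions for illustration.

```python
import re

def machine_n(review: str) -> str:
    """Split into sentences, tokenize, then classify with a strict rule."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if "recommend" not in tokens:
            continue
        after = tokens[tokens.index("recommend") + 1:]
        # Assumed rule: "recommend ... for <audience>" is treated as
        # possible sarcasm, so the sentence is not counted as positive.
        if "for" in after:
            return "not positive"
        return "positive"
    return "not positive"

print(machine_n("I recommend this book for idiots and morons"))       # prints "not positive"
print(machine_n("I recommend this book for the Queen West hipster"))  # prints "not positive"
```

The first result is the false positive being ruled out. The second is the false negative that the stricter rule brings with it.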

Error happens on both sides: in the attempt to pursue greater accuracy, one might actually be acquiring more error for relatively marginal gains. (And I don’t believe that ‘error cancels out’. If you’ve seen it happen, could you show me?)
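To make the trade concrete, here is a small sketch that counts the two kinds of error on the two example sentences. The predictions for Machine n assume the worst case described above, where it rules out both sentences; the helper name error_counts is hypothetical.

```python
def error_counts(truth, predicted):
    """Return (false_positives, false_negatives) for boolean labels."""
    fp = sum(1 for t, p in zip(truth, predicted) if p and not t)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return fp, fn

# The two example sentences: the first is really negative, the second positive.
truth     = [False, True]
machine_m = [True, True]    # approves anything containing "recommend"
machine_n = [False, False]  # assumed worst case: rules out both

print(error_counts(truth, machine_m))  # prints (1, 0)
print(error_counts(truth, machine_n))  # prints (0, 1)
```

On this tiny example the total error is the same for both machines. It has only moved from one side to the other.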

One thought on “Error on Both Sides”

  1. Matt Gershoff says:

    I assume you are referring to Precision and Recall? Perhaps you might expand on them and include a bit on the confusion matrix. By assigning a cost to each of the off diagonal cells, you can then estimate the value of each of your classification methods.
