I’m taking it easy on the Twitter account – so:

Here are the nuggets:

On Web Mining:

  • The web is an infinite series of edge cases.
  • robots.txt is not a terms-of-use document.
  • Scraping should be done ethically: respect robots.txt, respect the site’s rate limits, and be transparent about who you are and why you’re taking their data.

I already knew about the practical difficulties of mining the web, but I came out of this session with a much greater appreciation for just how tough it is. With so many edge cases, writing defensive code is extremely difficult.
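To make the scraping etiquette above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent string, the sample robots.txt, the example.com URLs, and the helper names (`build_parser`, `crawl_plan`, `polite_crawl`) are all assumptions made up for illustration, and `fetch` stands in for whatever HTTP client you actually use.

```python
import time
import urllib.robotparser

# Hypothetical identity: say who you are and how to reach you.
USER_AGENT = "example-research-bot/0.1 (+mailto:you@example.com)"


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_plan(parser, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for our user agent."""
    delay = parser.crawl_delay(USER_AGENT)
    if delay is None:
        # No Crawl-delay given: fall back to our own conservative pause.
        delay = default_delay
    allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
    return allowed, delay


def polite_crawl(parser, urls, fetch, default_delay=1.0):
    """Fetch only permitted URLs, pausing between requests.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that should send USER_AGENT in its request headers.
    """
    allowed, delay = crawl_plan(parser, urls, default_delay)
    results = []
    for url in allowed:
        results.append(fetch(url))
        time.sleep(delay)
    return results


# Made-up robots.txt content for the example.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = build_parser(SAMPLE_ROBOTS)
allowed, delay = crawl_plan(parser, [
    "https://example.com/index.html",
    "https://example.com/private/report.html",
])
```

Here `allowed` keeps only the index page and `delay` comes from the Crawl-delay line. This covers the mechanical part only. Since robots.txt is not a terms-of-use document, passing this check does not by itself mean the site is fine with what you do with the data.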

    Predictive Models:

    • The complexity of the model needs only to be proportional to the complexity of the problem.
    • Producing random trees and generating a forest is a good way to produce a model with less risk of systematic error.
    • Keep the trees shallow. 

    The decision tree is a much-maligned algorithm that went out of style in the late nineties. However, the idea of generating a random forest is pretty attractive. It reduces the risk of systematic errors creeping in, and because each tree is built on a random sample of the rows, the rows a tree never saw can be used to check it, which largely takes care of the separate cross-validation and testing sets. Those are very big advantages.
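A rough sketch of the forest idea, in plain Python so it runs without any libraries. It is not the speaker's method. The trees are depth-one stumps (as shallow as it gets), each tree gets one randomly chosen feature (a simplification of the usual per-split feature sampling), and the toy data and every function name here are made up for illustration. In practice you would reach for a library implementation.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Depth-1 tree: the threshold on one feature with the fewest errors."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        if not left or not right:
            continue
        l_label, r_label = majority(left), majority(right)
        errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_label, r_label)
    if best is None:
        # Feature is constant in this sample: always predict the majority.
        m = majority(labels)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, l_label, r_label = stump
    return l_label if row[feature] <= threshold else r_label


def fit_forest(rows, labels, n_trees=50, seed=0):
    """Each tree sees a bootstrap sample of rows and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        forest.append((stump, set(idx)))  # remember which rows this tree saw
    return forest


def predict(forest, row):
    """Majority vote across all trees."""
    return majority([predict_stump(stump, row) for stump, _ in forest])


def oob_error(forest, rows, labels):
    """Score each row using only the trees that never saw it (out-of-bag)."""
    wrong = scored = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = [predict_stump(s, row) for s, seen in forest if i not in seen]
        if votes:
            scored += 1
            wrong += int(majority(votes) != y)
    return wrong / scored if scored else None


# Made-up toy data: two features, class 1 when the first is 10 or more.
rows = [(i, 2 * i) for i in range(20)]
labels = [0 if i < 10 else 1 for i in range(20)]

forest = fit_forest(rows, labels)
err = oob_error(forest, rows, labels)
```

`oob_error` is the built-in check. Every row gets scored only by trees whose bootstrap sample left it out, so you get a held-out style error estimate without carving off a separate test set.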

      Key takeaways from hallway sidebar conversations:

      • Expect a lot more talk about trees in the next few months.
      • Hiring headaches proliferate in multiple markets – from SF to Dublin.

      Shameless plug ahead:

      If you’re at Strata, come see Data Science in Marketing Analytics.

      It’s a war story and case study on applied data science. It’s where the theoretical rubber hits the road.

      Wednesday, February 29th, at 4:50pm in Mission City B4.

      Find Truths, Repeat Them Endlessly


      I’m Christopher Berry.
      I tweet about analytics @cjpberry
      I write at christopherberry.ca