The New York Times released an annotated news corpus about two months ago. Evan Sandhaus, one of their search experts, gave a presentation to the New York Semantic Web Meetup group on Dec 4, although I was unfortunately out of town and missed it. The corpus contains 1.8 million articles spanning 10 years, tagged for persons, places, organizations, and topics “using a controlled vocabulary that is applied consistently across articles”.
It's a great opportunity for machine learning, with a few caveats. It covers ten years, up to July 19, 2007. News topics aren’t terribly stable: politics today involves different issues than it did ten years ago. Of course, ten years is still a reasonable span to train over. If you were to identify the terms and concepts that compose Politics as a topic, you would find that they vary on many different timescales. But classifiers are static, so it is possible to train on too wide a date range. Imagine, for example, training on all 160 or so years of the history of the New York Times. The very language will have changed beneath your feet. You might well extract language elements that are stable. The White House, after all, is still the “White House”. However, you will have traded short-term accuracy for long-term stability. You would have been better off training over only the last few years.
As a rule of thumb, if you want a classifier good for the next N years, train it over the last N years, or something on that order. If you can weight the training items, do so, giving recent articles more weight than old ones. If I could afford to retrain topics every month, I’d only train them on a month or so of data [well, a few months, considering how stable most news topics are; continuous retraining would be nice, of course]. That at least is my off-the-cuff analysis. It would be interesting to put it to the test.
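To make the rule of thumb concrete, here is a minimal Python sketch of what I have in mind. Everything in it is an assumption of mine rather than anything that ships with the corpus: the function names `recency_weight` and `training_window` are hypothetical, articles are represented as plain `(date, text)` pairs, and the exponential decay with a half-life of half the horizon is just one plausible weighting scheme, not a tested one.

```python
from datetime import date


def recency_weight(article_date, as_of, half_life_days=365.0):
    """Return a weight that halves for every `half_life_days` of article age.

    The exponential decay is an illustrative assumption, not a tuned choice.
    """
    age_days = (as_of - article_date).days
    if age_days < 0:
        raise ValueError("article is dated after the reference date")
    return 0.5 ** (age_days / half_life_days)


def training_window(articles, as_of, horizon_years):
    """Keep articles from the last `horizon_years` and attach a recency weight.

    `articles` is an iterable of (date, text) pairs (a hypothetical layout).
    Returns a list of (text, weight) pairs, in the original order.
    """
    cutoff_days = horizon_years * 365.25
    half_life_days = cutoff_days / 2  # assumed: half-life is half the horizon
    kept = []
    for article_date, text in articles:
        age_days = (as_of - article_date).days
        if 0 <= age_days <= cutoff_days:
            weight = recency_weight(article_date, as_of, half_life_days)
            kept.append((text, weight))
    return kept


if __name__ == "__main__":
    as_of = date(2007, 7, 19)  # last date covered by the corpus
    sample = [
        (date(1997, 8, 1), "old politics story"),
        (date(2006, 7, 19), "last year's politics story"),
        (date(2007, 7, 1), "recent politics story"),
    ]
    # Train for a two-year horizon: the 1997 story falls outside the window.
    for text, weight in training_window(sample, as_of, horizon_years=2):
        print(f"{weight:.3f}  {text}")
```

The resulting weights could be handed to any learner that accepts per-item weights. Whether a hard cutoff, a decay, or both works best is exactly the sort of thing that would need testing against the corpus.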
Anyhow, it's great to see the New York Times release this data. About two years ago, they showed me a couple of huge tomes that contained the same information, indexed in various ways, and I began to salivate uncontrollably. They have a crew of librarians that do this for all of their content. I suppose, if you aspire to be the paper of record, you can justify such an expense. I’m not sure if it's a canny long-term investment in their reputation, or simple corporate largess, but it reminds me of the tank-like phones that Ma Bell used to make when they were a monopoly, built to five-nines reliability like the rest of their system. The telephone was a public service and they wanted it to always be there, even if your power went out. If you watch movies from a few decades ago, you’ll recognize them. They could be used to bludgeon someone to death, and still work fine for calling the police.
That’s why, of all the newspapers falling on hard times, the New York Times worries me the most. Few papers would make this sort of contribution, or so thoroughly index their content. And as for my personal news-reading proclivities, I love the internet, but sometimes I want a paper that’s solidly written, and that I can roll up and use to bludgeon someone.
December 12, 2008 at 6:44 pm
ha! love the punch line.