The New York Times Annotated Corpus

December 4, 2008

The New York Times released an annotated news corpus about two months ago.  Evan Sandhaus, one of their search experts, gave a presentation to the New York Semantic Web Meetup group on Dec 4, although I was unfortunately out of town and missed it.  The corpus holds 1.8 million articles over 10 years, tagged for persons, places, organizations, and topics “using a controlled vocabulary that is applied consistently across articles”.

It’s a great opportunity for machine learning, with a few caveats.  It covers ten years, up to July 19, 2007.  News topics aren’t terribly stable.  Politics today involves different issues than it did ten years ago.  Of course, training over ten years is pretty good.  If you were to identify the terms and concepts that compose Politics as a topic, you would find that they vary on many different timescales.  But classifiers are static.  It is therefore possible to train on too wide a date range.  Imagine, for example, training on all 160 or so years of the history of the New York Times.  The very language will have changed beneath your feet.  You might well extract language elements that are stable.  The White House, after all, is still the “White House”.  However, you will have traded short-term accuracy for long-term stability.  You would have been better off training over only the last few years.

As a rule of thumb, if you want a classifier good for the next N years, train it over the last N years, or something on that order.  If you can weight the training items, do so, giving recent articles more influence than old ones.  If I could afford to retrain topics every month, I’d only train them on a month or so of data [well, a few months, considering how stable most news topics are.  Continuous retraining would be nice, of course].  That at least is my off-the-cuff analysis.  It would be interesting to put it to the test.
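
To make the weighting idea concrete, here is a minimal sketch in Python.  It assumes exponential decay with a one-year half-life, which is my own illustrative choice rather than anything the corpus prescribes; the function name recency_weights and its parameters are hypothetical.

```python
from datetime import date


def recency_weights(doc_dates, as_of, half_life_days=365.0):
    """Return one training weight per document.

    A document's weight halves for every `half_life_days` of age,
    measured back from `as_of`. The half-life is an assumed tuning
    knob, not a recommended value.
    """
    weights = []
    for d in doc_dates:
        # Clamp at zero so documents dated after `as_of` get weight 1.0.
        age_days = max((as_of - d).days, 0)
        weights.append(0.5 ** (age_days / half_life_days))
    return weights


# Three hypothetical article dates: the last day of the corpus,
# one year earlier, and ten years earlier.
dates = [date(2007, 7, 19), date(2006, 7, 19), date(1997, 7, 19)]
weights = recency_weights(dates, as_of=date(2007, 7, 19))

for d, w in zip(dates, weights):
    print(d.isoformat(), round(w, 4))
```

With these settings the newest article gets full weight, the year-old article gets half, and the ten-year-old article counts for roughly a thousandth.  Many learning libraries accept per-item weights at training time (scikit-learn estimators, for instance, often take a sample_weight argument to fit), so a list like this could be passed straight through.  Whether a one-year half-life is right for news topics is exactly the sort of thing that would need testing.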

Anyhow, it’s great to see the New York Times release this data.  About two years ago, they showed me a couple of huge tomes that contained the same information, indexed in various ways, and I began to salivate uncontrollably.  They have a crew of librarians that do this for all of their content.  I suppose, if you aspire to be the paper of record, you can justify such an expense.  I’m not sure if it’s a canny long-term investment in their reputation, or simple corporate largess, but it reminds me of the tank-like phones that Ma Bell used to make when they were a monopoly, built to 5 nines reliability like the rest of their system.  The telephone was a public service and they wanted it to always be there, even if your power went out.  If you watch movies from a few decades ago, you’ll recognize those phones.  They could be used to bludgeon someone to death, and still work fine for calling the police.

That’s why, of all the newspapers falling on hard times, the New York Times worries me the most.  Few papers would make this sort of contribution, or so thoroughly index their content.  And as for my personal news-reading proclivities, I love the internet, but sometimes I want a paper that’s solidly written, and that I can roll up and use to bludgeon someone.


The New York Times launches an API

October 15, 2008

The New York Times just launched their first API.  You can get campaign finance data and movie reviews.  The campaign finance API is based on data from the Federal Election Commission, which you can already get online elsewhere.  So the information itself is not big news.  But information is valuable in direct proportion to its accessibility, and that’s why this is so great.  This is the same principle that drives the GDP multiplier for the link economy.  If you follow their site, you have probably noticed that the Times is good at pulling information into engaging and informative Flash widgets and interactive maps.  Having the data available to the public through an API lets the rest of us do the same thing with the campaign finance and movie datasets.  The ways in which you can query the data aren’t as numerous as if you had it all housed in a relational database on your own server, but you have to make some accommodations for scalability.

The movie API is also nice, although it has a similarly narrow focus.  Based on earlier statements, I expect them to launch a few more separate APIs, for things such as restaurant reviews, local listings, and recipes.  Of course I’d prefer that they release it all at once, under a uniform API.  But I hope these go well, since I would love to see all of their content available this way.  About a year ago they also pondered offering a search API, but I have heard nothing about it since.  I think they don’t realize how cruelly they tease us.

At Daylife we recently launched a service to help publishers do just this.  Our CEO Upendra Shardanand blogged about it in more detail.  The Times has a large and talented group of technologists at their disposal, so they can go it alone for a project like this.  Rolling out an API calls for a different set of skills and has different infrastructure requirements than a news portal or web site, and not all publishers will be able to do it successfully.  Even those that do have the technical skills won’t be able to do it as quickly.