The New York Times Annotated Corpus

December 4, 2008

The New York Times released an annotated news corpus about two months ago.  Evan Sandhaus, one of their search experts, gave a presentation to the New York Semantic Web Meetup group on Dec 4, although I was unfortunately out of town and missed it.  The corpus contains 1.8 million articles over 10 years, tagged for persons, places, organizations, and topics “using a controlled vocabulary that is applied consistently across articles”.

It’s a great opportunity for machine learning, with a few caveats.  It covers ten years, up to July 19, 2007.  News topics aren’t terribly stable.  Politics today involves different issues than it did ten years ago.  Of course, training over ten years is pretty good.  If you were to identify the terms and concepts that compose Politics as a topic, you would find that they vary on many different timescales.  But classifiers are static.  It is therefore possible to train on too wide a date range.  Imagine, for example, training on all 160 or so years of the history of the New York Times.  The very language will have changed beneath your feet.  You might well extract language elements that are stable.  The White House, after all, is still the “White House”.  However, you will have traded short-term accuracy for long-term stability.  You would have been better off training over only the last few years.

As a rule of thumb, if you want a classifier good for the next N years, train it over the last N years, or something on that order.  If you can weight the training items, do so, giving recent articles more weight than old ones.  If I could afford to retrain topics every month, I’d only train them on a month or so of data (well, a few months, considering how stable most news topics are; continuous retraining would be nice, of course).  That at least is my off-the-cuff analysis.  It would be interesting to put it to the test.
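To make the weighting idea concrete, here is a minimal sketch of one way to do it.  Everything in it is an assumption for illustration: the function name, the exponential decay, and the one-year half-life are not from the corpus or from any tested setup.

```python
from datetime import date


def recency_weight(article_date, today, half_life_days=365.0):
    """Return a training weight that halves for every half-life of article age.

    The exponential decay and the default one-year half-life are illustrative
    assumptions, not tuned values.
    """
    # Articles dated in the future are treated as brand new.
    age_days = max((today - article_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


# Hypothetical usage: weight three articles relative to the end of the corpus.
corpus_end = date(2007, 7, 19)
dates = [date(2007, 7, 19), date(2006, 7, 19), date(2005, 7, 19)]
weights = [recency_weight(d, corpus_end) for d in dates]
```

Most classifier libraries that accept per-item weights could take a list like `weights` directly.  A shorter half-life favors short-term accuracy, a longer one favors long-term stability.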

Anyhow, it’s great to see the New York Times release this data.  About two years ago, they showed me a couple of huge tomes that contained the same information, indexed in various ways, and I began to salivate uncontrollably.  They have a crew of librarians that do this for all of their content.  I suppose, if you aspire to be the paper of record, you can justify such an expense.  I’m not sure if it’s a canny long-term investment in their reputation, or simple corporate largess, but it reminds me of the tank-like phones that Ma Bell used to make when they were a monopoly, built to five-nines reliability like the rest of their system.  The telephone was a public service and they wanted it to always be there, even if the rest of your power went out.  If you watch movies from a few decades ago, you’ll recognize them.  They could be used to bludgeon someone to death, and still work fine for calling the police.

That’s why, of all the newspapers falling on hard times, the New York Times worries me the most.  Few papers would make this sort of contribution, or so thoroughly index their content.  And as for my personal news-reading proclivities, I love the internet, but sometimes I want a paper that’s solidly written, and that I can roll up and use to bludgeon someone.


Active learning with SVMs for imbalanced datasets

August 20, 2008

I went to a great talk recently by Michael Bloodgood, a PhD candidate at the University of Delaware. Just up the street from us on Broadway is an NYU computer science group that periodically has talks open to the community, and frequently the topic is right up our alley. If you’re near New York and interested in natural language processing, they operate a mailing list that can keep you updated on events.

Anyhow, the future Dr. Bloodgood is working on some interesting and very practical ways to economically train support vector machines.  One of the data sets he worked with involved assigning news articles to topics, something we do here.  Topics can’t feasibly be defined automatically; you need some external authority.  You can try to extract topics from a set of documents ab initio, but there are no clear boundaries, labeling is difficult, and you don’t know which groups are interesting.  So you need humans in there somewhere.

So imagine we want a topic on Middle Distance Running.  A common way of getting one is to use some humans to make some yes/no determinations on whether articles belong to the topic.  How do you select which articles to present to the humans?  When do you stop sending them?  Pretty basic questions, but ones that have only recently begun to be answered.  Scoring articles can be an onerous task, and depending on the topic, it could take a lot of them, or only a few.  “Middle Distance Running” only has a few events, with a relatively distinct vocabulary.  “The Environment” is much more nebulous, and will take more training to nail down.

Bloodgood’s proposal is to stop training when the predictions become stable.  The stability is measured using a randomly selected set of items that do not need to be labeled.  Furthermore, the ratio of positive to negative training examples (positive amplification) is fixed based on an early estimate of the ratio of positive to negative items in the population.  The training examples are selected from points that are close to the hyperplane that separates positive and negative assignments by the support vector machine.  Compared with a few other stopping criteria and positive amplifications, he gets measurably better results.
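Here is a rough sketch of two of the pieces described above: picking the items nearest the hyperplane, and checking whether predictions on the unlabeled set have settled down.  It is not Bloodgood’s implementation.  The function names, the plain agreement rate used as the stability measure, the 0.99 threshold, and the three-round window are all assumptions of mine for illustration.

```python
def closest_to_hyperplane(scores, k):
    """Return indices of the k items whose scores are nearest zero.

    `scores` are signed distances from the separating hyperplane, such as an
    SVM decision function would produce for the unlabeled pool.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return ranked[:k]


def agreement(previous, current):
    """Fraction of stop-set items given the same label in two rounds.

    Plain agreement is an assumed stand-in for whatever stability measure
    the talk actually used.
    """
    same = sum(1 for p, c in zip(previous, current) if p == c)
    return same / len(current)


def predictions_stable(history, threshold=0.99, rounds=3):
    """True when the last `rounds` round-to-round comparisons all agree.

    `history` holds one list of predicted labels per training round, each
    made on the same randomly chosen, unlabeled stop set.
    """
    if len(history) < rounds + 1:
        return False
    recent = history[-(rounds + 1):]
    return all(agreement(a, b) >= threshold
               for a, b in zip(recent, recent[1:]))
```

In an active learning loop, you would retrain after each batch of human labels, append the new stop-set predictions to `history`, and stop asking for labels once `predictions_stable(history)` returns true.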

Using humans to train is expensive, whether you’re paying for it, or utilizing limited free resources.  You can imagine a web site where human feedback is used to refine a topic, either through submission of content, or removal of off-topic content.  If all you had were positive examples that humans submitted, you’d use a different method than the SVM method Bloodgood talks about.  If negative tagging (removing off-topic articles) is possible, you can imagine seeding an SVM with a search term, and presenting articles that the classifier tells you belong to the topic, and a few that are there for training purposes to see if anyone rejects them.  You know many of the training articles will be off-topic although relatively close, but you take the data quality hit for the sake of long-term improvements.  At some point, you shut them off, based on the stopping criteria (the predictions become stable).  Not perfect, since only some of the off-topic articles will be rejected, and there will be some bias.  And the prediction will be stable only if the criteria used by your users are also stable.  However, I would bet that it would work fine.  Leaning on users to maintain topics in this fashion isn’t on the Daylife roadmap, since we’re more about providing DayPI customers with the ability to own and maintain their own topics, but perhaps some day, if there is sufficient interest.
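The mix of confident articles and training articles could be put together along these lines.  This is only a sketch of the idea in the paragraph above: `build_feed`, its parameters, and the choice of taking probes from the articles nearest the decision boundary are hypothetical, not anything Daylife runs.

```python
def build_feed(scores, feed_size, probes):
    """Split a topic feed into confident articles and training probes.

    `scores` are signed classifier scores, where positive means on-topic.
    Returns (shown, probe_items) as lists of article indices. Probes are the
    articles nearest the decision boundary that are not already shown, so
    many of them will be off-topic but close.
    """
    on_topic = sorted((i for i, s in enumerate(scores) if s > 0),
                      key=lambda i: -scores[i])
    # Reserve `probes` slots in the feed for training articles.
    shown = on_topic[:max(feed_size - probes, 0)]
    remaining = [i for i in range(len(scores)) if i not in shown]
    probe_items = sorted(remaining, key=lambda i: abs(scores[i]))[:probes]
    return shown, probe_items


# Hypothetical scores for five candidate articles.
scores = [2.5, 0.1, -0.2, 1.0, -3.0]
shown, probe_items = build_feed(scores, feed_size=3, probes=1)
```

Any probe that users remove becomes a negative training example.  Probes that survive are only weak positives, since not every off-topic article gets rejected.  Once the predictions become stable, `probes` can be set to zero.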

So the research presented in the talk was very practical, and it will help reduce the cost associated with training classifiers for certain types of articles.  If you happen to be seeding the SVM with search terms, or if there is some non-random chance that an article chosen for annotation won’t in fact be annotated, as with the community-driven topic, I would still trust the stopping criteria and positive amplification selection presented in his work, unless I’m missing something.