Active learning with SVMs for imbalanced datasets

August 20, 2008

I went to a great talk recently by Michael Bloodgood, a PhD candidate at the University of Delaware. Just up the street from us on Broadway is an NYU computer science group that periodically has talks open to the community, and frequently the topic is right up our alley. If you’re near New York and interested in natural language processing, they operate a mailing list that can keep you updated on events.

Anyhow, the future Dr. Bloodgood is working on some interesting and very practical ways to economically train support vector machines.  One of the data sets he worked with involved assigning news articles to topics, something we do here.  Topics can’t feasibly be defined automatically; you need some external authority.  You can try to extract topics from a set of documents ab initio, but there are no clear boundaries, labeling is difficult, and you don’t know which groups are interesting.  So you need humans in there somewhere.

So imagine we want a topic on Middle Distance Running.  A common way of getting one is to have some humans make yes/no determinations on whether articles belong to the topic.  How do you select which articles to present to the humans?  When do you stop sending them?  Pretty basic questions, but ones that are only recently being answered.  Scoring articles can be an onerous task, and depending on the topic, it could take a lot of articles, or only a few.  “Middle Distance Running” only has a few events, with a relatively distinct vocabulary.  “The Environment” is much more nebulous, and will take more training to nail down.

Bloodgood’s proposal is to stop training when the predictions become stable.  The stability is measured using a randomly selected set of items that do not need to be labeled.  Furthermore, the ratio of positive to negative training examples (positive amplification) is fixed based on an early estimate of the ratio of positive to negative items in the population.  The training examples are selected from points that are close to the hyperplane that separates positive and negative assignments by the support vector machine.  Compared with a few other stopping criteria and positive amplifications, his approach gets measurably better results.
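To make the moving parts concrete, here is a rough sketch in Python of the three pieces described above.  It is my own illustration, not code from the talk.  All function names are made up, and several details are assumptions: the amplification factor is taken to be the negative-to-positive ratio in a small random labeled sample, stability is measured as plain agreement between successive rounds of predictions on the unlabeled set, and the 0.99 threshold and window of 3 rounds are placeholder values.

```python
from typing import List, Sequence


def positive_weight(seed_labels: Sequence[int]) -> float:
    """Estimate a positive amplification factor from a small random labeled sample.

    Assumption: the factor is the negative-to-positive ratio in the sample,
    so the rarer positive class gets weighted up. Labels are 1 or 0.
    """
    positives = sum(1 for y in seed_labels if y == 1)
    negatives = len(seed_labels) - positives
    if positives == 0:
        raise ValueError("need at least one positive example to estimate the ratio")
    return negatives / positives


def closest_to_hyperplane(decision_values: Sequence[float], k: int) -> List[int]:
    """Return indices of the k unlabeled items with the smallest |decision value|."""
    order = sorted(range(len(decision_values)), key=lambda i: abs(decision_values[i]))
    return order[:k]


def agreement(previous: Sequence[int], current: Sequence[int]) -> float:
    """Fraction of items whose predicted label is the same in both rounds."""
    if len(previous) != len(current) or not previous:
        raise ValueError("prediction lists must be non-empty and the same length")
    same = sum(1 for p, c in zip(previous, current) if p == c)
    return same / len(previous)


def predictions_stable(history: Sequence[Sequence[int]],
                       threshold: float = 0.99,
                       window: int = 3) -> bool:
    """True once each of the last `window` rounds agrees with the round before it.

    `history` holds one list of predictions per training round, all made on the
    same randomly selected, unlabeled set of items.
    """
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(agreement(a, b) >= threshold for a, b in zip(recent, recent[1:]))
```

In a training loop you would retrain the SVM each round, ask humans to label the items returned by closest_to_hyperplane, append the new predictions on the unlabeled set to the history, and stop once predictions_stable returns True.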

Using humans to train is expensive, whether you’re paying for it, or utilizing limited free resources.  You can imagine a web site where human feedback is used to refine a topic, either through submission of content, or removal of off-topic content.  If all you had were positive examples that humans submitted, you’d use a different method than the SVM method Bloodgood talks about.  If negative tagging (removing off-topic articles) is possible, you can imagine seeding an SVM with a search term, and presenting articles that the classifier tells you belong to the topic, plus a few that are there for training purposes to see if anyone rejects them.  You know many of the training articles will be off-topic although relatively close, but you take the data quality hit for the sake of long-term improvements.  At some point, you stop showing the training articles, based on the stopping criteria (the predictions become stable).  Not perfect, since only some of the off-topic articles will be rejected, and there will be some bias.  And the prediction will be stable only if the criteria used by your users are also stable.  However, I would bet that it would work fine.  Leaning on users to maintain topics in this fashion isn’t on the Daylife roadmap, since we’re more about providing DayPI customers with the ability to own and maintain their own topics, but perhaps some day, if there is sufficient interest.
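Here is a rough sketch of how such a page might be put together.  This is purely hypothetical, not something we run.  The function names are made up, and it assumes the classifier gives each article a signed score where positive means on-topic.  It shows the articles the classifier is most sure of, plus a few near the boundary as training probes, and treats only explicit rejections as labels, since an article nobody rejected isn’t necessarily on-topic.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_topic_page(scores: Sequence[float],
                     n_confident: int,
                     n_probe: int) -> Tuple[List[int], List[int]]:
    """Split article indices into confident on-topic picks and boundary probes.

    scores: signed classifier scores, positive meaning "on topic" (assumed).
    Returns (confident, probes): the highest-scoring positive articles, and
    the remaining articles closest to the decision boundary.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    confident = [i for i in ranked if scores[i] > 0][:n_confident]
    shown = set(confident)
    remaining = [i for i in range(len(scores)) if i not in shown]
    probes = sorted(remaining, key=lambda i: abs(scores[i]))[:n_probe]
    return confident, probes


def labels_from_feedback(shown: Sequence[int], rejected: Set[int]) -> Dict[int, int]:
    """Turn user rejections into negative labels (0); everything else stays unlabeled."""
    return {i: 0 for i in shown if i in rejected}
```

The negative labels from labels_from_feedback would go back into the training set, and the probes would be dropped from the page once the predictions stop changing.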

So the research presented in the talk was very practical, and it will help reduce the cost associated with training on certain types of articles.  If you happen to be seeding the SVM with search terms, or if there is some non-random chance that an article chosen for annotation won’t in fact be annotated, as with the community-driven topic, I would still trust the stopping criteria and positive amplification selection presented in his work, unless I’m missing something.