New York Times Article API Offering Faceting Search

The New York Times released an article search API today, and not just any article search API, but an excellent one.  You can filter results by a large number of facets.  For example, this query returns articles about green energy from the Business section that mention Google:

http://api.nytimes.com/svc/search/v1/article?query=green%20energy%20nytd_section_facet:[Business]%20org_facet:[GOOGLE%20INC]&api-key=[your-api-key-here]
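
If you would rather assemble that URL in code than by hand, here is a minimal Python sketch. The endpoint and the facet:[value] syntax come from the example above; the function name, its parameters, and the "demo-key" value are my own placeholders, not part of the API.

```python
from urllib.parse import quote

# Endpoint taken from the example URL in this post.
BASE_URL = "http://api.nytimes.com/svc/search/v1/article"


def build_search_url(terms, facet_filters, api_key):
    """Build a search URL with facet filters (hypothetical helper).

    terms         -- free-text search terms, e.g. "green energy"
    facet_filters -- dict mapping facet name to the value to filter on
    api_key       -- your own API key
    """
    parts = [terms]
    for facet, value in facet_filters.items():
        # Each filter uses the facet:[value] form shown in the post.
        parts.append(f"{facet}:[{value}]")
    query = " ".join(parts)
    # Leave ':' and the square brackets unescaped so the result matches
    # the hand-written URL; spaces become %20.
    encoded_query = quote(query, safe=":[]")
    return f"{BASE_URL}?query={encoded_query}&api-key={quote(api_key, safe='')}"


url = build_search_url(
    "green energy",
    {"nytd_section_facet": "Business", "org_facet": "GOOGLE INC"},
    "demo-key",  # placeholder, substitute your own key
)
print(url)
```

The sketch only builds the string; fetching it and reading the response is left out because the response format is not covered here.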

You can also request facets and counts for the search query:

http://api.nytimes.com/svc/search/v1/article?query=mortgage%20crisis&api-key=[your-api-key-here]&facets=des_facet,per_facet,org_facet
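
The same request can be put together programmatically. This is a rough Python sketch that assumes only what the URL above shows: a facets parameter holding a comma-separated list of facet names. The helper name and the "demo-key" value are hypothetical.

```python
from urllib.parse import quote

# Endpoint taken from the example URL in this post.
BASE_URL = "http://api.nytimes.com/svc/search/v1/article"


def build_facet_count_url(terms, facets, api_key):
    """Build a URL that asks for facet counts alongside the results
    (hypothetical helper; parameter names mirror the URL in the post)."""
    # Facet names are joined with commas, as in the example URL.
    facet_list = ",".join(facets)
    return (
        f"{BASE_URL}?query={quote(terms, safe='')}"
        f"&api-key={quote(api_key, safe='')}"
        f"&facets={quote(facet_list, safe=',')}"
    )


facet_url = build_facet_count_url(
    "mortgage crisis",
    ["des_facet", "per_facet", "org_facet"],
    "demo-key",  # placeholder, substitute your own key
)
print(facet_url)
```

Passing a shorter list of facet names is how you would restrict which types of facets come back.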

The API is in fact more capable than the Times's own site search.  The site's advanced mode shows facet counts at the bottom of the results, but it is less flexible: it does not let you progressively filter by multiple facets or restrict the types of facets returned.
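
To make "progressively filter" concrete, here is a small Python sketch of narrowing a query one facet at a time. It assumes only the facet:[value] query syntax used in the example URLs in this post; the narrow function is a hypothetical helper, and the resulting string would still need URL-encoding before being sent.

```python
def narrow(query, facet, value):
    """Return a new query string with one more facet filter appended
    (hypothetical helper; uses the facet:[value] syntax from the post)."""
    return f"{query} {facet}:[{value}]"


# Start broad, then narrow step by step.
query = "green energy"
query = narrow(query, "nytd_section_facet", "Business")
query = narrow(query, "org_facet", "GOOGLE INC")
print(query)
```

Each step returns a new string, so you can keep the earlier, broader queries around and back out of a filter that narrows too far.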

There are some gaps.  The facets are sometimes ambiguous.  For example, a search for “George Bush” returns one facet for “George W. Bush” and another for “George Bush”, with about equal counts, so although the vocabulary is controlled, it is not perfect.  Several API providers would not let an ambiguous facet like “George Bush” through.  There is also no provision for linking a person or organization facet to an external source such as DBPedia, which Calais and Zemanta both do.

The Times has been cranking out APIs once every month or two, but this is the first one that opens up its own content.

2 Responses to “New York Times Article API Offering Faceting Search”

  1. Derek Gottfrid Says:

    In the case above, those are actually two different George Bushes. The metadata set dates back pretty far, to long before we expected W. to be part of the controlled vocabulary. The equal counts are a bit misleading, as we are still trying to perfect that portion of the code. I am really excited to see people mention DBPedia, Calais, Linked Data, etc., because it fits with our long-term thinking.

  2. Ken Ellis Says:

    Thanks for the input, Derek. I suppose this is an excellent example of some long-term problems with named entities and tags, and with referencing external data sets. I noticed that even recent news items for H. W. are tagged with “George Bush”, a perfectly reasonable tag before W. stepped onto the stage. I suppose that tag was maintained for consistency across the archive. Calais solves this with its own meaningless URIs, so even if the name changes the URI can stay the same. DBPedia’s URIs, by contrast, can be renamed and forwarded. I imagine that the Mike Wallace entry was initially for the journalist, and now it points to one of several people with that name. I also imagine company names are even messier.

    So it seems that, in comparison with the relatively young news startups doing similar things, you have some unique issues dealing with an archive that spans a quarter century. Not to mention dealing with legacy systems and formats. Hopefully I’ll be in the news business long enough to have such problems.
