New York Times Article API Offering Faceted Search

February 4, 2009

The New York Times released an article search API today, and not just any article search API, but an excellent one.  You can filter results by a large number of facets.  For example, articles about green energy from the business section that mention Google:

http://api.nytimes.com/svc/search/v1/article?query=green%20energy%20nytd_section_facet:[Business]%20org_facet:[GOOGLE%20INC]&api-key=[your-api-key-here]
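As a rough sketch of how a URL like this can be assembled in Python: the function name, its parameters, and the placeholder key below are my own inventions for illustration, and the only things taken from the API are the endpoint and the query syntax visible in the example above.

```python
from urllib.parse import quote

# Endpoint as it appears in the example URL above.
BASE_URL = "http://api.nytimes.com/svc/search/v1/article"


def build_search_url(terms, facet_filters, api_key):
    """Build a search URL with facet filters appended to the query.

    `facet_filters` is a list of (facet_name, value) pairs; this helper
    and its signature are hypothetical, not part of the Times API.
    """
    # Each filter takes the form facet_name:[VALUE], joined by spaces.
    parts = [terms] + [f"{name}:[{value}]" for name, value in facet_filters]
    # Keep ':' and the brackets literal, as in the example; spaces become %20.
    query = quote(" ".join(parts), safe=":[]")
    return f"{BASE_URL}?query={query}&api-key={quote(api_key, safe='')}"


url = build_search_url(
    "green energy",
    [("nytd_section_facet", "Business"), ("org_facet", "GOOGLE INC")],
    "YOUR-API-KEY",  # placeholder, not a real key
)
print(url)
```

Adding another pair to the list narrows the search by one more facet.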

You can also request facets and counts for the search query:

http://api.nytimes.com/svc/search/v1/article?query=mortgage%20crisis&api-key=[your-api-key-here]&facets=des_facet,per_facet,org_facet
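The facet-count request can be put together the same way.  This is only a sketch: the helper name and the placeholder key are assumptions of mine, and the parameter names (query, api-key, facets) are simply the ones visible in the URL above.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the example URL above.
BASE_URL = "http://api.nytimes.com/svc/search/v1/article"


def build_facet_count_url(query, facets, api_key):
    """Build a URL asking for counts on the given facets (hypothetical helper)."""
    params = {
        "query": query,
        "api-key": api_key,
        "facets": ",".join(facets),  # comma-separated list of facet names
    }
    # quote_via=quote encodes spaces as %20; safe="," keeps the commas literal.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe=",")


facet_url = build_facet_count_url(
    "mortgage crisis",
    ["des_facet", "per_facet", "org_facet"],
    "YOUR-API-KEY",  # placeholder, not a real key
)
print(facet_url)
```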

The API is in fact more capable than their site search.  The site's advanced mode does show facet counts at the bottom of the results, but it is less flexible about letting you progressively filter by multiple facets or restrict the types of facets returned.

There are some gaps.  The facets are sometimes ambiguous.  For example, if you search for “George Bush”, you will get one facet for “George W. Bush” and another for “George Bush”, with about equal counts.  So although the vocabulary is controlled, it is not perfect.  There are several API providers who would not let an ambiguous facet like “George Bush” through.  There are also no provisions for associating a person or organization facet with any external sources like DBpedia, which Calais and Zemanta both do.

The Times has been cranking out APIs once every month or two, but this is the first one that opens up their own content.


First Review of Zemanta

December 18, 2008

Zemanta is an interesting and rapidly improving startup that specializes in providing related content and links for blogs and articles.  They have a nice demo that should make what they’re doing instantly clear.  What prompted this review is the recent release of an API.  It provides everything you can see in their demo, making it easier for publishers to incorporate into their platforms.  If you want to tinker with them, start with the demo.

The first thing I’ll say, with respect to their blogging tools, is that I love how they’ve worked humans into the loop.  Algorithms aren’t replacing humans anytime soon, so giving bloggers an easy way to filter related content and highlighted phrases is a great move.

As for the quality of their algorithms, I looked at entity and topic extraction, and their related articles.  For entity extraction, they follow a path similar to Evri, which I looked at earlier, in that they are leveraging Wikipedia.  A good test case is the name Mike Wallace, which belongs to both a journalist and a NASCAR driver.  You can plug some text into their API, or into the demo page, and see for yourself that their algorithm defaults to the journalist.  If you start entering terms related to stock car racing, you still get the journalist.  However, if you mention NASCAR or the Daytona 500, the algorithm resolves to the NASCAR driver.  If you mention the Indianapolis 500 instead, you still get the journalist (the Indy 500 is open-wheel racing, not stock car).  If you mention Rusty Wallace, the NASCAR driver’s older brother, GEICO his sponsor, or Dale Earnhardt, you still get the journalist.  Rusty and GEICO are both mentioned in the driver’s Wikipedia entry.  So the terms used for disambiguation seem to be limited, but it’s pretty good.

On a sample of a few articles, Evri does a better job resolving between the two Wallaces.  It’s not an exhaustive test, but I rate Evri’s disambiguation as better than Zemanta’s.  However, the range of entities and tags Zemanta extracts is more extensive.  They also offer a few nice touches like resolving “2008 European Tour” to “PGA European Tour”.  Evri doesn’t get either one.  Learning automatically to map the former to the latter would be tricky.  I suspect they’ve worked some humans into the backend somewhere, to help with disambiguation for certain entities or groups of entities.  I should also note that Evri just released a beta API.  Welcome to the neighborhood.
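To make the Mike Wallace behavior concrete, here is a toy sketch of context-term disambiguation with a default candidate.  To be clear, this is not Zemanta’s algorithm, which I have not seen.  The candidate labels and the context phrases are made up by me, chosen only so the toy reproduces what I observed: a narrow set of trigger terms and a fallback to the journalist.

```python
import re

# Hypothetical candidates, each with a hand-picked set of context phrases.
CANDIDATES = {
    "Mike Wallace (journalist)": {"60 minutes", "cbs", "journalist"},
    "Mike Wallace (racing driver)": {"nascar", "daytona 500"},
}
DEFAULT = "Mike Wallace (journalist)"


def disambiguate(text, candidates=CANDIDATES, default=DEFAULT):
    """Pick the candidate whose context phrases appear most often in text."""
    lowered = text.lower()
    scores = {}
    for entity, phrases in candidates.items():
        # Count how many of this candidate's phrases occur as whole words.
        scores[entity] = sum(
            1
            for phrase in phrases
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        )
    best = max(scores, key=scores.get)
    top = scores[best]
    # With no evidence, or a tie, fall back to the default candidate.
    if top == 0 or list(scores.values()).count(top) > 1:
        return default
    return best


print(disambiguate("Mike Wallace ran well at the Daytona 500."))
print(disambiguate("Mike Wallace, brother of Rusty Wallace, drives for GEICO."))
```

With a phrase list this short, related terms like Rusty Wallace or GEICO contribute nothing, which is the same kind of gap I saw in the demo.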

I mashed up the APIs from OpenCalais and Zemanta to compare raw entity extraction.  OpenCalais is a bit of a different beast.  They only extract raw entities, and do not, for example, merge “George W. Bush” with “President Bush”, or attempt to disambiguate entities with the same name.  I don’t expect anyone out there to beat them, but it’s a good benchmark.  Zemanta performs well on political news, but is fairly weak with sports news.  On average I would say they pick up half to two-thirds of the entities that OpenCalais does.  Zemanta does a better job on non-entity topics like “Auto Racing”.  OpenCalais has some things like “industry terms” and general topics, but it is not as comprehensive.
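The “half to two-thirds” figure is a coverage ratio: of the entities the benchmark finds, what fraction does the other extractor also find?  Here is a minimal sketch of that calculation.  The entity lists are invented sample data, not real output from either API, and the matching is a crude case-insensitive exact match.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def coverage(benchmark_entities, candidate_entities):
    """Fraction of benchmark entities that the candidate also extracted."""
    bench = {normalize(n) for n in benchmark_entities}
    cand = {normalize(n) for n in candidate_entities}
    if not bench:
        return 0.0
    return len(bench & cand) / len(bench)


# Invented sample data for one article, for illustration only.
benchmark = ["Barack Obama", "John McCain", "Ohio", "Federal Election Commission"]
candidate = ["barack obama", "Ohio", "Presidential election"]

print(coverage(benchmark, candidate))
```

Averaging this number over a set of articles gives the kind of rough figure quoted above.  Exact matching will undercount when one service merges name variants and the other does not, so treat it as a lower bound.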

For related content, I put them through a quick comparison with Sphere.  Evri doesn’t offer this feature [ed- my mistake, they do, and have for some time.  See their demo site, or their API].  You can check Sphere’s version of related articles through a demo page on their web site.  It’s a bit apples-to-oranges, since Sphere is mostly finding related blog content, targeting blog posts as close to the source article as possible.  Zemanta is pulling mainstream news, and aiming for articles that are similar but not, for example, near-identical wire articles put out by different organizations.  Sphere works well for major news articles or topics that are getting decent coverage in blogs, but tends to either get very close or miss the mark wildly.  It’s a common problem for similarity measures with high dimensionality.  Zemanta is a bit smoother.  I suspect they’re using entities to help with identifying related content.  I find their results more interesting, but it’s a subjective call.  I never trust bloggers such as myself, and find them frequently unreliable.

That’s not all that Zemanta does.  I haven’t touched on photos or suggested links.  When you compare how much they’ve done on the money they’ve raised with a company like Inform, which has gone through a few funding rounds now, it’s a nice accomplishment.  So clearly they have some talented folks working for them.  I’ll be following them closely.


The New York Times launches an API

October 15, 2008

The New York Times just launched their first API.  You can get campaign finance data and movie reviews.  The campaign finance API is based on data from the Federal Election Commission, which you can already get online elsewhere.  So the information itself is not big news.  But information is valuable in direct proportion to its accessibility, and that’s why this is so great.  This is the same principle that drives the GDP multiplier for the link economy.  If you follow their site, you have probably noticed that the Times is good at pulling information into engaging and informative Flash widgets and interactive maps.  Having the data available to the public through an API lets you do the same thing with the campaign finance and movie datasets.  The ways in which you can query the data aren’t as numerous as if you had it all housed in a relational database on your own server, but you have to make some accommodations for scalability.

The movie API is also nice, although it has a similarly narrow focus.  Based on earlier statements I expect them to launch a few more separate APIs, for things such as restaurant reviews, local listings, and recipes.  Of course I’d prefer if they released it all at once, under a uniform API.  But I hope these go well, since I would love to see all of their content available through it.  About a year ago they also pondered offering a search API, but I have heard nothing about it since.  I think they don’t realize how cruelly they tease us.

At Daylife we recently launched a service to help publishers do just this.  Our CEO Upendra Shardanand blogged about it in more detail.  The Times has a large and talented group of technologists at their disposal, so they can go it alone for a project like this.  Rolling out an API calls for a different set of skills and has different infrastructure requirements than a news portal or web site, and not all publishers will be able to do it successfully.  Those that do have the technical skills won’t be able to do it as quickly.