First Review of Zemanta

December 18, 2008

Zemanta is an interesting and rapidly improving startup that specializes in providing related content and links for blogs and articles.  They have a nice demo that should make what they’re doing instantly clear.  What prompted this review is the recent release of an API.  It provides everything you can see in their demo, making it easier for publishers to incorporate Zemanta into their own platforms.  If you want to tinker with it, start with the demo.

The first thing I’ll say, with respect to their blogging tools, is that I love how they’ve worked humans into the loop.  Algorithms aren’t replacing humans anytime soon, so giving bloggers an easy way to filter related content and highlighted phrases is a great move.

As for the quality of their algorithms, I looked at entity and topic extraction, and their related articles.  For entity extraction, they follow a path similar to Evri, which I looked at earlier, in that they are leveraging Wikipedia.  A good example is the name Mike Wallace, which belongs to both a journalist and a NASCAR driver.  You can plug some text into their API, or into the demo page, and see for yourself that their algorithm defaults to the journalist.  If you start entering terms related to stock car racing, you still get the journalist.  However, if you mention NASCAR or the Daytona 500, the algorithm resolves to the NASCAR driver.  If you mention the Indianapolis 500 instead, you still get the journalist (the Indy 500 is open-wheel racing, not stock car).  If you mention Rusty Wallace, the NASCAR driver’s older brother, GEICO, his sponsor, or Dale Earnhardt, you still get the journalist, even though Rusty and GEICO are both mentioned in the driver’s Wikipedia entry.  So the terms used for disambiguation seem to be limited, but it’s pretty good.  On a sample of a few articles, Evri does a better job resolving between the two Wallaces.  It’s not an exhaustive test, but I rate Evri’s disambiguation as better than Zemanta’s.  However, the range of entities and tags Zemanta extracts is more extensive.  They also offer a few nice touches like resolving “2008 European Tour” to “PGA European Tour”.   Evri doesn’t get either one.  Learning automatically to map the former to the latter would be tricky.  I suspect they’ve worked some humans into the backend somewhere, to help with disambiguation for certain entities or groups of entities.  I should also note that Evri just released a beta API.  Welcome to the neighborhood.
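
To make the Mike Wallace behavior concrete, here is a toy sketch of context-based disambiguation in Python.  It is not Zemanta’s algorithm, which I haven’t seen.  The candidate labels, the cue terms, and the choice of default are all my own assumptions, picked only to mimic what I observed: a short list of cues tips the result to the driver, and everything else falls back to the journalist.

```python
# Toy disambiguation: pick a candidate by counting cue terms found in the text.
# The cue lists and the default below are assumptions, not Zemanta's real data.
CANDIDATES = {
    "Mike Wallace (journalist)": {"60 minutes", "cbs"},
    "Mike Wallace (racing driver)": {"nascar", "daytona 500"},
}
DEFAULT = "Mike Wallace (journalist)"


def disambiguate(text, candidates=CANDIDATES, default=DEFAULT):
    """Return the candidate whose cue terms appear most often in the text."""
    lowered = text.lower()
    scores = {
        name: sum(cue in lowered for cue in cues)
        for name, cues in candidates.items()
    }
    top = max(scores.values())
    winners = [name for name, score in scores.items() if score == top]
    # With no evidence, or a tie, fall back to the default reading.
    if top == 0 or len(winners) > 1:
        return default
    return winners[0]


print(disambiguate("Mike Wallace finished fourth at the Daytona 500."))
print(disambiguate("Mike Wallace, Rusty Wallace and GEICO at the Indianapolis 500."))
```

With a cue list this short, mentions of Rusty Wallace or GEICO change nothing, which is the same limitation I ran into above.  Widening the cue list from the Wikipedia entry would be the obvious next step.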

I mashed up the APIs from OpenCalais and Zemanta, to compare raw entity extraction.  OpenCalais is a bit of a different beast.  They only extract raw entities, and do not, for example, merge “George W. Bush” with “President Bush”, or attempt to disambiguate entities with the same name.  I don’t expect anyone out there to beat them, but it’s a good benchmark.  Zemanta performs well on political news, but is fairly weak with sports news.  On average I would say they pick up half to two-thirds of the entities that OpenCalais does.  Zemanta does a better job on non-entity topics like “Auto Racing”.  OpenCalais has some things like “industry terms” and general topics, but it is not as comprehensive.
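
For anyone who wants to repeat the comparison, the “half to two-thirds” figure is just a coverage ratio: of the entities the benchmark found, what fraction did the other service also find?  Here is a minimal Python sketch of that calculation.  It makes no API calls, the function names are mine, and the entity lists are made up for illustration rather than taken from either service.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def coverage(reference, candidate):
    """Fraction of reference entities that also appear in candidate."""
    ref = {normalize(name) for name in reference}
    cand = {normalize(name) for name in candidate}
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)


# Made-up entity lists standing in for the output of two extractors.
benchmark_entities = ["George W. Bush", "Dick Cheney", "Iraq"]
other_entities = ["george w. bush", "Iraq ", "Auto Racing"]

print(coverage(benchmark_entities, other_entities))
```

Exact string matching is crude.  Since OpenCalais does not merge “George W. Bush” with “President Bush”, a fair comparison would need some alias handling before counting.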

For related content, I put them through a quick comparison with Sphere.   Evri doesn’t offer this feature [ed- my mistake, they do, and have for some time.  See their demo site, or their API].  You can check Sphere’s version of related articles through a demo page on their web site.  It’s a bit apples-to-oranges, since Sphere is mostly finding related blog content, targeting blog posts as close to the source article as possible.  Zemanta is pulling mainstream news, and aiming for articles that are similar but not, for example, near-identical wire articles put out by different organizations.  Sphere works well for major news articles or topics that are getting decent coverage in blogs, but tends to either get very close or miss the mark wildly.  It’s a common problem for similarity measures with high dimensionality.  Zemanta is a bit smoother.  I suspect they’re using entities to help with identifying related content.  I find their results more interesting, but it’s a subjective call.  I never trust bloggers such as myself, and find them frequently unreliable.
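
Here is a rough Python sketch of what I mean by using entities for related content.  This is pure speculation on my part, not Zemanta’s method: it scores articles by Jaccard overlap of their entity sets, and drops anything that overlaps so heavily it is probably the same wire story.  The thresholds, titles, and entity sets are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two entity collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related(source_entities, candidates, low=0.2, high=0.9):
    """Rank candidate titles by entity overlap, skipping near-duplicates.

    low and high are arbitrary cutoffs: below low is unrelated,
    at or above high is treated as a near-identical copy.
    """
    scored = [
        (jaccard(source_entities, entities), title)
        for title, entities in candidates.items()
    ]
    scored.sort(reverse=True)
    return [title for score, title in scored if low <= score < high]


source = {"NASCAR", "Daytona 500", "Mike Wallace"}
articles = {
    "wire copy": {"NASCAR", "Daytona 500", "Mike Wallace"},
    "race preview": {"NASCAR", "Daytona 500", "Dale Earnhardt"},
    "election recap": {"George W. Bush"},
}

print(related(source, articles))
```

Working over a handful of entities rather than every word in the article is one way to get the smoother behavior I saw, since the comparison happens in a much smaller space.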

That’s not all that Zemanta does.  I haven’t touched on photos or suggested links.  When you compare how much they’ve done on the money they’ve raised to that of a company like Inform, which has gone through a few rounds now, it’s a nice accomplishment.  So clearly they have some talented folk working for them.  I’ll be following them closely.