First Review of Zemanta

December 18, 2008

Zemanta is an interesting and rapidly improving startup that specializes in providing related content and links for blogs and articles.  They have a nice demo that should make what they’re doing instantly clear.  What prompted this review is the recent release of an API.  It provides everything you can see in their demo, making it easier for publishers to incorporate into their platforms.  If you want to tinker with it, start with the demo.

The first thing I’ll say, with respect to their blogging tools, is that I love how they’ve worked humans into the loop.  Algorithms aren’t replacing humans anytime soon, so giving bloggers an easy way to filter the suggested related content and highlighted phrases is a great move.  

As for the quality of their algorithms, I looked at entity and topic extraction, and their related articles.  For entity extraction, they follow a path similar to Evri, which I looked at earlier, in that they are leveraging Wikipedia.  A good example is Mike Wallace.  There are two of them: one is a journalist, and the other is a NASCAR driver.  You can plug some text into their API, or into the demo page, and see for yourself that their algorithm defaults to the journalist.  If you start entering terms related to stock car racing, you still get the journalist.  However, if you mention NASCAR or the Daytona 500, the algorithm resolves to the NASCAR driver.  If you mention the Indianapolis 500 instead, you still get the journalist (the Indy 500 is open-wheel racing, not stock car).  If you mention Rusty Wallace, the NASCAR driver’s older brother, GEICO, his sponsor, or Dale Earnhardt, you still get the journalist, even though Rusty and GEICO are both mentioned in Wallace’s Wikipedia entry.  So the set of terms used for disambiguation seems to be limited, but it’s pretty good.  Sampling a few articles, Evri does a better job resolving between the two Wallaces.  It’s not an exhaustive test, but I rate Evri’s disambiguation as better than Zemanta’s.  However, the range of entities and tags Zemanta extracts is more extensive.  They also offer a few nice touches like resolving “2008 European Tour” to “PGA European Tour”.   Evri doesn’t get either one.  Learning automatically to map the former to the latter would be tricky.  I suspect they’ve worked some humans into the backend somewhere, to help with disambiguation for certain entities or groups of entities.  I should also note that Evri just released a beta API.  Welcome to the neighborhood.  

I mashed up the APIs from OpenCalais and Zemanta, to compare raw entity extraction.  OpenCalais is a bit of a different beast.  They only extract raw entities, and do not, for example, merge “George W. Bush” with “President Bush”, or attempt to disambiguate entities with the same name.  I don’t expect anyone out there to beat them at raw extraction, but it’s a good benchmark.  Zemanta performs well on political news, but is fairly weak with sports news.  On average I would say they pick up half to two-thirds of the entities that OpenCalais does.  Zemanta does a better job on non-entity topics like “Auto Racing”.  OpenCalais has some things like “industry terms” and general topics, but it is not as comprehensive.  

For related content, I put them through a quick comparison with Sphere.   Evri doesn’t offer this feature [ed- my mistake, they do, and have for some time.  See their demo site, or their API].  You can check Sphere’s version of related articles through a demo page on their web site.  It’s a bit apples-to-oranges, since Sphere is mostly finding related blog content, targeting blog posts as close to the source article as possible.  Zemanta is pulling mainstream news, and aiming for articles that are similar but not, for example, near-identical wire articles put out by different organizations.  Sphere works well for major news articles or topics that are getting decent coverage in blogs, but tends either to get very close or to miss the mark wildly.  It’s a common problem for similarity measures with high dimensionality.  Zemanta is a bit smoother.  I suspect they’re using entities to help with identifying related content.  I find their results more interesting, but it’s a subjective call.  I never trust bloggers such as myself, and find them frequently unreliable.

That’s not all that Zemanta does.  I haven’t touched on photos or suggested links.  When you compare how much they’ve done on the money they’ve raised to that of a company like Inform, which has gone through a few rounds now, it’s a nice accomplishment.  So clearly they have some talented folk working for them.  I’ll be following them closely.


Google Blog Search: Not just search anymore

October 2, 2008

Today Google launched a revamped version of its Blog Search, and for the first time, it’s not just search. They’re surfacing top blog clusters.  There are some interesting reviews here and here.  I have a different take on what they’re doing and why, and I think we’ll start seeing these clusters in a few other places around their web site, like their web search page.

Right now, they have a few high-level categories, and some simple clustering to surface major topics in the blogging world.  The clusters, like this one, are sized well, give you a nice chart, and have the usual blog noise, nothing too surprising.  Clustering isn’t trivial, but they already do a fine job with news, so I’m glad to see them port some of this over to the blog section.  Before this launched, it seemed like blogs weren’t getting much love, and not fully benefiting from the skills of their Google News team.  It’s using a URL pattern for cluster IDs similar to the one Google News uses (“ncl” for news and “bcid” for blogs, meaning news cluster ID and blog cluster ID, and the numbering scheme seems the same).  Blogs are still off in their own sandbox, and it seems that it would not be too hard to link major blog clusters to major news clusters.  I suspect that, relatively soon, we’ll start seeing exactly that:  blog clusters on their News section, and on their main search page.  You can look at blog posts from within a news article cluster, but it seems to have no relation to what’s going on with their Blog Search page.

Could they have included blog search results on their main web search page now, and more heavily in their news site?  I suppose, but blog clustering makes it easier.  Much of what Google does incorporates PageRank, and blogs and news aren’t as amenable to that sort of treatment, and splogs can gum things up.  Clustering, however, is a good way to surface things that are new and significant; you just have to pick good representatives from the cluster when you decide on a title and who to link to.  You also have the problem that, in relying on clustering, you will always be lagging others, since you have to wait until the momentum has already developed.  

Right now, if you use Google to search the web for Gwen Ifill, the first result says “News results for gwen ifill”.  But if you search for Geraldo Rivera, it doesn’t have a news slug, and just links to Wikipedia.  Perfect, Geraldo isn’t big in the news right now, but Gwen Ifill is.  How did Google know that?  Clustering!  Now that they have clusters for blogs, they can do the same with them.  But they need to be cautious with their web search, so I suspect they’ll let blog clustering run for a while before incorporating it.  Before that happens, we’ll probably see them rolled out to Google News, perhaps associated with news clusters.  

Significantly, actually searching for blogs does not leverage any of their clustering work.  At the time I’m writing this, the top blog cluster is one on Gwen Ifill, but searching for “Gwen Ifill” doesn’t show the cluster.  So the underlying search seems to be unchanged.  Cluster size however can help a great deal when sorting search results.  Why aren’t they leveraging clusters with their search?  There’s probably a technical hurdle there, or perhaps they just need time to back-process all of the blogs back to 2005.

They also have an API that lets you tell them about a new blog post.  Great.  It’s not like they really need it, they’re Google after all.  But it’s a friendly thing to do, and it makes bloggers feel like they’re a bit more in control and that Google is working with them.


First Look at Evri, News Aggregator

September 24, 2008

Evri took the wraps off of their site today, and so far it looks good.  Their angle seems to be establishing connections between people, products, places, and things, what in the business are called named entities, and creating a browsable experience.  Compared to others in that market, they are doing a nice job.  It’s something that Silobreaker has been doing for a fairly long time, without significant traction.  Silobreaker has a broader range of analytics, but lacks nice widgets and entity disambiguation, and Evri’s chance at success seems better for several reasons that are detailed below.

A good starting point for Evri is their profile pages.  Start by looking at the Mike Walker page.  That page lists out articles for any Mike Walker, and offers you a selection of several individuals with that name.  Five “Mike Walker” pages are listed, but from their person finder menu I see the following six:

Mike Walker, Author and Journalist
Mike Walker, Coach and Soccer Player
Mike Walker, Football Player
Mike Walker, Musician
Mike Walker, Playwright
Mike Walker, Football Player

These map directly to Wikipedia entries, with Wikipedia’s parenthetical notations converted to a more naturally readable form.  There are two “Football Player” entries because there were two football players with that name.  One is listed as “Mike Walker (American football)”, the other “Mike Walker (Canadian football)”.  The last one has no traffic.  

Mike Walker the playwright is a very interesting case.  One of the links is to this Wikipedia page, which mentions Mike Walker without linking to him, and which has the word “play” in the vicinity of his name.  Also on the playwright’s page is a link to this news article.  It clearly should have been linked to the football player, but you’ll notice that the word “play” appears fairly close to Mike Walker’s name.  The language of the document as a whole, however, is easily distinguished as a football piece and not one on plays or playwrights, except perhaps for a play about football.  

So here’s what I think they’re doing with topics.  They scrape Wikipedia for a bunch of people, and fit them into categories, perhaps a few hundred.  Football players, soccer players and coaches, musicians, playwrights, and so on.  For example, Joe Biden is listed as “lawyer and U.S. politician”, right out of the sidebar on his Wikipedia entry.  The playwright is filed under “Mike Walker (radio dramatist)”, but is assigned the category “British dramatists and playwrights”.  So you pigeonhole him in with all of the other playwrights.  This is a decent amount of manual work, but far more tractable than trying to do it all manually, and far more accurate than doing it all algorithmically.  In the case of playwrights, soccer players, and so on, you define some words that will tend to isolate the type of person.  For playwrights, the word “play”.  Score each Mike Walker entry based on the words in the vicinity of the name, and the words attached to the category of each candidate.   You get fairly good precision deciding between entities.  But sometimes word sense will mess you up, like the word “play” in the football article resulting in an assignment to the playwright.  Still, I think it’s a very nice solution.
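To make that guess concrete, here is a minimal sketch in Python of scoring candidates by category cue words near the name.  It illustrates the idea as I described it, not Evri’s actual method: the cue lists, the window size, and the names `CATEGORY_CUES` and `disambiguate` are all invented for the example.

```python
import re

# Hypothetical cue words per category (assumed for illustration only).
CATEGORY_CUES = {
    "playwright": {"play", "stage", "theatre", "drama"},
    "football player": {"football", "quarterback", "touchdown", "yards"},
}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def disambiguate(text, name, candidates, window=5):
    """Count category cue words within `window` tokens of each mention of `name`."""
    tokens = tokenize(text)
    name_tokens = tokenize(name)
    n = len(name_tokens)
    scores = {c: 0 for c in candidates}
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != name_tokens:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + n + window)
        # Words before and after the mention, excluding the name itself.
        context = tokens[lo:i] + tokens[i + n:hi]
        for c in candidates:
            scores[c] += sum(1 for t in context if t in CATEGORY_CUES[c])
    return scores

candidates = ["playwright", "football player"]
good = disambiguate(
    "Quarterback Mike Walker threw for 300 yards and a touchdown.",
    "Mike Walker", candidates)
bad = disambiguate(
    "Mike Walker made the play of the game on Sunday.",
    "Mike Walker", candidates)
print(good, bad)
```

The second sentence is a football sentence, but the only cue word near the name is “play”, so the playwright wins.  That is the same word-sense failure as in the news article above.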

The images and video are somewhat lacking.  You will notice that all the Mike Walkers have exactly the same images and videos, and in fact Silobreaker has exactly the same videos as Evri, and they’re just about all unrelated and of poor quality, and that goes for Evri’s images as well.  That’s one of the problems with matching between data sets.  I hear the Semantic Web will eradicate this type of problem at some point in the distant future and bring peace to all the peoples of the earth.

The related content widget is nice, but mostly a nice UI for what they are already doing to support the profile pages, so I won’t discuss it in too much detail.  You stuff some text through your topic extraction pipeline, and use it to get related content.

Also, as a side note, I just want to say what an amazing thing Wikipedia is for getting sites like this off the ground.  So many of the companies in this field have used it, and it’s great to see all of the human effort curating that site get leveraged across the industry.

– Sept 26 –
I realized subsequently that they enumerate what they call “taxonomical paths”, which are the categories referenced above that likely help with disambiguating entities.  The list is available on their web site.  Great transparency.


Google Labs InQuotes Feature

September 24, 2008

The Google News group released a nice new InQuotes feature today.  It’s a nice interface.  I wish we had done it here at Daylife, since our API would let you do exactly what they are doing.  

Quote extraction itself is not too difficult a problem, although attribution can be tricky.  If you can live with low recall, you can look for “‘blah blah’, said Sam Peckinpah”, and simple patterns like that.  The English language, however, admits an enormous number of ways to attribute a quote to someone, or to a pronoun or name fragment representing someone.  The other thing I’ll say about quote attribution: never get it wrong.  Or almost never.  That’s quote extraction and attribution in a nutshell.  
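As a rough illustration of the simple-pattern approach, here is a minimal Python sketch that handles exactly one pattern: a straight-double-quoted span followed by “said” and a capitalized multi-word name.  The regular expression, the sample text, and the name `extract_quotes` are my own assumptions for the example, and it has nothing to do with how Google does it.

```python
import re

# One attribution pattern only: "quoted text," said Firstname Lastname
QUOTE_SAID_NAME = re.compile(
    r'"(?P<quote>[^"]+?),?"\s*,?\s*said\s+'
    r'(?P<speaker>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)'
)

def extract_quotes(text):
    """Return (speaker, quote) pairs for the single pattern above."""
    return [
        (m.group("speaker"), m.group("quote").strip())
        for m in QUOTE_SAID_NAME.finditer(text)
    ]

sample = (
    '"We shot it in forty days," said Sam Peckinpah. '
    '"Never again," she added. '
    'According to the director, "it was a hard shoot."'
)
quotes = extract_quotes(sample)
print(quotes)
```

Only the first of the three quotes is picked up.  The pronoun attribution and the “according to” construction are missed, which is the low recall I mean, but what the pattern does return is unlikely to be misattributed.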

Nice interface aside, the interesting thing here is that Google News did it.  It’s the fun sort of thing I’d expect from a smaller, more nimble company.  Like us.  You can’t take yourself too seriously if you want to put stuff like this out, and you have to weather complaints about excluding candidates.


Web Taxonomies: File under “D” for Dead.

August 22, 2008

We had an interesting discussion about taxonomies today: when they are useful, and whether they are useful at all.  Some of our competitors have invested heavily in them, and sometimes we run across a client that wants them.

I suppose first I should define what I mean by taxonomy when it comes to news.  It refers to placing articles or topics in a hierarchical structure.  For example, senators are a subgroup of politicians, who are a subgroup of people.  An article on the war in Iraq might be filed under Conflicts, which itself is filed under International Affairs.  Each topic has at most one parent, and may have many children.

Most news sites and print publications have some structure by which they categorize news.  The New York Times has a dozen or so sections, for Technology, Business, Sports, and so on.  But in common parlance most would not call this a taxonomy, and neither shall we.  It’s a high-level classification, and once you dive into the Technology section, for example, there are few if any subgroups.  Their Arts section, for example, is broken down into Books, Movies, Music, Television, and Theater.  All in all, there are only 25 groupings on their web site, of which 13 are top-level groups.  I suspect it’s a reflection of how a print paper might be organized to make it easier to read.  You want Business news?  Pull out the Business section, and you get a dozen or so pages.  From there, perhaps there are a few subsections that take up a page or two.

What most mean by taxonomy is best demonstrated by Inform.  If you go to their US Politics page, you have a couple of top-level categories, each of which has ten or so sub-categories.  If you go into one of these subcategories, you have options for navigating further down the tree into additional subtopics.  For example, you can go from US Politics, to Great Plains States, to Kansas.  In this case, the hierarchy is three deep.  If each parent had ten child nodes, and everything went three deep, that would give them about a thousand topics.  I don’t know the exact count, but the point is it’s a lot, far more than the New York Times.

Is it useful?  Say I want news on the NBA.  In the case of Inform, I mouse-over Sports, click on Basketball.  Then I select the NBA subtopic.  Now I’m at the NBA page.  Or I enter “NBA” into the search box, and it takes me to the NBA topic page.  From there I can refine by different divisions.  It’s an OK experience.

Say however I don’t have a taxonomy stored away in some database.  I enter NBA, and go to an NBA topic page.  Then I calculate on-the-fly what other topics happen to be related.  That fetches topics for players, teams, and perhaps for the past week the Beijing Olympics.  That calculation is based solely on what other topic assignments are made to the articles assigned to the NBA topic.  It will surface many of the same relationships that a taxonomy would, plus a lot of other things that it would not, like the Olympics where the US team and a number of NBA stars are competing.  Score one for the computer.
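Here is a minimal sketch of that on-the-fly calculation in Python.  It assumes each article is already tagged with a list of topics; the sample articles and the function name `related_topics` are made up for illustration, and a real system would at least restrict the articles to a recent time window.

```python
from collections import Counter

def related_topics(article_topics, topic, top_n=5):
    """Rank the topics that co-occur with `topic` on the same articles."""
    counts = Counter()
    for topics in article_topics:
        if topic in topics:
            counts.update(t for t in topics if t != topic)
    return counts.most_common(top_n)

# Hypothetical topic assignments for the past week's articles.
articles = [
    ["NBA", "Kobe Bryant", "Beijing Olympics"],
    ["NBA", "LeBron James", "Beijing Olympics"],
    ["NBA", "Kobe Bryant", "Los Angeles Lakers"],
    ["NBA", "Beijing Olympics"],
    ["Beijing Olympics", "Michael Phelps"],
]
nba_related = related_topics(articles, "NBA", top_n=2)
print(nba_related)
```

No hierarchy is stored anywhere.  The Olympics show up next to the NBA only because this week’s articles happen to carry both topics, and they will drop off again once the coverage moves on.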

These on-the-fly relationships have some advantages, such as making associations based on what’s happening now in the news.  You can of course combine that with a taxonomy, but having two ways to navigate would probably be confusing: a rigid taxonomy next to a more fluid relationship engine.  On-the-fly relationships are also easier to maintain.  Constructing and maintaining taxonomies is something that humans do.  I would rather have the humans maintain and create new topics, and let algorithms determine which are related.  Taxonomies are dead.  Many years ago, when web sites were less dynamic, and high-traffic sites couldn’t perform complicated calculations every time someone looked at a page, they made some sense, since you could define all of the relationships ahead of time.  So Moore’s Law killed them.  As the most prominent example, look at the Yahoo! Directory.  When was the last time you used it?  The human-edited taxonomy was their first offering, but most people now go to their Search feature.  Look also at Wikipedia.  There is a fairly light taxonomy for most topics.  If you go to the George W. Bush page, there are categories at the bottom that constitute a hierarchy.  Now tell me, do you ever use them?  I never have.

I have seen a few sites where a taxonomic structure might appear to be useful.  One of them, powered by Kosmix, is RightHealth.  It is a health-focused web site that for any topic has a list of related drugs and organizations.  If you go to the Diabetes page, on the right are related topics broken down by type, for example Drugs, Health Associations, and Diseases.  That could be accomplished by a taxonomy of entities (Products -> drugs), but would be a lot easier if you just used tags, and tagged certain entities as drugs.  You can pull in related topics tagged as drugs on-the-fly.  Not a taxonomy, there is no hierarchy.  It’s just algorithms and tagging.
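A minimal sketch of the tag-based version in Python: related topics are computed from co-occurrence on articles, then bucketed by flat tags.  The tag table, the sample articles, and the function name `related_by_tag` are assumptions for illustration; I have no idea how RightHealth or Kosmix actually build their pages.

```python
from collections import Counter, defaultdict

# Hypothetical flat tags on topics; there is no parent/child structure.
TOPIC_TAGS = {
    "Insulin": {"drug"},
    "Metformin": {"drug"},
    "American Diabetes Association": {"health association"},
    "Obesity": {"disease"},
}

def related_by_tag(article_topics, topic, topic_tags):
    """Group topics that co-occur with `topic` by their flat tags."""
    counts = Counter()
    for topics in article_topics:
        if topic in topics:
            counts.update(t for t in topics if t != topic)
    grouped = defaultdict(list)
    # most_common() orders by co-occurrence count, highest first.
    for related, _count in counts.most_common():
        for tag in sorted(topic_tags.get(related, ())):
            grouped[tag].append(related)
    return dict(grouped)

# Made-up topic assignments for a handful of articles.
articles = [
    ["Diabetes", "Insulin", "Obesity"],
    ["Diabetes", "Insulin", "American Diabetes Association"],
    ["Diabetes", "Metformin"],
    ["Asthma", "Albuterol"],
]
sidebar = related_by_tag(articles, "Diabetes", TOPIC_TAGS)
print(sidebar)
```

Adding a new sidebar section is then a matter of tagging some topics, not of deciding where a new branch belongs in a tree.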

So taxonomies are dead.  Show me an example where they might be used, and I’ll show you a better way with on-the-fly relationships and a flat topic or tagging structure.  Counter-examples and arguments are welcome.