First Look at Evri, News Aggregator

September 24, 2008

Evri took the wraps off of their site today, and so far it looks good.  Their angle seems to be establishing connections between people, products, places, and things, what in the business are called named entities, and creating a browsable experience.  Compared to others in that market, they are doing a nice job.  It’s something that Silobreaker has been doing for a fairly long time, without significant traction.  Silobreaker has a broader range of analytics, but no nice widgets or entity disambiguation, and Evri’s chance at success seems better for several reasons that are detailed below.

A good starting point for Evri is their profile pages.  Start by looking at the Mike Walker page.  That page lists out articles for any Mike Walker, and offers you a selection of several individuals with that name.  Five “Mike Walker” pages are listed, but from their person finder menu I see the following six:

Mike Walker, Author and Journalist
Mike Walker, Coach and Soccer Player
Mike Walker, Football Player
Mike Walker, Musician
Mike Walker, Playwright
Mike Walker, Football Player

These map directly to Wikipedia entries, with Wikipedia’s parenthetical notations converted to a more naturally readable form.  There are two “Football Player” entries because there were two football players with that name.  One is listed as “Mike Walker (American football)”, the other “Mike Walker (Canadian football)”.  The last one has no traffic.  

Mike Walker the playwright is a very interesting case.  One of the links is to this Wikipedia page, which mentions Mike Walker without linking to him, but has “play” in the vicinity of his name.  Also on the playwright’s page is a link to this news article.  It clearly should have been linked to the football player, but you’ll notice that the word “play” appears fairly close to Mike Walker’s name.  The language of the document as a whole however is easily distinguished as a football piece and not one on plays or playwrights, except perhaps for a play about football.  

So here’s what I think they’re doing with topics.  They scrape Wikipedia for a bunch of people, and fit them into categories, perhaps a few hundred.  Football players, soccer players and coaches, musicians, playwrights, and so on.  For example, Joe Biden is listed as “lawyer and U.S. politician”, right out of the sidebar on his Wikipedia entry.  The playwright is filed under “Mike Walker (radio dramatist)”, but is assigned the category “British dramatists and playwrights”.  So you pigeonhole him in with all of the other playwrights.  This is a decent amount of manual work, but far more tractable than trying to classify every person by hand, and far more accurate than doing it all algorithmically.  In the case of playwrights, soccer players, and so on, you define some words that will tend to isolate the type of person.  For playwrights, the word “play”.  Score each Mike Walker entry based on the words in the vicinity of the name, and the words attached to the category of each candidate.  You get fairly good precision deciding between entities.  But sometimes word sense will mess you up, like the word “play” in the football article resulting in an assignment to the playwright.  Still, I think it’s a very nice solution.
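Here is a rough sketch of that kind of scoring, to make the guess concrete.  Everything in it is an assumption on my part: the keyword lists, the window size, and the function name are made up for illustration, and I have no idea whether Evri does anything like this.  It counts category keywords within a few words of each mention of the name.

```python
import re

# Hypothetical category keywords; the real lists (if any exist) are unknown.
CATEGORY_WORDS = {
    "Playwright": {"play", "stage", "drama", "theatre"},
    "Football Player": {"play", "touchdown", "quarterback", "yards", "season"},
    "Musician": {"album", "guitar", "band", "tour"},
}

def score_candidates(text, name, window=10):
    """Count category keywords within `window` words of each mention of `name`."""
    words = re.findall(r"[a-z']+", text.lower())
    name_words = name.lower().split()
    n = len(name_words)
    scores = {category: 0 for category in CATEGORY_WORDS}
    for i in range(len(words) - n + 1):
        if words[i:i + n] != name_words:
            continue
        # Words just before and just after this mention of the name.
        nearby = words[max(0, i - window):i] + words[i + n:i + n + window]
        for category, keywords in CATEGORY_WORDS.items():
            scores[category] += sum(1 for w in nearby if w in keywords)
    return scores

article = ("Mike Walker made a big play in the fourth quarter, "
           "running forty yards for a touchdown.")
narrow = score_candidates(article, "Mike Walker", window=5)
wide = score_candidates(article, "Mike Walker", window=50)
```

With the narrow window the playwright and the football player tie on the word “play”, which is the failure described above.  Widening the window to take in the whole sentence lets “yards” and “touchdown” break the tie in favor of the football player.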

The images and video are somewhat lacking.  You will notice that all of the Mike Walkers have exactly the same images and videos, and in fact Silobreaker has exactly the same videos as Evri.  They’re just about all unrelated and of poor quality, and that goes for Evri’s images as well.  That’s one of the problems with matching between data sets.  I hear the Semantic Web will eradicate this type of problem at some point in the distant future and bring peace to all the peoples of the earth.

The related content widget is nice, but mostly a nice UI for what they are already doing to support the profile pages, so I won’t discuss it in too much detail.  You stuff some text through your topic extraction pipeline, and use it to get related content.

Also, as a side note, I just want to say what an amazing thing Wikipedia is for getting sites like this off the ground.  So many of the companies in this field have used it, and it’s great to see all of the human effort curating that site get leveraged across the industry.

– Sept 26 –
I realized subsequently that they enumerate what they call “taxonomical paths”, which are the categories referenced above that likely help with disambiguating entities.  The list is available on their web site.  Great transparency.


AdaptiveBlue SmartLinks

September 19, 2008

It’s late on a Friday, and I’ve lately been mesmerized by the near collapse of the financial sector, but I thought I would toss out a quick review of AdaptiveBlue and a guess at what they’re doing under the hood.  They are another entry into the increasingly crowded market of automatically marking up text with links.  

The product that is most interesting is their SmartLinks.  Automatically marking up blog entries, or decorating links with pop-up information and links, is not particularly easy.  Most that I see are doing some sort of entity extraction or topic identification.  That can run the gamut from numbingly complicated to numbingly simple.  On the complicated end, you can use computationally expensive statistical techniques that attempt to disambiguate different people with the same name, and identify nicknames and shortened versions.  On the simple end, you can build up a list of names, say from a resource like Wikipedia as a reference, build a giant regular expression, and do string matching.  Massive regular expressions can run quickly, say 50 thousand names in a hundred milliseconds or so.  In either case, it’s a noisy process.  
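For the simple end of that spectrum, a minimal sketch of the giant-regular-expression approach might look like the following.  The name list and function names are placeholders I made up, and this toy says nothing about the timing figure above, which depends heavily on the regex engine you use.

```python
import re

# Hypothetical name list; in practice this might come from Wikipedia titles.
NAMES = ["Joe Biden", "Mike Walker", "Tim Robbins", "George W. Bush"]

def build_name_pattern(names):
    """Compile one alternation that matches any listed name as a whole phrase."""
    # Longest first, so a longer name wins over a shorter name that is its prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # The lookarounds keep "Joe Biden" from matching inside "Joe Bidenson".
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

NAME_PATTERN = build_name_pattern(NAMES)

def find_names(text):
    """Return every listed name found in the text, in order of appearance."""
    return [m.group(0) for m in NAME_PATTERN.finditer(text)]

found = find_names("Tim Robbins met Joe Biden, not Joe Bidenson.")
```

This does plain string matching only.  It has no idea which Mike Walker it found, which is where the noise comes from.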

AdaptiveBlue has a yet simpler method for decorating these links, and simple is good.  They don’t try to identify for example what words represent a book or person.  They look at the links a human has already added, and the target of the link.  If you link the name of a company to a finance site, they generate a pop-up with simple information and options for navigation.  Say someone is linking to stockpkr.com.  I prefer Yahoo Finance, so I’m going there.  One disadvantage is that it requires another click to get there, but for certain narrow sites I can see an advantage.  Movie reviews, music, or financial sites all seem like good candidates.

How do they do it?  Take links to financial sites as an example.  First identify a set of sites that financial bloggers like to link to.  Maybe you come up with a dozen.  When you see a link to one of those sites, there are a couple of strategies.  Grabbing the anchor text won’t always be illuminating, since you might just link to this lousy, wretched book without mentioning its name.  However, in the case of books, for the two sites they support the ISBN can be extracted from the target URL.  Given the ISBN, you can pull in related information like an image of the cover and a description from Amazon, and generate links out to other sites.  For books the links out need not be based on the ISBN, since you now know with complete certainty the title and author from Amazon.  If the ISBN isn’t in the URL, there’s a chance you can still scrape it from the destination page, or set up other extraction algorithms, although I’m not sure if they’re going this far.  As a last resort, you can use the anchor text, but as mentioned that won’t always be reliable.
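To illustrate the ISBN step, here is a small sketch that scans the path of a link target for something that passes the ISBN-10 check digit test.  The URL layout and the example.com address are assumptions for illustration only; I don’t know which URL patterns AdaptiveBlue actually handles.  The check digit rule itself is the standard one: weight the ten characters 10 down to 1, and the total must be divisible by 11.

```python
import re
from urllib.parse import urlparse

def is_valid_isbn10(candidate):
    """Check the ISBN-10 check digit (weights 10..1, total divisible by 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", candidate):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(candidate))
    return total % 11 == 0

def isbn_from_url(url):
    """Return the first path segment that is a valid ISBN-10, else None."""
    for segment in urlparse(url).path.split("/"):
        segment = segment.upper()
        if is_valid_isbn10(segment):
            return segment
    return None

# Hypothetical link target; the path layout is assumed, not taken from a real site.
isbn = isbn_from_url("https://books.example.com/dp/0306406152/ref=x")
```

Validating the check digit is what keeps precision high: a random ten-character path segment will usually fail it, so you rarely attach a book pop-up to a link that isn’t a book.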

This isn’t terribly earth-shattering.  They bust through some domains like music, finance, books, and hand-craft some retrieval rules.  ISBN for books, ticker symbols for stocks, with some more general methods to backstop if they prove to be of sufficiently high fidelity.  There are many companies out there in the business of marking up or otherwise decorating content with links and additional information.  Many of them opt for natural language processing techniques.  You can for example try to identify all of the book titles from a blog post, and highlight and link them.  AdaptiveBlue’s approach is simpler and cleaner, and will give you pretty good precision and recall.  The writer is in control over what gets linked and the target provides great context.  Is it generally applicable?  No.  Can you drop it onto any blog on any topic and get good results?  No.  It targets a few niches, and handles those niches well.  The development effort is relatively small.  That will carry the day over complicated general-purpose techniques.

The Firefox plugin I’m not so hot on.  Related pages from Sphere and Google are pretty simple to build in; both have URLs that will fetch related content automatically, which isn’t particularly helpful, or related sites, which are usually even less helpful.  Saving and sharing aren’t things I typically do, and many sites have plugins to help with things of this nature.


Kosmix part II

September 10, 2008

We discussed in the earlier post how Kosmix might identify related topics.  Another of their nice features, and the one that in my opinion really makes them interesting, is that the third party modules included on the page depend on the topic.  Users also have the ability to indicate whether a particular module, for example photos from Daylife, is relevant to the topic.  (skip to the end where it says SUMMARY if you don’t want all the boring details)

They already have a database of topics and their Wikipedia category.  They also quite likely have the capability to assign modules on a per-topic basis, given that they are allowing users to tweak them.   

How do they decide what modules to include?  One could imagine setting up several initial module mixtures for political figures, movies, companies, and so on, and making a preliminary assignment to each topic. Since most of their topics are mapped to Wikipedia entries, that would be an easy way to bootstrap module assignment, since using Wikipedia’s categories reduces the size of the problem greatly.  Wikipedia’s categories aren’t great, and they’re not intended to be a taxonomy or hierarchy, but they could save some work.

You could also base it on attributes of the Wikipedia entry.  Even if the categories in Wikipedia don’t exactly match up to what you want, the entries have a uniformity that makes other types of classification easy.  People for example almost always have a birth date specified in the first paragraph which is easy to pick out.  It would also be easy to hand-craft rules to identify movies.  A few dozen of these, and you’re probably looking good.  
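A couple of hand-crafted rules of this sort could be as simple as the sketch below.  The two patterns and the labels are my own guesses at what such rules might look like, not anything Kosmix has published, and the movie sentence in the example is a made-up placeholder.

```python
import re

# Hypothetical hand-written rules, tried in order; the first match wins.
RULES = [
    # "(born November 20, 1942)" style parenthetical near the top of a biography.
    ("person", re.compile(r"\(born\b[^)]*\b\d{4}\b[^)]*\)")),
    # "... is a 1994 drama film ..." style opening sentence.
    ("movie", re.compile(r"\bis an? \d{4} [^.]*\bfilm\b")),
]

def classify(first_paragraph):
    """Return a coarse label for an entry based on its first paragraph."""
    for label, pattern in RULES:
        if pattern.search(first_paragraph):
            return label
    return "other"

label = classify("Joe Biden (born November 20, 1942) is an American politician.")
```

The “born” rule only catches living people as written; a second rule for a birth and death date range would be the obvious next one to add.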

There are other ways of doing this of course, well into the region of diminishing returns.  Since, as per the previous post, they likely have a set of documents associated with the topic to facilitate exposing related topics, you might consider using a machine learning technique to identify various types of topics.  This could be trained using high-reliability assignments from Wikipedia, or through more manual means.  For example, if you were able to identify 500 entries that you knew were musical bands, you could use that information to train a simple classifier and make additional assignments.  But I would skip all of this machine-learning stuff.  A more direct question is, does the module have anything interesting for the topic?

As discussed before, their index might be relatively small and focused on certain sites.  Say IMDB is in their index.  If a topic hits with IMDB and scores high, identify it as a movie or actor, or just have a separate profile of modules for anything that hits with IMDB.  But most of the modules aren’t sites per se, but other search engines.  You wouldn’t build up a search index for MeeHive.  Assigning categories based on your index doesn’t seem like a great idea either, too noisy.  So I don’t think they’re doing any of this.

Here is a simple method.  Look at the results coming out of the third party searches, and use that to determine whether to include the module.  If you search YouTube for “george bush flagitious” and come up with nothing, or with low-quality video (old, few views, low ratings) don’t include the module.  The HTML being served up by their site already has the results baked into it — they aren’t scripts or widgets that populate separately.  So you can throw 50 modules onto the George W. Bush page, give each of them 1 second to respond, and display whatever gets back to you in time and seems to be of high quality.  Cache the page for a few minutes, so that popular ones load quickly.  It also gives you a way of populating modules intelligently even if someone searches for “george bush flatigious”.
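As a sketch of that fan-out, the following calls every module in parallel, waits up to a deadline, and keeps only the ones that came back in time and look good enough.  The module functions, the quality score between 0 and 1, and the thresholds are all stand-ins I invented; page caching is left out to keep it short.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fetch_modules(query, modules, deadline=1.0, min_quality=0.5):
    """Call every module in parallel; keep those that answer in time and score well.

    `modules` maps a module name to a function: query -> (html, quality in [0, 1]).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(modules)))
    futures = {pool.submit(fn, query): name for name, fn in modules.items()}
    done, _late = wait(list(futures), timeout=deadline)
    pool.shutdown(wait=False)  # do not hold up the page for slow modules
    kept = {}
    for future in done:
        if future.exception() is not None:
            continue  # a failing module is simply left off the page
        html, quality = future.result()
        if quality >= min_quality:
            kept[futures[future]] = html
    return kept

# Stand-in modules (hypothetical): one good, one low quality, one too slow.
def fast_good(query):
    return ("<div>videos for %s</div>" % query, 0.9)

def fast_poor(query):
    return ("<div>junk</div>", 0.1)

def too_slow(query):
    time.sleep(0.5)
    return ("<div>late</div>", 0.9)

page = fetch_modules(
    "george bush",
    {"videos": fast_good, "howto": fast_poor, "photos": too_slow},
    deadline=0.1,
)
```

Only the “videos” module makes it onto the page here: “howto” answers in time but scores too low, and “photos” misses the deadline.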

That sounds easy, but not really.  Not all services will report a relevance score.  YouTube is easy, they’ll report number of views and ratings, and age.  So figuring out whether a module is returning good stuff might be difficult.  You can at least scan for the terms you put in.  If you can’t get a handle on relevance, you can push them further down on the page.  The George Bush topic page has a module for How-To videos at the bottom.  Not a huge deal, maybe someday users will help you clean it up.  You also don’t want the third party services to start hating you.  The cache will help there, but you might also want to record how often the results seem relevant, and use that to determine whether to even call it in the first place.  If How-To videos for Bush seem lousy, stop asking for them, but give them another chance in a day or two.
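The “stop asking, then try again in a day or two” bookkeeping might be sketched like this.  The class name, thresholds, and the one-day cooldown are assumptions for illustration; the clock is passed in explicitly so the behavior is easy to check.

```python
class ModuleTracker:
    """Track how often a module looks relevant for a topic and pause poor performers."""

    def __init__(self, min_calls=5, min_rate=0.2, cooldown=86400.0):
        self.min_calls = min_calls    # calls to observe before judging
        self.min_rate = min_rate      # minimum share of relevant responses
        self.cooldown = cooldown      # seconds to wait before retrying
        self.stats = {}               # (topic, module) -> [calls, relevant]
        self.paused_until = {}        # (topic, module) -> timestamp

    def record(self, topic, module, relevant, now):
        """Record one call and pause the module if it keeps missing."""
        key = (topic, module)
        counts = self.stats.setdefault(key, [0, 0])
        counts[0] += 1
        counts[1] += 1 if relevant else 0
        calls, hits = counts
        if calls >= self.min_calls and hits / calls < self.min_rate:
            self.paused_until[key] = now + self.cooldown
            self.stats[key] = [0, 0]  # start fresh after the pause

    def should_call(self, topic, module, now):
        """True unless the module is currently paused for this topic."""
        return now >= self.paused_until.get((topic, module), 0.0)

tracker = ModuleTracker(min_calls=3, min_rate=0.5, cooldown=100.0)
for _ in range(3):
    tracker.record("George Bush", "how-to videos", relevant=False, now=0.0)
```

After three irrelevant responses the how-to module is skipped for the George Bush topic until the cooldown runs out, which also cuts down on calls to the third party.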

One other important data point, the Microsoft topic page has a couple of financial news outlets on it.  These outlets will also surely serve up hits for George Bush, and just about anything else, but they seem to be isolated to company topics.  So I’m fairly confident they are classifying topic pages and using that as a guide for module selection.  There’s also the issue of ordering the modules.  The Microsoft topic page again is very nice, has Wikinvest at the top, and other financial news services.  The Britney Spears page has last.fm near the top.  That’s probably not all on-the-fly based on what the modules are returning.

Their RightHealth site, which is also nice, is another example where the modules are a bit more selective.  I actually like it more than their main Kosmix site; it’s cleaner and the modules are all appropriate.  But then it’s a smaller set of topics, and so easy to curate.

How do you handle searches?  If you search for “Microsoft Google” you don’t get all of those nice financial news sources.  It looks like they have a set of generic modules to fall back on if it’s a search as opposed to a topic.  That I think is an area for improvement.  At the least you could do string matches against your topics, and pull in modules that are indicated for either of them.  

SUMMARY

So what’s my best guess here?  What would I do?  Set up one or two dozen classifications for topics, and make preliminary assignments using hand-crafted rules referencing Wikipedia categories and entries.  I wouldn’t spend too much time doing it, or try anything tricky.  Perhaps just identify people, companies, and “other” as a broad backstop where more fine-grained assignments aren’t necessary.  Have topic-specific module settings and ordering, set initially by classification and global orderings.

Develop an estimate of relevance for each of the third-party modules, and monitor it for every call.  The topic classifications seed what modules are used initially, and their order.  When a topic page is requested, call out to all of the modules, give them a second or two to respond, and prune ones that seem to have low relevance.  Stop asking them for a while if relevance is consistently low.  Monitor the modules closely to see if they’re low performers across the board, and if so reassess how you’re setting their bar for relevance.  From there, user input drives it.  For searches, scan the query string for topic names and use that as a guide for including modules.

There are a few too many modules for my tastes on most topic pages, and the ones towards the bottom of the page can get kind of weird, like how-to’s for George Bush, but it’s also a fairly new site and no doubt will improve.


Guess How It’s Done: Kosmix

September 8, 2008

Kosmix is a nice site. If you enter a search term, it will stitch together content from a variety of other sites, including Daylife in some cases. They also provide links on the upper right of the page to other Kosmix pages. So how does it work? (Skip to the end where it says SUMMARY if you don’t want boring details)

I like to start by looking at a few obscure examples. Fringe cases are sometimes good ways to probe black boxes. If you search for the word “profligate“, you’ll get two links to other topic pages. The first is a link to the Flagitious topic page, which appears under “Redirects to Wiktionary”. The second is a link to the “Christie Park (Stadium)” topic page, which appears under “Football Venues in England”. The “Redirects to Wiktionary” and “Football Venues in England” are category names right out of Wikipedia. So the first thing we notice is that the topics are at least in some cases taken from Wikipedia categories. The Christie Park entry is similarly interesting. Google’s search engine turns up seven hits for ‘”Christie Park” AND profligate‘. That’s not much at all given the size of the index.

(NB: don’t take this as a reflection on Kosmix’s data quality, I don’t really expect good topic pages for “profligate” and “flagitious”, that’s not the point here, we’re just trying to figure out how they work.)

The difference in case for “Stadium” in the topic name is interesting. Wikipedia changed it from upper to lower case on June 20, 2007, but Kosmix still has it in upper case. So they pulled the topic names from before that change and have not updated them since.

So why does Christie Park turn up when I search for “profligate”? If we can answer that question, we’ve done a lot to expose Kosmix’s inner workings. Searching “Christie Park” would not single out the term “profligate”, nor would searching for “profligate” be strongly associated with “Christie Park”. The same is observed for the third-party search services they are pulling onto the topic pages. So the association between the two is a fairly low-probability event. They aren’t mentioned together in Wikipedia, even after scanning page histories.

The page has in the past linked out to other content. You might imagine scraping the Wikipedia entry for “profligate”, following external links, and using those to expose connections. Unfortunately I was unable to retrieve the content linked to by the page in the past, it is no longer available, so I can’t confirm this. So let’s look at the other link on the page, “flagitious”. Here the Wikipedia page is very small, and the “profligate” page has never mentioned it in its very brief history (it was deleted a while ago and redirected to Spendthrift, a poor choice). The Wikipedia page for “flagitious” furthermore has never listed “profligate” among its synonyms, nor linked to anything that might contain the word “profligate”.

WordNet, a standard linguistics toolbox that developers might use in constructing a site like Kosmix, doesn’t associate “profligate” and “flagitious”. However, if we look at Webster’s Revised Unabridged Dictionary, we get a hit. “Flagitious” appears in a semicolon-delimited list of similar words for one particular sense of profligate that seems archaic. Note that the dictionary is from 1913 and is unabridged. It also appears in a Word of the Day entry.

How did they arrive at the connection between “profligate” and “flagitious”? Plowing through Webster’s Unabridged and constructing word relationships is one possibility, but I doubt it. Any reasonably competent practitioner of the art would be aware of WordNet and use it instead, and the relationship is not contained in WordNet. They may be using Wikipedia for topic categories and names, but they’re going to other sources to construct connections.

For the “flagitious” topic page, there are a few other connections that look like synonyms pulled from a dictionary. Miscreat, a synonym, is filed under “Comedy Film”. The semantics there are wrong. Another “Redirects from Wikipedia” entry, for “depraved”. And then there’s Tim Robbins, what is he doing under “flagitious”?

Let’s also look at some of the skill sets behind Kosmix and how long they’ve been around, by digging up the LinkedIn profile for Gaurav Bhalotia. Three years old at least, about the same age as Daylife. Relational databases, information retrieval, nothing too exotic or too revealing there. Doing a Google search on the company turns up some bland statements about the power of categorization. The original company, CosmixCorp, seems to have operated some bots that crawled web pages. So they at one point did, and probably still do, go out and crawl pages. They don’t display search results using it though, they use Google and others, but I’ll bet they use it to mine relationships. If that’s all they want, they can also be more selective in who they scrape.

Let’s look at a few more active topics, instead of just looking at obscure words. The topic page for “scandalous” also has some interesting information. One of the related topic categories is “Year of Birth Missing (living People)”, taken from Wikipedia. The two people under this category are Celeste Bradley and Christina Dodd, who wrote several books with “Scandalous” in the title. Other related topics look good, and actually appear in the search results included below.

Another interesting piece of the puzzle is what happens when we search Kosmix for “George W Bush Flagitious“. The related topics are what one might obtain from matching substrings of the search phrase to topic names. However, searching for “George W Bush profligate” provides a richer set of related topics. This makes sense, profligate is far more common a word than flagitious. Putting George in with any reasonably common term gives you a match. If you put in “George W. Bush Christie Park”, you don’t get the aforementioned “Christie Park (Stadium)” topic, but if you put in the full topic name, with the words in any order, you will get the match. So it seems related topics are pulled using two different methods. First, the search phrases are matched to topic names. Second, deeper connections are pulled by referring to other sites.
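The first of those two methods, matching the search phrase against topic names, is consistent with something as simple as the sketch below: a topic matches when every word of its name appears somewhere in the query, in any order. The topic list and function names are made up for illustration, and this is only one matching rule that fits the behavior observed above.

```python
import re

# Hypothetical slice of a topic list; the names are taken from the examples above.
TOPICS = ["George W. Bush", "Christie Park (Stadium)", "Flagitious"]

def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_topics(query, topics=TOPICS):
    """Return topics whose name words all occur in the query, in any order."""
    query_tokens = tokens(query)
    return [topic for topic in topics if tokens(topic) <= query_tokens]

partial = match_topics("George W. Bush Christie Park")
full = match_topics("stadium park christie")
```

The first query matches only the George W. Bush topic, because “Stadium” is missing. The second matches “Christie Park (Stadium)” even though the words are scrambled.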

SUMMARY

So let’s pull this all together and take a guess at how it works. First, they plowed through Wikipedia, or downloaded the Wikimedia Foundation database, and yanked out topic names and categories. They assigned the first reasonable category to the topic, and use that as the display category for your related links. Second, they crawl web sites, but only for the purpose of exposing relationships. They don’t have to crawl nearly as much as Google. This explains the low-probability relationship between Christie Park and profligate: small number statistics. Rare terms like profligate and flagitious are going to be noisy, but you don’t have to operate a massive crawling infrastructure. The George W. Bush topic page looks fine.

Next, go through all of your topics, and associate them with documents in your search index. This could just mean searching for strings like “Tim Robbins” and marking the documents you hit. When a user searches for “flagitious”, or “George W. Bush profligate”, do a quick search on your relatively small index, and tally up related topics that appear. You’re not going to expose the search results, just display the relationships. Ship the same search query out to others like Yahoo and Google, and display their results.
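A toy version of that tag-then-tally idea is sketched below. The four documents are invented snippets, the substring search stands in for a real index, and the function names are mine; it only shows the shape of the guess, not how Kosmix does it.

```python
from collections import Counter

# Hypothetical mini index: document id -> text (made-up snippets).
DOCS = {
    1: "a flagitious villain, played by Tim Robbins",
    2: "Tim Robbins on playing flagitious characters",
    3: "George W. Bush speech transcript",
    4: "flagitious: a word of the day",
}
TOPICS = ["Tim Robbins", "George W. Bush"]

def tag_documents(docs, topics):
    """Offline step: mark each document with the topic names it mentions."""
    return {doc_id: [t for t in topics if t.lower() in text.lower()]
            for doc_id, text in docs.items()}

def related_topics(query, docs, tags, limit=5):
    """Query time: search the small index and tally the topics on the hits."""
    terms = query.lower().split()
    tally = Counter()
    for doc_id, text in docs.items():
        lowered = text.lower()
        if all(term in lowered for term in terms):
            tally.update(tags[doc_id])
    return [topic for topic, _count in tally.most_common(limit)]

TAGS = tag_documents(DOCS, TOPICS)
related = related_topics("flagitious", DOCS, TAGS)
```

With only a handful of documents mentioning a rare word, a single stray co-occurrence is enough to produce a related topic, which is the small-number effect described in the summary.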

That’s it, probably. It’s a nice system, I like their results. The categories and topics aren’t always right, but they usually make some sense, and give you a nice way to browse the results. At Daylife, we have to approach relationships differently, since we’re focused on news. With news, it’s more about exposing relationships in the news today or yesterday, whereas Kosmix and other search engines can take a longer view and not worry about restricting in the time domain.

Anyhow, this is a quick analysis of Kosmix and a guess from an outsider as to how they do it. Comments, additional observations, are welcome.

They also decide what third-party search services to call out to, and this varies by topic.  I’ll look into that in a subsequent post.