We discussed in the earlier post how Kosmix might identify related topics. Another of their nice features, and the one that in my opinion really makes them interesting, is that the third-party modules included on the page depend on the topic. Users can also indicate whether a particular module, for example photos from Daylife, is relevant to the topic. (Skip to the end where it says SUMMARY if you don’t want all the boring details.)
They already have a database of topics and their Wikipedia category. They also quite likely have the capability to assign modules on a per-topic basis, given that they are allowing users to tweak them.
How do they decide what modules to include? One could imagine setting up several initial module mixtures for political figures, movies, companies, and so on, and making a preliminary assignment to each topic. Since most of their topics are mapped to Wikipedia entries, Wikipedia’s categories would be an easy way to bootstrap module assignment, because they reduce the size of the problem greatly. The categories aren’t great, and they aren’t intended to be a taxonomy or hierarchy, but they could save some work.
You could also base it on attributes of the Wikipedia entry. Even if the categories in Wikipedia don’t exactly match up to what you want, the entries have a uniformity that makes other types of classification easy. People, for example, almost always have a birth date specified in the first paragraph, which is easy to pick out. It would also be easy to hand-craft rules to identify movies. A few dozen of these rules, and you’re probably looking good.
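To make the hand-crafted rule idea concrete, here is a minimal sketch in Python. The two regular expressions and the function name classify_entry are my own illustrative guesses at what such rules might look like; I have no idea what Kosmix actually uses, and real rules would need tuning against real entries.

```python
import re

# Hypothetical rules, for illustration only.
# A parenthetical such as "(born July 6, 1946)" suggests a person.
BIRTH_DATE = re.compile(r"\(born\s+[^)]*\b\d{4}\b[^)]*\)", re.IGNORECASE)
# A phrase such as "is a 1942 American romantic drama film" suggests a movie.
FILM = re.compile(r"\bis an? (?:\d{4} )?(?:[\w-]+ ){0,4}film\b", re.IGNORECASE)


def classify_entry(first_paragraph: str) -> str:
    """Assign a coarse class from the first paragraph of an entry."""
    # Check for a birth date first, so "film director" biographies
    # are classified as people and not as movies.
    if BIRTH_DATE.search(first_paragraph):
        return "person"
    if FILM.search(first_paragraph):
        return "movie"
    return "other"
```

Anything that falls through every rule lands in "other", which is the broad backstop I come back to in the summary.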
There are other ways of doing this of course, well into the region of diminishing returns. Since, as per the previous post, they likely have a set of documents associated with the topic to facilitate exposing related topics, you might consider using a machine learning technique to identify various types of topics. This could be trained using high-reliability assignments from Wikipedia, or through more manual means. For example, if you were able to identify 500 entries that you knew were musical bands, you could use that information to train a simple classifier and make additional assignments. But I would skip all of this machine-learning stuff. A more direct question is, does the module have anything interesting for the topic?
As discussed before, their index might be relatively small and focused on certain sites. Say IMDB is in their index. If a topic hits with IMDB and scores high, identify it as a movie or actor, or just have a separate profile of modules for anything that hits with IMDB. But most of the modules aren’t sites per se, but other search engines. You wouldn’t build up a search index for MeeHive. Assigning categories based on your index doesn’t seem like a great idea either, too noisy. So I don’t think they’re doing any of this.
Here is a simple method. Look at the results coming out of the third-party searches, and use that to determine whether to include the module. If you search YouTube for “george bush flagitious” and come up with nothing, or with low-quality video (old, few views, low ratings), don’t include the module. The HTML being served up by their site already has the results baked into it; they aren’t scripts or widgets that populate separately. So you can throw 50 modules onto the George W. Bush page, give each of them 1 second to respond, and display whatever gets back to you in time and seems to be of high quality. Cache the page for a few minutes, so that popular ones load quickly. It also gives you a way of populating modules intelligently even if someone searches for “george bush flatigious”.
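The fan-out with a deadline is simple to sketch with Python’s standard thread pool. Everything here is an assumption on my part: fetch_modules is a made-up name, each module is modeled as a plain callable that takes the query and returns a list of results, and the quality check is reduced to “returned something”. Page caching is left out.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fetch_modules(query, modules, timeout=1.0):
    """Call every module in parallel and keep whatever answers in time.

    `modules` maps a module name to a callable(query) -> list of results.
    Both the mapping and the callables are hypothetical stand-ins for
    real third-party search calls.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(modules)))
    futures = {pool.submit(fn, query): name for name, fn in modules.items()}

    # Block until every call finishes or the deadline passes.
    done, _not_done = wait(futures, timeout=timeout)

    # Do not wait for stragglers; their results are simply dropped.
    pool.shutdown(wait=False)

    results = {}
    for future in done:
        # Skip modules that raised an error or returned nothing.
        if future.exception() is None and future.result():
            results[futures[future]] = future.result()
    return results
```

Slow services keep running in the background after the deadline, so a production version would want real request timeouts on the HTTP calls themselves.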
That sounds easy, but it isn’t really. Not all services will report a relevance score. YouTube is easy: they’ll report number of views, ratings, and age. For other services, figuring out whether a module is returning good stuff might be difficult. You can at least scan the results for the terms you put in. If you can’t get a handle on relevance, you can push those modules further down on the page. The George Bush topic page has a module for How-To videos at the bottom. Not a huge deal, maybe someday users will help you clean it up. You also don’t want the third-party services to start hating you. The cache will help there, but you might also want to record how often the results seem relevant, and use that to determine whether to even call the service in the first place. If How-To videos for Bush seem lousy, stop asking for them, but give them another chance in a day or two.
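Here is a rough sketch of both halves of that paragraph: a crude relevance estimate based on scanning for the query terms, and a record of recent misses that pauses a module for a topic. The names term_overlap and ModuleHistory, the 0.5 threshold, the three-miss limit, and the one-day pause are all invented for illustration.

```python
import time


def term_overlap(query, result_text):
    """Fraction of query terms that appear as whole words in the result."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(result_text.lower().split())
    return len(terms & words) / len(terms)


class ModuleHistory:
    """Tracks low-relevance responses per (topic, module) pair."""

    def __init__(self, min_score=0.5, max_misses=3, retry_after=86400.0):
        self.min_score = min_score      # below this, a response is a miss
        self.max_misses = max_misses    # consecutive misses before pausing
        self.retry_after = retry_after  # pause length in seconds (one day)
        self.misses = {}
        self.paused_until = {}

    def should_call(self, topic, module, now=None):
        now = time.time() if now is None else now
        return now >= self.paused_until.get((topic, module), 0.0)

    def record(self, topic, module, score, now=None):
        now = time.time() if now is None else now
        key = (topic, module)
        if score >= self.min_score:
            self.misses[key] = 0
            return
        self.misses[key] = self.misses.get(key, 0) + 1
        if self.misses[key] >= self.max_misses:
            # Stop asking for a while, then give the module another chance.
            self.paused_until[key] = now + self.retry_after
            self.misses[key] = 0
```

Whole-word matching is about as naive as it gets (punctuation alone will break it), but it is enough to show where a better relevance estimate would plug in.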
One other important data point: the Microsoft topic page has a couple of financial news outlets on it. These outlets will also surely serve up hits for George Bush, and just about anything else, but they seem to be isolated to company topics. So I’m fairly confident they are classifying topic pages and using that as a guide for module selection. There’s also the issue of ordering the modules. The Microsoft topic page again is very nice; it has Wikinvest at the top, along with other financial news services. The Britney Spears page has last.fm near the top. That’s probably not all on-the-fly based on what the modules are returning.
Their RightHealth site, which is also nice, is another example where the modules are a bit more selective. I actually like it more than their main Kosmix site; it’s cleaner and the modules are all appropriate. But then it’s a smaller set of topics, and so easy to curate.
How do you handle searches? If you search for “Microsoft Google” you don’t get all of those nice financial news sources. It looks like they have a set of generic modules to fall back on if it’s a search as opposed to a topic. That, I think, is an area for improvement. At the least you could do string matches against your topics, and pull in modules that are indicated for any topic that matches.
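The string-matching suggestion could look something like the sketch below. The function name modules_for_query and the shape of the topic table are assumptions; the module lists in the usage note are placeholders, not what Kosmix actually shows for those topics.

```python
def modules_for_query(query, topic_modules, fallback):
    """Union the modules of every known topic named in the query.

    `topic_modules` maps a topic name to its ordered module list.
    `fallback` is the generic module list used when nothing matches.
    """
    # Normalize whitespace and pad with spaces so that topic names
    # only match on whole-word boundaries.
    padded = " " + " ".join(query.lower().split()) + " "
    chosen = []
    for topic, modules in topic_modules.items():
        if " " + topic.lower() + " " in padded:
            for module in modules:
                if module not in chosen:
                    chosen.append(module)
    return chosen or list(fallback)
```

With a table like {"Microsoft": ["Wikinvest", "financial news"], "Google": ["financial news", "blog search"]}, the query “Microsoft Google” would pick up the financial modules from both topics, and a query that names no known topic would get the generic fallback set.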
SUMMARY
So what’s my best guess here? What would I do? Set up one or two dozen classifications for topics, and make preliminary assignments using hand-crafted rules referencing Wikipedia categories and entries. I wouldn’t spend too much time doing it, or try anything tricky. Perhaps just identify people, companies, and “other” as a broad backstop where more fine-grained assignments aren’t necessary. Have topic-specific module settings and ordering, set initially by classification and global orderings.
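As a sketch of what “set initially by classification and global orderings” might mean in practice: keep a short ordered list per class, append a global default list, and drop duplicates and anything hidden for the topic. The tables and the name initial_modules are hypothetical; only Wikinvest and last.fm are module names taken from the pages discussed above.

```python
# Hypothetical default tables; the real lists would be much longer.
GLOBAL_ORDER = ["web search", "YouTube"]
CLASS_ORDER = {
    "company": ["Wikinvest", "financial news"],
    "musician": ["last.fm"],
}


def initial_modules(topic_class, hidden=()):
    """Class-specific modules first, then the global defaults.

    `hidden` holds modules switched off for this topic, for example
    because users flagged them as irrelevant.
    """
    ordered = CLASS_ORDER.get(topic_class, []) + GLOBAL_ORDER
    result = []
    for module in ordered:
        if module not in result and module not in hidden:
            result.append(module)
    return result
```

The per-topic settings then start as a copy of this list and drift as relevance measurements and user feedback come in.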
Develop an estimate of relevance for each of the third-party modules, and monitor it for every call. The topic classifications seed what modules are used initially, and their order. When a topic page is requested, call out to all of the modules, give them a second or two to respond, and prune ones that seem to have low relevance. Stop asking them for a while if relevance is consistently low. Monitor the modules closely to see if they’re low performers across the board, and if so reassess how you’re setting their bar for relevance. From there, user input drives it. For searches, scan the query string for topic names and use that as a guide for including modules.
There are a few too many modules for my tastes on most topic pages, and the ones towards the bottom of the page can get kind of weird, like how-to’s for George Bush, but it’s also a fairly new site and no doubt will improve.