Kosmix is a nice site. If you enter a search term, it will stitch together content from a variety of other sites, including Daylife in some cases. They also provide links on the upper right of the page to other Kosmix pages. So how does it work? (Skip to the end where it says SUMMARY if you don’t want boring details)
I like to start by looking at a few obscure examples. Fringe cases are sometimes good ways to probe black boxes. If you search for the word “profligate“, you’ll get two links to other topic pages. The first is a link to the Flagitious topic page, which appears under “Redirects to Wiktionary”. The second is a link to the “Christie Park (Stadium)” topic page, which appears under “Football Venues in England”. The “Redirects to Wiktionary” and “Football Venues in England” are category names right out of Wikipedia. So the first thing we notice is that the topics are at least in some cases taken from Wikipedia categories. The Christie Park entry is similarly interesting. Google’s search engine turns up seven hits for ‘”Christie Park” AND profligate‘. That’s not much at all given the size of the index.
(NB: don’t take this as a reflection on Kosmix’s data quality. I don’t really expect good topic pages for “profligate” and “flagitious”, and that’s not the point here; we’re just trying to figure out how they work.)
The difference in case for “Stadium” in the topic name is interesting. Wikipedia changed it from upper to lower case on June 20, 2007, but Kosmix still has it in upper case. So they pulled the topic names before that change and have not updated them since.
So why does Christie Park turn up when I search for “profligate”? If we can answer that question, we’ve done a lot to expose Kosmix’s inner workings. Searching “Christie Park” would not single out the term “profligate”, nor would searching for “profligate” be strongly associated with “Christie Park”. The same is observed for the third-party search services they are pulling onto the topic pages. So the association between the two is a fairly low-probability event. They aren’t mentioned together in Wikipedia, even after scanning page histories.
The page has in the past linked out to other content. You might imagine scraping the Wikipedia entry for “profligate”, following external links, and using those to expose connections. Unfortunately I was unable to retrieve the content the page used to link to, since it is no longer available, so I can’t confirm this. So let’s look at the other link on the page, “flagitious”. Here the Wikipedia page is very small, and “profligate” has never appeared in its very brief history (it was deleted a while ago and redirected to Spendthrift, a poor choice). The Wikipedia page for “flagitious” furthermore has never listed “profligate” among its synonyms, nor linked to anything that might contain the word “profligate”.
WordNet, a standard linguistics toolbox that developers might use in constructing a site like Kosmix, doesn’t associate “profligate” and “flagitious”. However, if we look at Webster’s Revised Unabridged Dictionary, we get a hit. “Flagitious” appears in a semicolon-delimited list of similar words for one particular sense of “profligate” that seems archaic. Note that the dictionary is from 1913 and is unabridged. It also appears in a Word of the Day entry.
How did they arrive at the connection between “profligate” and “flagitious”? Plowing through Webster’s Unabridged and constructing word relationships is one possibility, but I doubt it. Any reasonably competent practitioner of the art would be aware of WordNet and use it instead, and the relationship is not contained in WordNet. They may be using Wikipedia for topic categories and names, but they’re going to other sources to construct connections.
For the “flagitious” topic page, there are a few other connections that look like synonyms pulled from a dictionary. Miscreat, a synonym, is filed under “Comedy Film”. The semantics there are wrong. There is another “Redirects from Wikipedia” entry, for “depraved”. And then there’s Tim Robbins: what is he doing under “flagitious”?
Let’s also look at some of the skill sets behind Kosmix and how long they’ve been around, by digging up the LinkedIn profile for Gaurav Bhalotia. The company is at least three years old, about the same age as Daylife. Relational databases, information retrieval, nothing too exotic or too revealing there. Doing a Google search on the company turns up some bland statements about the power of categorization. The original company, CosmixCorp, seems to have operated some bots that crawled web pages. So they at one point did, and probably still do, go out and crawl pages. They don’t display search results from their own crawl though; they use Google and others, but I’ll bet they use the crawl to mine relationships. If that’s all they want, they can also be more selective in who they scrape.
Let’s look at a few more active topics, instead of just looking at obscure words. The topic page for “scandalous” also has some interesting information. One of the related topic categories is “Year of Birth Missing (living People)”, taken from Wikipedia. The two people under this category are Celeste Bradley and Christina Dodd, who wrote several books with “Scandalous” in the title. Other related topics look good, and actually appear in the search results included below.
Another interesting piece of the puzzle is what happens when we search Kosmix for “George W Bush Flagitious”. The related topics are what one might obtain from matching substrings of the search phrase to topic names. However, searching for “George W Bush profligate” provides a richer set of related topics. This makes sense: profligate is a far more common word than flagitious. Putting George in with any reasonably common term gives you a match. If you put in “George W. Bush Christie Park”, you don’t get the aforementioned “Christie Park (Stadium)” topic, but if you put in the full topic name, with the words in any order, you will get the match. So it seems related topics are pulled using two different methods. First, the search phrases are matched to topic names. Second, deeper connections are pulled by referring to other sites.
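To make the first method concrete, here is a rough Python sketch of order-insensitive matching between a search phrase and topic names. It is only my guess at behavior consistent with what I observed, not Kosmix's actual code; the function names and the tiny TOPICS list are made up for illustration.

```python
import re

def tokens(text):
    """Lowercase a string and return its alphanumeric words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_topics(query, topic_names):
    """Return the topic names whose every word appears in the query, in any order."""
    query_words = tokens(query)
    matches = []
    for name in topic_names:
        name_words = tokens(name)
        # Skip names with no usable words; otherwise require a full subset match.
        if name_words and name_words <= query_words:
            matches.append(name)
    return matches

# Hypothetical sample of topic names, standing in for the real topic list.
TOPICS = ["George W. Bush", "Christie Park (Stadium)", "Flagitious"]

print(match_topics("George W. Bush Christie Park", TOPICS))
print(match_topics("stadium park christie", TOPICS))
```

Under this assumption, “George W. Bush Christie Park” only matches the George W. Bush topic because “stadium” is missing from the query, while the full topic name in any word order does match.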
SUMMARY
So let’s pull this all together and take a guess at how it works. First, they plowed through Wikipedia, or downloaded the Wikimedia Foundation database, and yanked out topic names and categories. They assigned the first reasonable category to each topic, and use that as the display category for your related links. Second, they crawl web sites, but only for the purpose of exposing relationships. They don’t have to crawl nearly as much as Google. This explains the low-probability relationship between Christie Park and profligate: small-number statistics. Rare terms like profligate and flagitious are going to be noisy, but you don’t have to operate a massive crawling infrastructure. The George W. Bush topic page looks fine.
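The category step could be as simple as the following Python sketch. It pulls [[Category:...]] links out of a page's wikitext and takes the first one as the display category. The sample wikitext is invented for illustration, and "first category wins" is my assumption about what "first reasonable category" might mean, not something I can confirm.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] links in wikitext.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")

def page_categories(wikitext):
    """Return the category names in the order they appear on the page."""
    return [name.strip() for name in CATEGORY_LINK.findall(wikitext)]

def display_category(wikitext):
    """Pick the first category as the display category, or None if there is none."""
    categories = page_categories(wikitext)
    return categories[0] if categories else None

# Invented sample wikitext; a real page would come from a database dump.
SAMPLE = """'''Christie Park''' is a football stadium.
[[Category:Football Venues in England]]
[[Category:Sports venues|Christie Park]]
"""

print(display_category(SAMPLE))
```

A real version would presumably filter out some categories before picking one, but whatever filter they use evidently lets maintenance categories like “Redirects to Wiktionary” through.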
Next, go through all of your topics, and associate them with documents in your search index. This could just mean searching for strings like “Tim Robbins” and marking the documents you hit. When a user searches for “flagitious”, or “George W. Bush profligate”, do a quick search on your relatively small index, and tally up the related topics that appear. You’re not going to expose the search results, just display the relationships. Ship the same search query out to others like Yahoo and Google, and display their results.
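Here is a toy Python version of that tag-and-tally idea, again just a sketch of my guess. The documents, topic list, and function names are all hypothetical, and plain substring matching stands in for whatever real index they run.

```python
from collections import Counter

def tag_documents(documents, topic_names):
    """Map each document id to the set of topics whose name appears in its text."""
    tagged = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        tagged[doc_id] = {t for t in topic_names if t.lower() in lowered}
    return tagged

def related_topics(query, documents, tagged, limit=5):
    """Tally topics over the documents that contain every query word as a substring."""
    words = query.lower().split()
    counts = Counter()
    for doc_id, text in documents.items():
        lowered = text.lower()
        if all(word in lowered for word in words):
            counts.update(tagged[doc_id])
    return counts.most_common(limit)

# Hypothetical mini-index standing in for a small crawl.
DOCS = {
    1: "A profligate season left the club unable to maintain Christie Park.",
    2: "Critics called the budget profligate.",
    3: "Christie Park hosted the match.",
}
TOPICS = ["Christie Park", "Tim Robbins"]

tagged = tag_documents(DOCS, TOPICS)
print(related_topics("profligate", DOCS, tagged))
```

With an index this small, a single stray document is enough to tie “profligate” to Christie Park, which is the small-number effect described above.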
That’s it, probably. It’s a nice system, and I like their results. The categories and topics aren’t always right, but they usually make some sense, and give you a nice way to browse the results. At Daylife, we have to approach relationships differently, since we’re focused on news. With news, it’s more about exposing relationships in the news today or yesterday, whereas Kosmix and other search engines can take a longer view and not worry about restricting in the time domain.
Anyhow, this is a quick analysis of Kosmix and a guess from an outsider as to how they do it. Comments and additional observations are welcome.
They also decide what third-party search services to call out to, and this varies by topic. I’ll look into that in a subsequent post.