I came across an article on CNN about new search engines that aspire to supplement Google.
And sites like Twitter are trying to capitalize on the warp-speed pace of online news today by offering real-time searches of online chatter — something Google’s computers have yet to replicate… If you search Google news, the results will be recent, but not live. That’s where Twitter’s search comes in. It searches the site’s micro-blog posts by the second, allowing users to see what’s buzzing on the Web at any instant.
Wow, Twitter/Summize can sort matches by date and return the most recent ones. First of all, Google’s computers can replicate that trivially, and they already offer real-time search of my Gmail and chats. Sorting by date, most recent first, is far simpler than ranking by relevance and removing duplicates, which is what Google News and Google’s web search do. It’s actually trivial to implement. The only hard part is scaling it up to Twitter size, and that’s only hard in the sense that it’s a lot of work, since it’s a well-beaten path. Summize did it with a very small crew.
Google does offer the ability to sort by date. However, it takes some time for them to acquire content. Twitter has it a bit easier: everyone posts messages to a central system, and the total amount of text you need to index is fairly small. Back in November they reported about 15 to 20 tweets per second. Traffic via compete.com has gone up by a factor of 6 since then, so I figure they’re doing around 10 million tweets per day. Call it 100 bytes per tweet, and it’s around a gigabyte per day that needs to be indexed. The inverted index is simple: for every word you keep a date-sorted list of occurrences, along with the position of each one to handle quoted phrases.
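To make the inverted index concrete, here is a minimal sketch in Python. It is a toy and not how Twitter or Summize actually built theirs: the `TweetIndex` class and its method names are made up for illustration, tokenizing is just a whitespace split, and a tweet's id doubles as its arrival order. Each word maps to a list of (tweet id, position) pairs that stays date-sorted because tweets are appended as they arrive, and the positions are what make quoted-phrase matching possible.

```python
from collections import defaultdict


class TweetIndex:
    """Toy in-memory inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        # word -> list of (tweet_id, position). Tweets are appended in
        # arrival order, so every list is already sorted oldest to newest.
        self.postings = defaultdict(list)
        self.tweets = []  # tweet_id is the index into this list

    def add(self, text):
        tweet_id = len(self.tweets)
        self.tweets.append(text)
        for position, word in enumerate(text.lower().split()):
            self.postings[word].append((tweet_id, position))
        return tweet_id

    def search(self, phrase, limit=10):
        """Return ids of tweets containing the exact phrase, newest first."""
        words = phrase.lower().split()
        if not words:
            return []
        # For each word after the first, a set of (tweet_id, position)
        # pairs so adjacency can be checked with a constant-time lookup.
        later = [set(self.postings.get(w, ())) for w in words[1:]]
        results = []
        # Walk the first word's postings from newest to oldest.
        for tweet_id, pos in reversed(self.postings.get(words[0], ())):
            if results and results[-1] == tweet_id:
                continue  # this tweet already matched at another position
            if all((tweet_id, pos + i + 1) in s for i, s in enumerate(later)):
                results.append(tweet_id)
                if len(results) == limit:
                    break
        return results


index = TweetIndex()
index.add("real time search is easy")
index.add("search is not real time")
index.add("Real time search again")
print(index.search("real time search"))
```

Because the newest postings sit at the end of each list, a "most recent first" query only has to read backwards until it has enough hits, which is why sorting by date is so much cheaper than ranking by relevance.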
They offer a few other options, like user and location, which you’ll also have to index. You could fit a week’s worth of traffic into memory on a relatively cheap box. Older data can be kept on disk. You could store an index of all tweets on one machine, so managing data replication is simple: everyone gets a copy of everything. At some point you’ll have to partition the data, which you can do by date, or randomly, with every partition executing each query in parallel.
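The random-partition option can be sketched as follows. This is only an illustration under loud assumptions: `Shard`, `add_tweet`, and `search_all` are hypothetical names, each shard does a plain linear scan where a real one would use an inverted index, and a thread pool stands in for what would really be separate machines. The point is the shape of the query: every shard returns its own newest matches, and the results are merged newest first.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One partition holding (timestamp, text) pairs (hypothetical)."""

    def __init__(self):
        self.tweets = []

    def add(self, timestamp, text):
        self.tweets.append((timestamp, text))

    def search(self, word, limit):
        # Linear scan as a stand-in for a real per-shard index.
        word = word.lower()
        hits = [t for t in self.tweets if word in t[1].lower().split()]
        hits.sort(reverse=True)  # newest first
        return hits[:limit]


def add_tweet(shards, timestamp, text, rng=random):
    # Random placement keeps the shards roughly the same size.
    rng.choice(shards).add(timestamp, text)


def search_all(shards, word, limit=10):
    # Ask every shard at once; threads stand in for separate boxes.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: s.search(word, limit), shards))
    # Each partial list is sorted newest first, so a k-way merge with
    # reverse=True yields the global newest-first order.
    return list(heapq.merge(*partials, reverse=True))[:limit]


shards = [Shard() for _ in range(3)]
for ts in range(30):
    add_tweet(shards, ts, f"tweet {ts} about search")
print(search_all(shards, "search", limit=5))
```

Each shard only needs to return `limit` results, since the global top `limit` can never include more than that from any one partition. Partitioning by date instead would let a "most recent" query hit a single shard, at the cost of that shard taking all of the write traffic.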
It looks like Twitter has trimmed their search index aggressively. At the moment, I see no search results prior to April 29; from April 29 onward I start to see things again. Note that by the time you read this, both searches may return nothing. Searching for “Daylife” in the Daylife feed also stops 18 days ago. So they seem to have about 20 days in their index. I’ll bet that works out to around 32GB for their index, and they’ve stuffed that much memory onto a single box which they’ve replicated. Or perhaps they use a couple of boxes with less memory and the data spread randomly across them. For older tweets, they haven’t bothered with a slower disk-based search option, or with the hardware costs necessary to stuff it all into memory.
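The back-of-the-envelope numbers can be reproduced in a few lines. This is just the arithmetic from the figures quoted in this post (15 to 20 tweets per second, a 6x traffic increase, 100 bytes per tweet, a 20-day window); the function names are made up, and the inputs are rough estimates, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tweets_per_day(base_rate_per_sec, growth_factor):
    """Daily tweet volume from an older per-second rate and a growth factor."""
    return base_rate_per_sec * growth_factor * SECONDS_PER_DAY


def raw_text_gb(tweets, bytes_per_tweet=100):
    """Raw text size in gigabytes (1 GB = 10**9 bytes)."""
    return tweets * bytes_per_tweet / 1e9


low = tweets_per_day(15, 6)          # 7,776,000 tweets per day
high = tweets_per_day(20, 6)         # 10,368,000 tweets per day
daily_gb = raw_text_gb(10_000_000)   # 1.0 GB per day at the rounded figure
window_gb = daily_gb * 20            # 20.0 GB of raw text for 20 days

print(low, high, daily_gb, window_gb)
```

Twenty days of raw text comes to about 20GB, so a 32GB guess leaves some headroom for the index itself (word positions, plus user and location fields). How much overhead a real index carries is a guess on my part.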
It makes sense: they’ve kept it simple to focus on the most important feature, quickly searching only the most recent tweets. But real-time search is nothing tricky, nor is it anything new. The only new thing here is the popularity of Twitter.
Posted by Ken Ellis