Twitter and the Hype Around Real-Time Search

May 18, 2009

I came across an article on CNN about new search engines that aspire to supplement Google.

And sites like Twitter are trying to capitalize on the warp-speed pace of online news today by offering real-time searches of online chatter — something Google’s computers have yet to replicate…  If you search Google news, the results will be recent, but not live. That’s where Twitter’s search comes in. It searches the site’s micro-blog posts by the second, allowing users to see what’s buzzing on the Web at any instant. 

Wow, Twitter/Summize can sort matches by date and return the most recent ones.  First of all, Google’s computers can replicate that trivially, and they already offer real-time search of my Gmail and chats.  Sorting by date, most recent first, is far simpler than ranking by relevance and removing duplicates, which is what Google News and Google’s web search do.  It’s actually trivial to implement.  The only hard part is scaling it up to Twitter size, and that’s hard only in the sense that it’s a lot of work; the path itself is well beaten.  Summize did it with a very small crew.

Google does offer the ability to sort by date.  However, it takes some time for them to acquire content.  Twitter has it a bit easier: everyone posts messages to a central system, and the total amount of text you need to index is fairly small.  Back in November they reported about 15 to 20 tweets per second.  Traffic as measured by compete.com has gone up by a factor of 6 since then, so I figure they’re doing around 10 million tweets per day.  Call it 100 bytes per tweet, and it’s around a gigabyte per day that needs to be indexed.  The inverted index is simple: for every word you keep a date-sorted list of occurrences, along with positions to handle quoted phrases.
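To make the data structure concrete, here is a minimal sketch of such an index in Python. It is an illustration, not Twitter’s or Summize’s actual code: the class name `RecentFirstIndex`, the whitespace tokenizer, and the assumption that tweet ids are handed out in increasing time order are all mine.

```python
from collections import defaultdict


class RecentFirstIndex:
    """Toy inverted index: word -> postings kept in arrival (date) order."""

    def __init__(self):
        # word -> list of (tweet_id, position); appended in time order
        self.postings = defaultdict(list)

    def add(self, tweet_id, text):
        # Assumption: tweet_id grows with time, so appending keeps each
        # posting list date-sorted without any explicit sort.
        for pos, word in enumerate(text.lower().split()):
            self.postings[word].append((tweet_id, pos))

    def search(self, phrase, limit=10):
        """Return ids of tweets containing the exact phrase, newest first."""
        words = phrase.lower().split()
        if not words:
            return []
        # Postings of the 2nd, 3rd, ... words as sets for fast lookup.
        later = [set(self.postings.get(w, ())) for w in words[1:]]
        hits, seen = [], set()
        # Walk the first word's postings backwards: most recent tweets first.
        for tweet_id, pos in reversed(self.postings.get(words[0], [])):
            if tweet_id in seen:
                continue
            # The phrase matches if word i+1 sits at position pos + i + 1.
            if all((tweet_id, pos + i + 1) in s for i, s in enumerate(later)):
                seen.add(tweet_id)
                hits.append(tweet_id)
                if len(hits) == limit:
                    break
        return hits
```

Because the newest postings sit at the end of each list, a query can stop as soon as it has enough hits, which is why the recent-first case is cheap compared with relevance ranking.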

They offer a few other options, like user and location, which you’ll also have to index.  You could fit a week’s worth of traffic into memory on a relatively cheap box.  Older data can be kept on disk.  You could store an index of all tweets on one machine, so managing data replication is simple: everyone gets a copy of everything.  At some point you’ll have to partition the data, which you can do by date, or randomly, with each partition executing the query in parallel.
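The random-partition option can be sketched as a scatter-gather query: every shard answers the same query for its slice of the data, and the newest-first partial results are merged. The function names, the `(timestamp, text)` tuple layout, and the use of a thread pool to stand in for separate machines are assumptions made for illustration only.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def partition(tweets, n_shards, seed=0):
    """Spread (timestamp, text) tuples randomly across n_shards lists."""
    rng = random.Random(seed)
    shards = [[] for _ in range(n_shards)]
    for tweet in tweets:
        shards[rng.randrange(n_shards)].append(tweet)
    return shards


def search_shard(shard, word, limit):
    """Up to `limit` matches from one shard, newest first."""
    hits = [(ts, text) for ts, text in shard if word in text.lower().split()]
    hits.sort(reverse=True)
    return hits[:limit]


def search_all(shards, word, limit=10):
    """Query every shard in parallel, then merge the newest-first results."""
    word = word.lower()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, word, limit), shards))
    # Each partial list is already sorted descending, so a k-way merge
    # yields the global newest-first order without re-sorting everything.
    return list(islice(heapq.merge(*partials, reverse=True), limit))
```

Each shard only needs to return `limit` results, since no shard can contribute more than that to the final answer. Partitioning by date instead would let most queries touch only the newest shard.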

It looks like Twitter has trimmed their search index aggressively.  At the moment, I see no search results prior to April 29.  On April 29, however, I start to see things again.  Note that by the time you read this, both results may return nothing.  Searching for “Daylife” in the Daylife feed also stops 18 days ago.  So they seem to have roughly 20 days in their index.  I’ll bet that works out to around 32GB for their index, and they’ve stuffed that much memory onto a single box which they’ve replicated.  Or perhaps a couple of boxes with less memory and the data spread randomly across them.  For older tweets, they haven’t bothered with a slower disk-based search option, or with the hardware costs necessary to stuff it all into memory.
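For anyone who wants to check the back-of-envelope numbers, here they are as a short Python calculation. Every input is one of the rough guesses from the text (15 to 20 tweets per second, a factor of 6 growth, 100 bytes per tweet, 20 days retained, a 32GB index); the overhead ratio at the end is simply what those guesses imply, not a measured figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tweets_per_day(tweets_per_second, growth_factor):
    """Scale a per-second rate by traffic growth and convert to a daily count."""
    return tweets_per_second * growth_factor * SECONDS_PER_DAY


# November's reported rate, scaled by the growth seen on compete.com.
low = tweets_per_day(15, 6)    # 7,776,000
high = tweets_per_day(20, 6)   # 10,368,000

# Round to 10 million tweets a day at an assumed 100 bytes each.
BYTES_PER_TWEET = 100
daily_gb = 10_000_000 * BYTES_PER_TWEET / 1e9   # 1.0 GB of raw text per day

# Roughly 20 days appear to be searchable.
raw_gb = 20 * daily_gb                          # 20.0 GB of raw text

# A 32GB index would then be about 1.6x the raw text size.
implied_overhead = 32 / raw_gb

print(f"{low:,} to {high:,} tweets/day")
print(f"{daily_gb} GB/day, {raw_gb} GB retained, {implied_overhead}x overhead")
```

A week of traffic at this rate is about 7 GB of raw text, which is what makes the "fits in memory on a cheap box" claim plausible.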

It makes sense: they’ve kept it simple to focus on the most important feature, quickly searching only the most recent tweets.  But real-time search is nothing tricky, nor is it anything new.  The only new thing here is the popularity of Twitter.


Twitter is Raking in Cash

March 10, 2009

… for the phone companies.  Since I worked at Lucent for a while, I thought I would fill in some gaps on the elusive business model.  

SMS is a brilliant product.  It generates significant revenue for phone companies at virtually no additional cost.  The text messages piggyback on the control channel when it isn’t being used.  It doesn’t require additional hardware, except for exposing a gateway to folks like Twitter.  That’s why it’s limited to 160 characters, and why delivery isn’t always immediate, though it’s still pretty damn fast.  For this, they charge $5 per month for unlimited access.  There are 2.5 million SMS users.  I’m sure not all of them are paying $5 per month, but do the math: it’s raking in cash.  However, the amount of work that has gone into developing our cellular network is immense.  Gigantic.

What about Twitter?  Who cares, it’s a simple application a couple of developers put together.  They should be happy if it pulls in enough money to support a dozen people.  They remind me of cup-holders.  They’re great, they hold your beverage while you’re driving, and genuinely improve my enjoyment of the car.  They’re also little pieces of plastic, and cost very little to manufacture.

Sure, Twitter has some “mindshare”.  They’re securing a monopoly on an internet service in a way that never would have happened 20 or 30 years ago, and did not happen with email.   Maybe they can extract value if they maintain their monopoly, but I hope they don’t get a major payday.  Too many brilliant people have worked hard to bring about SMS and the cellular network, and I believe in rewarding hard work.  The desire for a Twitter blockbuster business model is a legacy of the same misguided principle that justifies the enormous compensation given to a few parasitic investment bankers and traders.

Eric Schmidt is right: Twitter is a poor man’s email, and it isn’t even an open system like email.  It’s innovative and useful, for sure, but so were cup-holders in cars.