A Peek at the Kosmix Backend

June 26, 2009

As you probably know I’m always looking at how companies in our space do their stuff.  A few days ago Ted Dziuba provided some interesting information on the Kosmix backend, which I found through Josh Young’s nice “What I’m Reading” feed:

…Kosmix… wrote its own data store in C++.  It’s basically a clone of Google’s GFS…

After I edited out the snark and obscenities, that was all that was left of his post.  The rest of it, calling them a search engine, would only rankle the Kosmix co-founders.  Dziuba is an interesting read, and he has a good sense of humor.  But he’s never going to make it as a pro linebacker.  He’s just too small and weak.  He needs to spend less time at the keyboard and more at the gym, and get his hands on some steroids.


Cheap EVRI Knockoff on the Huffington Post

June 25, 2009

Is it just me, or does the connections widget in the left column of this page look suspiciously like EVRI’s widget?  When I saw it, I thought it was EVRI’s work.  The colored circles, the topic name in a box on top of the circle, lines between them, the size ratio, the line color, all the same.  But there’s no Evri branding, and it doesn’t make a call to the Evri site to fetch data as the WashPo installation does.  So I don’t think it’s EVRI.  They’re using OpenCalais for local news, so it could be their tagging, or it could be human tagging with a simple database to power the widget.  Who knows, but it would be better if they were a bit more distinctive in the presentation.  It is confusingly similar to EVRI’s widget.

Judge for yourself.  Here’s a screen shot of the HuffPo widget:

huffpo-evri-widget

And here’s the Evri widget as it appears on their site:

evri-widget

Samuel Clay, one of the crew here at Daylife, says EVRI uses SVG/Raphael, and the HuffPo is just images, so it is a different technology and almost certainly not EVRI’s work.


Harvard BSchool Twitter Study and Bad Statistics

June 5, 2009

Harvard Business School put out some interesting statistics back on June 1.  They raise some interesting questions, but the statistics have some issues, so they should be taken with a grain of salt.

The first is their comparison with Wikipedia editors.  They state that:

…the pattern of contributions on Twitter is more concentrated among the few top users than is the case on Wikipedia, even though Wikipedia is clearly not a communications tool. This implies that Twitter’s resembles more of a one-way, one-to-many publishing service more than a two-way, peer-to-peer communication network.

However, the comparison is based only on Wikipedia editors, not those who use Wikipedia.  Just as their study indicates many on Twitter merely follow, many Wikipedia users only read articles and do not contribute as editors.  A fair comparison would exclude some fraction of Twitter users.

The gender biases are also interesting.  However, they limit the gender study to “strongly gendered names”.  Furthermore, as a coworker of mine pointed out, women may be less likely to expose their real name online.  So their gender sampling method may be biased.  Would it invalidate their conclusions?  Perhaps not; that would require a bias at least on the order of the gender bias they measure (10% or so).  But they don’t indicate what fraction of users could be assigned a gender, nor did they investigate what fraction of the population can be gender-typed using their list.  The latter would be easy to do with census data.

Studies like this are never perfect, but I generally expect those that conduct studies to call out possible biases in their sampling, and report more fully on the data and their methodology.  I would discard their conclusions about Twitter as a one-way publishing platform as unsupported by their data, and would take their gender bias numbers as a rough estimate with potentially significant biases.


Newspapers Considering the ASCAP Model For Online Payments

June 4, 2009

This morning the WSJ reported that at a private meeting of newspaper executives last week, one of the models they considered for extracting revenue online was ASCAP, the American Society of Composers, Authors, and Publishers.  (Page A11, or online here).  That’s great, but there are a few technical considerations, and you can do a lot better than ASCAP.  So I thought I would weigh in with some concrete suggestions about how such a system might be implemented so that it is accurate and fair.

First, some background.  I was the Director of Research at MediaSentry, where we provided music companies with intelligence about online piracy, attempted to interdict or frustrate would-be downloaders, and collected evidence for civil actions.  So I know quite a bit about monitoring and protecting online content.  Also, through a composer I’m acquainted with, I’ve heard a good deal about the drawbacks of ASCAP. 

ASCAP pays composers according to a formula based on the amount of play-time for their songs.  That play-time includes things such as bars, and just about anything except humming it in the shower.  On the other end of the cash flow, bars, radio stations, television programs, all pay into ASCAP based on estimates of the size of the audience.  The tricky part is figuring out how much each composer gets paid. That requires a lot of sampling, since your neighborhood bar doesn’t report what songs they’re playing.  That just requires listening to some radio stations and spending time in bars, or getting a few of them to self-report.  The sampling is such that small composers tend to get nothing, because they are below ASCAP’s threshold of statistical significance.  

How do you build a system that is accurate and fair?  How would sampling work for online news, and how would you price membership for participants?  You just can’t get reliable pageview information.  You could ask for it, but it would have to be taken on good-faith, and that’s no way to determine payments.  You can’t even get reliable traffic estimates for a site.  Alexa and Compete.com aren’t accurate enough, and won’t distinguish between a site’s own content and content they’ve taken from elsewhere.  There’s the Nielsen Ratings system, but read the criticisms on their Wikipedia entry before you seriously consider replicating their sampling method.  

One practical system would be to put a callback on every article with an identifier for the site.  That callback reports to the newspaper’s version of ASCAP that the article has been viewed.  It’s basically the same as how Google Analytics works.  It can be defeated client-side by simply blocking the request, but there is no incentive for consumers to do so: it only impacts payments between other parties, and does not interfere with their web experience.  Sampling would not be required, as nearly complete information is available.
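
A minimal sketch of such a callback, in Python.  The collector address, parameter names, and identifiers are all hypothetical; the only point is that each view carries a site identifier and an article identifier that the collector can recover.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collector endpoint; no real service is implied.
COLLECTOR = "https://collector.example.org/view"

def callback_url(site_id, article_id, channel="web"):
    """Build the URL a page would request when an article is viewed."""
    query = urlencode({"site": site_id, "article": article_id, "channel": channel})
    return f"{COLLECTOR}?{query}"

def parse_callback(url):
    """Collector side: recover the identifiers from a received callback."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()}

url = callback_url("example-times", "2009/06/04/ascap-model")
print(url)
print(parse_callback(url))
```

On a real page the request would be fired from a small script or image tag, much as analytics packages do.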

There are a few loopholes.  How would you count full-text RSS feeds?  Triggering the same callback method for every feed request would overcount, and there is no mechanism for triggering callbacks in the reader.  There are many different types of RSS consumers.  Here I think you would make a different callback and discount RSS ingestion based on some agreed-upon factor, say 10%.  One would need to run a study to determine, given a certain number of RSS feed requests, how many result in reading or browsing an article.  What about other devices?  The Amazon Kindle?  Here again, you need a separate type of callback, with its own factor.  APIs?  Easy, just do a callback server-side whenever an article is fed out, but this is less easy to verify.
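
The per-channel factors could be applied when callbacks are tallied.  This is only a sketch: the 10% RSS weight is the figure floated above, while the Kindle and API weights are placeholders that the study mentioned above would have to supply.

```python
from collections import Counter

# Illustrative weights only. RSS uses the "say 10%" figure from the text;
# the Kindle and API values are assumptions, not measured numbers.
CHANNEL_WEIGHT = {"web": 1.0, "rss": 0.10, "kindle": 0.5, "api": 1.0}

def weighted_views(callbacks):
    """Sum discounted views per article from (article_id, channel) pairs."""
    totals = Counter()
    for article_id, channel in callbacks:
        totals[article_id] += CHANNEL_WEIGHT[channel]
    return dict(totals)

events = [("story-1", "web"), ("story-1", "rss"), ("story-2", "rss")]
print(weighted_views(events))
```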

The newspaper version of ASCAP also needs verifiability.  This is relatively easy.  Sites can be crawled, and the presence of a client-side callback function verified.  In the same way, you can look at the source of any web page and determine whether they are using Google Analytics: the JavaScript code is in plain view.  For APIs and RSS feeds, it’s not as easy to verify.  However, one can do a simple statistical analysis using a method similar to how lock-in amplifiers are used to extract signals in the presence of even very high noise (other users).  Make RSS or API calls yourself with some fixed modulation, at a level far below their overall usage.  That it can be so far below is the magic of the lock-in amp.  Confirm the presence of the same signal in the callbacks you receive.  It’s relatively easy to implement, and the statistics are bomb-proof.  Do it randomly, and relatively infrequently, to verify compliance.  For full-text RSS and API methods of delivery, you’ll need to warn participants to make the server-side callback within a second of the request, and not cache that request.  Yes, it’s being a bit anal, but just the presence of a verification mechanism will keep everyone in line.
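
Here is a toy simulation of that lock-in check.  Everything in it is assumed for illustration: the background traffic level, the noise, the square-wave schedule, and the five probe requests per active time bin.  The probes add only about a quarter of a percent to total traffic, yet the demodulated estimate separates a site that reports them from one that does not.

```python
import math
import random

def reference(n_bins, period=10):
    """Square-wave reference: +1 for the first half of each period, else -1."""
    return [1 if (i % period) < period // 2 else -1 for i in range(n_bins)]

def simulate_callbacks(ref, probe_per_bin, compliant, rng):
    """Callbacks per time bin: noisy background plus (if reported) our probes."""
    counts = []
    for i, r in enumerate(ref):
        # Background from other users: a slow drift plus random noise.
        drift = 200 * math.sin(2 * math.pi * i / len(ref))
        background = 1000 + drift + rng.gauss(0, 30)
        probes = probe_per_bin if (r > 0 and compliant) else 0
        counts.append(background + probes)
    return counts

def lock_in_estimate(counts, ref):
    """Demodulate: mean count while probing minus mean count while idle."""
    on = [c for c, r in zip(counts, ref) if r > 0]
    off = [c for c, r in zip(counts, ref) if r < 0]
    return sum(on) / len(on) - sum(off) / len(off)

rng = random.Random(2009)
ref = reference(20000)
reporting = lock_in_estimate(simulate_callbacks(ref, 5, True, rng), ref)
silent = lock_in_estimate(simulate_callbacks(ref, 5, False, rng), ref)
print(f"site that reports probes: {reporting:.2f} extra callbacks per probing bin")
print(f"site that drops probes:   {silent:.2f} extra callbacks per probing bin")
```

The compliant estimate should land near the probe level and the non-compliant one near zero; averaging over more bins tightens both.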

With all of that readership data in hand, you can both accurately bill participants in the system, and accurately pay content owners.  It doesn’t have the sampling issues of ASCAP that tend to short-change smaller contributors.  You could even bill based on actual usage, instead of signing a contract upfront and calculating fees with an obscure formula, as ASCAP does.  
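
With complete counts, splitting a payment pool is simple proportional arithmetic.  A sketch, with made-up owners and amounts:

```python
def allocate_payments(pool, views_by_owner):
    """Split a payment pool in proportion to each owner's recorded views."""
    total = sum(views_by_owner.values())
    if total == 0:
        return {owner: 0.0 for owner in views_by_owner}
    return {owner: pool * views / total for owner, views in views_by_owner.items()}

# Hypothetical month: a large paper and a very small one.
print(allocate_payments(1000.0, {"big-daily": 990, "small-weekly": 10}))
```

Because nothing is sampled, the small owner gets a small but non-zero share instead of falling below a significance threshold.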

In short, this system seems to have some promise, and might in fact work.  It’s accurate, easy to implement, and fair to all content owners and consumers.


Topsy Providing a Better Twitter Search

May 27, 2009

Topsy, a new Twitter search site, launched yesterday.  Searching for “Daylife” or “Obama” gives you nice results.  The ranking is purely based on the number of times the link has been tweeted, for whatever time window you select:  hour, day, week, month, all-time.  When grouping by URLs, they resolve compressed versions from bit.ly or tinyurl.  The resolved link, along with the number of tweets, and the text of a recent tweet, are displayed.  Along the right margin, you get a list of common users and the number of times they’ve contributed to the result set.  You can also search within a particular user.
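
That ranking is easy to sketch.  This is my guess at the logic, not Topsy’s code; the window lengths (a 30-day month, for instance) and the sample data are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOWS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "all": None,  # no cutoff
}

def rank_links(tweets, window, now):
    """Rank resolved URLs by how often they were tweeted inside the window.

    `tweets` is an iterable of (resolved_url, posted_at) pairs.
    """
    span = WINDOWS[window]
    counts = Counter(
        url for url, posted_at in tweets
        if span is None or now - posted_at <= span
    )
    return counts.most_common()

now = datetime(2009, 5, 27, 12, 0)
sample = [
    ("http://example.com/a", now - timedelta(minutes=5)),
    ("http://example.com/a", now - timedelta(hours=3)),
    ("http://example.com/b", now - timedelta(days=2)),
]
print(rank_links(sample, "day", now))
```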

If you click on a tweet group, you’ll get a list of posts, and a list of “What’s Related”.  The quality of the related tweet groups is a bit sketchy.  Tweets are short, not much to go on.  You could imagine just pulling out the most statistically overrepresented words from the tweet group.  For the baseline you’d want to calculate word frequency from a lot of tweets, not a corpus of standard written English.  Tweets have their own language.  Take the top few words, throw them into a full-text search engine, and group the results by their resolved URL.  I’m sure there are other ways, but that should be easy and give you good results.  And only index tweets that have a URL in them.
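
A rough sketch of picking overrepresented words.  The scoring (count times a smoothed log ratio of frequencies) is one simple choice among many, and the tokenizer and sample tweets are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a word-like character."""
    return re.findall(r"[a-z0-9#@']+", text.lower())

def overrepresented_words(group_tweets, baseline_tweets, top_n=5):
    """Rank words that are more frequent in the group than in the baseline."""
    group = Counter(w for t in group_tweets for w in tokenize(t))
    base = Counter(w for t in baseline_tweets for w in tokenize(t))
    group_total = sum(group.values())
    base_total = sum(base.values())
    vocab = len(set(group) | set(base))
    scores = {}
    for word, count in group.items():
        # Add-one smoothing so words unseen in the baseline do not divide by zero.
        p_group = (count + 1) / (group_total + vocab)
        p_base = (base[word] + 1) / (base_total + vocab)
        scores[word] = count * math.log(p_group / p_base)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

group = [
    "wolfram alpha launch is live",
    "watching the wolfram alpha launch",
    "rt wolfram alpha launch video",
]
baseline = ["rt the game is on", "watching the game", "is it live", "rt this is the video"]
print(overrepresented_words(group, baseline, top_n=3))
```

The top words would then go to the full-text search, with results grouped by resolved URL.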

As I mentioned before in a post about real-time search, it’s not terribly complicated.  But it is nicely done.  The tricky part is getting a feed of tweets from Twitter.  Once you have it, you pull every URL.  For each one, do an HTTP HEAD request to get past bit.ly, tinyurl and the like.  That will save you some time and bandwidth over downloading the entire page.  Perhaps have a short list of URL-shortening services and use those to filter.  Some of them, like bit.ly, also provide an API where you can expand the shortened version without doing the HEAD request.
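
A sketch of the HEAD approach using only the Python standard library.  The shortener list is illustrative and the hop limit is an arbitrary safety cap; error handling and retries are left out.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Illustrative list; a real one would be longer and kept up to date.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}

def is_shortened(url):
    """True if the URL's host is on the known-shortener list."""
    return (urlsplit(url).hostname or "").lower() in SHORTENER_HOSTS

def expand_once(url, timeout=5.0):
    """Send one HEAD request and return the redirect target, if any."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("HEAD", target)
        location = conn.getresponse().getheader("Location")
    finally:
        conn.close()
    # A Location header may be relative, so resolve it against the request URL.
    return urljoin(url, location) if location else url

def resolve(url, max_hops=5):
    """Expand shortened links, skipping hosts that are not shorteners."""
    for _ in range(max_hops):
        if not is_shortened(url):
            break
        expanded = expand_once(url)
        if expanded == url:
            break
        url = expanded
    return url

print(resolve("http://example.com/story"))  # not a shortener, so no request is made
```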

Once you’ve done that, it’s all a matter of setting up some indexing.  You have the URLs, timestamps, usernames, and tweet text.  Since you’re grouping by URLs, you might also consider only indexing tweets that have links; that will reduce the size of your indices.  This is not rocket science.  It’s about having access to the data.  I was baffled by Twitter’s $15MM purchase of Summize.  Summize, and Topsy, could have been built for far less.  I suppose in the case of Summize a premium was paid to get it quickly.  According to CrunchBase, Topsy raised $15MM.  Of that, $11MM was raised last December.  I wonder why they need all that money.  Right now they list 14 people on their web site, and their product feels like a $4MM product; what they have so far is not terribly complicated even when implemented for scale.  Perhaps they have big plans for the future, or anticipate big hardware bills and low revenue in the short run.

So although I like Topsy very much, they are dependent on Twitter for their feed of tweets, and whatever current value you might give to their enterprise, most of that value is in their access to the complete tweet feed.  Twitter controls that, and can shut it off, so they don’t seem like a good target for acquisition for anyone except Twitter.  They could help defend their value with patents, so that for example Twitter could not copy and implement it themselves, but it’s not that novel, and one could invent around it.  I don’t see any patent applications on file at the USPTO.  Or as with Summize, they might hope to fetch another premium for getting a search product out quickly.  Twitter already did that once, and I doubt they’ll do it again; they have search expertise in-house now.  Perhaps they want to get a big user base, and hope that Twitter won’t have the guts to shut them down: negotiating with someone who’s holding a gun to your head, and hoping they don’t want to bloody their shirt.

You don’t raise $11MM without a plan, even with all the crazy hype over real-time search.  You have to think BlueRun Ventures and Ignition Partners know what they’re doing.  That money probably put the finishing touches on their current search offering, but will largely be going towards new products that would be better targets for acquisition.  Their investors have experience with mobile technology, perhaps they have a clever way to move this onto mobile devices.  But they’ll still be twice-removed from the real revenue — text messaging subscription fees.  

Who knows?  Anyhow, that’s my take on Topsy:  great site, simple technology, uncertain business model.


Manhattanhenge: Loving Wolfram Alpha

May 26, 2009

A coworker posted something today about Manhattanhenge occurring on May 30 of this year.  That’s when sunset aligns with the east-west grid of Manhattan.  The streets are offset from true east-west by 28.9 degrees, according to Wikipedia.  So I went to Wolfram Alpha to figure out the solar azimuth on that day.  It said 30 degrees 30 minutes.  Wait a second, that’s not 28.9 degrees!  So initially I was confused.  However, you’ll note that it indicates an altitude of -1 degrees.  Because of refraction, the sun appears higher than it actually is, in this case about a degree.  Since the sun rises and sets at an angle to the horizon, the azimuth is also affected.  Furthermore, sunset is the point at which the trailing edge vanishes.  For dating Manhattanhenge, you probably want the leading edge so that it sits right on top of the street.  The angular diameter of the sun is 31 arc-minutes.  So you want the date when the azimuth is at 28.9 degrees and the altitude is about 30 minutes.  That gives you May 30th at 8:13pm.  It gives 29 degrees, close enough to 28.9.  However, the altitude is +30 minutes.  Visual sunset is closer to an altitude of -1 degrees, so I’d think the correct altitude to look for would be -30 minutes.  However, the astronomers that gave the 30th as the date perhaps used astronomical sunset instead of visual sunset.

The Hayden Planetarium gives 8:17pm as the time of Manhattanhenge.  For that date and time, Wolfram Alpha provides an altitude of -10 minutes, and an azimuth of 29 degrees 40 minutes.  Perhaps the altitude of -10 minutes is a better number for when the leading edge of the sun will hit the street.

Anyhow, you can get pretty close with just Wolfram Alpha and some simple geometry.  I’m really loving that site.

—-
update:  I found a better number for the shift in the apparent position of the sun at sunset, about 35 minutes.   The -1 degree altitude reported at sunset by WA is off, or perhaps it has been rounded.  If you back out the diameter of the sun, which is 31 minutes, that’s pretty close to the numbers WA gives for Hayden Planetarium’s date and time.  So with a bit of extra knowledge, you can accurately find the date and time of Manhattanhenge from Wolfram Alpha.  It won’t solve for it, or at least I can’t figure out how to ask it for dates and times where the azimuth is 28.9 and the altitude is -5 or -10 minutes, but you can at least do a quick search around likely dates to find the right numbers.
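
The arithmetic in the update can be written out in a few lines.  This sketch assumes that the altitude WA reports is the geometric altitude of the sun’s center, which I have not confirmed, and it uses only the two figures quoted above.

```python
REFRACTION_ARCMIN = 35.0    # apparent lift of the sun near the horizon (from the update)
SUN_DIAMETER_ARCMIN = 31.0  # angular diameter of the sun

def altitude_at_sunset():
    """Center altitude, in arc-minutes, when the upper edge appears to touch the horizon."""
    return -(REFRACTION_ARCMIN + SUN_DIAMETER_ARCMIN / 2)

def altitude_after_backing_out_diameter():
    """The update's arithmetic: start from the refraction shift, back out one diameter."""
    return -REFRACTION_ARCMIN + SUN_DIAMETER_ARCMIN

print(altitude_at_sunset() / 60)              # sunset altitude in degrees
print(altitude_after_backing_out_diameter())  # target altitude in arc-minutes
```

Under that assumption, sunset works out to -50.5 arc-minutes, which would round to the -1 degree WA shows, and backing out the diameter gives -4 arc-minutes, in the -5 to -10 range mentioned above.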


Google to Help Readers Dig Into Stories

May 22, 2009

A close reading of the Financial Times interview with Google CEO Eric Schmidt revealed an interesting remark (emphasis mine):

we are very interested in trying to develop online news versions that somehow address the immediate needs of people and for which advertising works better. Without commenting specifically about products it seems to me that the newspaper that I read online should remember what I read. It should allow me to go deeper into the stories. It’s that kind of a discussion that we’re having.

Google will be helping readers go deeper into stories.  That’s a crowded space, so move over everyone and make way for the 800lb gorilla.  We’ll see what they come up with.  I suspect, based on things I’ve heard, that if they do such a thing they’d give it away for free, although that would probably mean sharing advertising revenue.  

It’s also probably a naive goal.  Most readers don’t like to go deeper into stories.  Most of them don’t get through the full article.  There’s nothing wrong with that.  I would venture that if more than 20% of readers made it to the end of an article, it should have been longer.  You had their attention, and you probably had more material you could have included.  No, I think most readers, instead of going deeper, are interested in going sideways.  That’s one reason I find Wikipedia so addictive: there are so many paths that let you skip sideways to a loosely related topic.  But perhaps I’m over-analyzing Schmidt’s arguments.  I sure would like to dig deeper into them.

Even if nothing comes of this, or it’s stillborn, they’re chipping away at their image as an enemy of publishers.


Twitter and the Hype Around Real-Time Search

May 18, 2009

I came across an article in CNN on new search engines that aspire to supplement Google.

And sites like Twitter are trying to capitalize on the warp-speed pace of online news today by offering real-time searches of online chatter — something Google’s computers have yet to replicate…  If you search Google news, the results will be recent, but not live. That’s where Twitter’s search comes in. It searches the site’s micro-blog posts by the second, allowing users to see what’s buzzing on the Web at any instant. 

Wow, Twitter/Summize can sort matches by date and return the most recent ones.  First of all, Google’s computers can replicate that trivially, and they already offer real-time search of my Gmail and chats.  Sorting by date, most recent first, is far simpler than looking at relevance and removing duplicates, which is what Google News and their web search are doing.  It’s actually trivial to implement.  The only hard part is scaling it up to Twitter size, and that’s only hard in the sense that it’s a lot of work, since it’s a well-beaten path.  Summize did it with a very small crew.

Google does offer the ability to sort by date.  However, it takes some time for them to acquire content.  Twitter is a bit easier: everyone posts messages to a central system, and the total amount of text you need to index is fairly small.  Back in November they reported about 15 to 20 tweets per second, and traffic via compete.com has gone up by a factor of 6 since then, so I figure they’re doing around 10 million tweets per day.  Call it 100 bytes per tweet, and it’s around a gigabyte per day that needs to be indexed.  The inverted index is simple: for every word you keep a date-sorted list of occurrences, along with position to handle quoted phrases.
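
A toy version of that index, to show how little is involved.  It is a sketch of the idea, not anyone’s production design: tokenizing is a plain whitespace split, and results are sorted at query time rather than kept sorted.

```python
from collections import defaultdict

class TweetIndex:
    """Inverted index: word -> list of (timestamp, tweet_id, position)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, tweet_id, timestamp, text):
        """Record every word of a tweet along with its position."""
        for position, word in enumerate(text.lower().split()):
            self.postings[word].append((timestamp, tweet_id, position))

    def search_phrase(self, phrase):
        """Return ids of tweets containing the exact phrase, newest first."""
        words = phrase.lower().split()
        if not words:
            return []
        # Candidate (tweet_id, start position) pairs come from the first word.
        candidates = {(tid, pos): ts for ts, tid, pos in self.postings.get(words[0], [])}
        for offset, word in enumerate(words[1:], start=1):
            # Keep a candidate only if this word appears `offset` places later.
            present = {(tid, pos - offset) for _, tid, pos in self.postings.get(word, [])}
            candidates = {key: ts for key, ts in candidates.items() if key in present}
        newest = {tid: ts for (tid, _), ts in candidates.items()}
        return [tid for tid, _ in sorted(newest.items(), key=lambda kv: kv[1], reverse=True)]

index = TweetIndex()
index.add(1, 100, "real time search is hype")
index.add(2, 200, "search in real time")
index.add(3, 300, "Real time search again")
print(index.search_phrase("real time search"))
```

A single-word query is just a phrase of length one, so the same method covers both cases.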

They offer a few other options, like user and location, which you’ll also have to index.  You could fit a week’s worth of traffic into memory on a relatively cheap box.  Relatively old data can be kept on disk.  You could store an index of all tweets on one machine, so managing data replication is simple — everyone gets a copy of everything.   At some point you’ll have to partition the data, which you can do by date, or randomly and have them execute in parallel.  
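
Partitioning and merging can be sketched the same way.  Here tweets are spread across shards by id as a stand-in for random assignment, and the shards are queried one after another; since each shard query is independent, they could just as well run in parallel.

```python
import heapq
from itertools import islice

def partition(tweets, n_shards):
    """Assign (timestamp, tweet_id, text) tuples to shards by tweet id."""
    shards = [[] for _ in range(n_shards)]
    for tweet in tweets:
        shards[tweet[1] % n_shards].append(tweet)
    return shards

def search_shard(shard, word):
    """Matches from a single shard, newest first."""
    hits = [t for t in shard if word in t[2].lower().split()]
    return sorted(hits, reverse=True)

def search_all(shards, word, limit=10):
    """Query every shard and merge the per-shard results by recency."""
    per_shard = [search_shard(shard, word) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [tweet_id for _, tweet_id, _ in islice(merged, limit)]

tweets = [(1, 1, "a b"), (2, 2, "b c"), (3, 3, "a c"), (4, 4, "a")]
print(search_all(partition(tweets, 2), "a"))
```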

It looks like Twitter has trimmed their search index aggressively.  At the moment, I see no search results prior to April 29.   On April 29 however I start to see things again.  Note that by the time you read this, both results may return nothing.   Searching for “Daylife” in the Daylife feed also stops 18 days ago.  So they seem to have 20 days in their index.  I’ll bet that works out to around 32GB for their index, and they’ve stuffed that much memory onto a single box which they’ve replicated.  Or perhaps a couple of boxes with less memory and the data spread randomly across them.  For older Tweets, they haven’t bothered with a slower disk-based search option, or with the hardware costs necessary to stuff it all into memory.  
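
The back-of-the-envelope numbers are quick to check.  The inputs are the estimates used above (15 to 20 tweets per second in November, a sixfold increase since, 100 bytes per tweet, 20 days retained); none of them are measured values.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TWEET = 100   # rough guess
DAYS_RETAINED = 20      # apparent depth of the search index

def tweets_per_day(rate_per_second, growth_factor):
    """Daily tweet volume from a per-second rate and a growth multiplier."""
    return rate_per_second * growth_factor * SECONDS_PER_DAY

low = tweets_per_day(15, 6)
high = tweets_per_day(20, 6)
raw_gb_per_day = high * BYTES_PER_TWEET / 1e9

print(f"tweets per day: {low:,} to {high:,}")
print(f"raw text per day: {raw_gb_per_day:.2f} GB")
print(f"raw text for {DAYS_RETAINED} days: {raw_gb_per_day * DAYS_RETAINED:.1f} GB")
```

That is raw text only; the index with positions, users and locations will be larger, which is why a guess in the tens of gigabytes seems plausible.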

It makes sense, they’ve kept it simple to focus on the most important feature: quickly searching only the most recent tweets.  But real-time search is nothing tricky, nor is it anything new.  The only new thing here is the popularity of Twitter.


Springtime

May 17, 2009

The blog has slowed down the last month and a half, spring has just been too busy:  getting the garden started, refinishing the sunroom, brewing beer for the summer, and of course having fun with the baby.  But the onslaught of new companies and products has continued unabated, so I have some catching up to do, and will backfill a few of them in the coming weeks.


The Wolfram Alpha Launch

May 17, 2009

Wolfram Alpha launched on Friday evening.  I watched a bit of the video; it was an interesting look at their launch process and hardware.  They have 6 colos, the largest of which has around 4,000 cores and a few hundred terabytes of data, with fairly high density 32-core 2U units from Dell.  Pretty big guns for a launch, so they mean business.  Necessary perhaps because of the press they’ve received, which is way over the top.  At this moment, it’s actually the first link on the Drudge Report, claiming that they’re challenging Google and have “all the answers”.  Wolfram isn’t claiming to have all the answers, but they can answer a few that Google doesn’t.  And they generally answer with hard data.  I like hard data.

First example, the word “thought”.   It gives the word frequency in spoken and written language, broader terms, narrower terms, a visual synonym network, the percent of the .  Beautiful!  Except it neglects the past participle and only lists the noun.  If you search for the word “ran”, it gives you a page about the drug ranitidine, but as a link if you wanted the word.  

Next example, it can factor fairly large numbers.  63 digits!  That’s a hard problem, with no known efficient algorithm on classical computers.  And you can do many of the things you might do with Mathematica, like take the Fourier transform of something.  And conveniently, you can click on an equation and get a plaintext version that you can paste back into the query bar.  So consider it Mathematica-Lite.

In chemistry, it has various thermodynamic properties, will calculate the vapor pressure at different temperatures, and boiling points at various pressures.  You can compare silicon dioxide and silicon nitride.  It assumes you are interested in quartz, but you can correct that assumption.  It’s easier than looking up something in the CRC or Merck, if folks still do that these days.  In physics, it’s a bit limited.  Nothing terribly useful beyond being a light front-end for Mathematica, although it offers data on particles.  It won’t solve any abstractly defined problems, but it will tell you the stopping power of various materials for beta radiation.  No data for hot neutrons, alpha particles, or photons it seems.

The financial information is great — return vs. volatility, alpha, beta, R-square.  You can compare the GDP between the United States and Japan.   

Looking at the long list of examples, and all of the fields they cover, it’s an impressive assembly of data.  The interface is brittle, you have to learn how to format the questions correctly, and it’s still spotty in areas.  It’s clearly not a general-purpose search engine, so it’s a pity some press outlets have decided they want it to be a Google-killer.

Other reviews have been mixed.  I think it will prove to be a useful research tool for a lot of people.  The interface frequently makes the wrong assumption about what you’re looking for, and it’s far from complete.  That’s apt to improve, but I think there will still be a significantly steeper learning curve than Google.  You need to learn how to write good queries for the particular domain you’re working in, and until you do your results will be spotty.

I’m not sure how they’re going to make money.  It’s probably not something advertisers are going to like.  Perhaps it will help with sales of Mathematica.  Clearly they have a lot of data locked up in their servers, and you can only get at it in tiny slices at a time.  They could make good money licensing fuller access to that data, even if it were still housed remotely.  Take for example the GDP data, or the census data.  Imagine if you could launch a local version of Mathematica, then call up an array representing that data, and work it into a model.  They have a few hundred terabytes?  All of that could be at your fingertips.