50% of Twitter Posts aren’t Useless

August 17, 2009

I saw an interesting headline on DrudgeReport about Twitter being 40% pointless babble.   But the real news is that 50% aren’t pointless babble, spam, or self-promotion.  That’s pretty good signal-to-noise.


The Fallacies of Arnon Mishkin

August 14, 2009
Arnon Mishkin wrote an interesting piece about the fallacy of the link economy (http://paidcontent.org/article/419-the-fallacy-of-the-link-economy/).  Here’s one of the few pertinent facts cited in his article:
The vast majority of the value gets captured by aggregators linking and scraping rather than by the news organizations that get linked and scraped. We did a study of traffic on several sites that aggregate purely a menu of news stories. In all cases, there was at least twice as much traffic on the home page as there were clicks going to the stories that were on it.
There are a few fallacies in there.  First is the notion of value, and that phrase, “vast majority”.  Value online means selling ads, and there are premium ads, and there are remnant ads.  Aggregators get low-value remnant ads, but publishers get premium ads in some cases.  A page view for a publisher is on average more valuable than a page view for an aggregator.  How much?  The number Jarvis and his group at CUNY came up with was $5-7 RPM for a small publisher, and for an aggregator you’ll get less.  In the case of Digg, say around $2 RPM based on numbers from Silicon Alley Insider (http://www.businessinsider.com/2008/12/diggs-miserable-business) at 400 million pageviews per month.  That’s probably on the high end for aggregators; most are getting less.  Let’s say publishers have three times the revenue per view of aggregators, which I think is conservative.
Second, there is his claim that “twice as much traffic on the home page as there were clicks going to the stories”.  This is misleading.  He’s comparing pageviews with visitors.  But a visitor generates at least one and often more pageviews.  Let’s say each visit leads to an average of 4 page views, which is typical for news publishers.  Cranking through the numbers, that’s 4*(1/2)*3 = 6 times more revenue for publishers than for the aggregator.  So where is his “vast majority”?
There’s another fallacy buried in there:  that this is a zero-sum game between aggregators and publishers, and demand is constant.  You might conclude from the above numbers that aggregators are taking one-seventh of the revenue away from news publishers.  But demand for news is elastic.  It’s conceivable that aggregators are driving more traffic to publishers than they would get without them.  Not likely in my opinion, but the point is that if they’re taking one-seventh of the revenue, they’re probably creating some new revenue as well by increasing demand for news.
So Arnon’s facts just don’t support his claims, and his reasoning is flawed.  But let’s not stop here.  Look at the balance sheet of aggregators.  They’re not getting rich, and perhaps only a few fools went into this industry thinking it would be lucrative.  I certainly didn’t have any illusions about lucrative pay.  Some companies have had lucrative buyouts, but then some handsome fees were paid to purchase newspapers a few years ago.  Every industry has its bubble.  You can point to Google, but Google makes its money elsewhere, not aggregating news.  They’re barely running any ads.  Digg, as I posted before, isn’t doing well, although probably because its costs are out of control. (http://www.businessinsider.com/2008/12/diggs-miserable-business)
All that being said, I still agree with his final three points.  However, reclaiming value from aggregators isn’t going to help publishers much.  They need subscribers and a pay wall.  Not an iron curtain, but a permeable pay wall along the lines of the Wall Street Journal.  There’s no pot-o-gold out there in the hands of aggregators to help you pay for all that good journalism.

Arnon Mishkin wrote an interesting piece about the fallacy of the link economy.  Jeff Jarvis already responded in detail, but he’s a bit too congenial.  I’m a numbers person, so I’m more blunt.  Here’s one of the few pertinent facts cited in Arnon’s article:

The vast majority of the value gets captured by aggregators linking and scraping rather than by the news organizations that get linked and scraped. We did a study of traffic on several sites that aggregate purely a menu of news stories. In all cases, there was at least twice as much traffic on the home page as there were clicks going to the stories that were on it.

There are a few errors in there.  First is the notion of value, and that phrase, “vast majority”.  Value online means selling ads, and there are premium ads, and there are remnant ads.  Aggregators mostly get low-value remnant ads, but publishers get premium ads in some cases.  A page view for a publisher is on average more valuable than a page view for an aggregator.  How much?  The number Jarvis and his group at CUNY came up with was $5-7 RPM for a small publisher.  For a major aggregator like Digg, it’s around $2 RPM based on numbers from Silicon Alley Insider, at 400 million pageviews per month.  That’s on the high end for aggregators; most are getting less.  So let’s say publishers have three times the revenue per view of aggregators, which I think is conservative.

Next, there is his claim that he saw “twice as much traffic on the home page as there were clicks going to the stories”.  This is misleading.  He’s comparing pageviews with visitors, and those aren’t equal.  A visitor generates at least one and often more pageviews.  Let’s say each visit leads to 3 page views, which is about average for news publishers, although you might argue that traffic from aggregators is less likely to stick around.  Also, news outlets generate their own traffic; it doesn’t all come through aggregators.  For the NYTimes about half comes from other referrers, only some of which are aggregators.  So there’s another factor of two.  Cranking through the numbers, that’s 3*(1/2)*3*2 = 9 times more revenue for publishers than for the aggregator.  So is that a “vast majority” of the value?  To me a majority is more than 50%, so let’s peg a “vast” majority at somewhere in excess of 75%.  Even allowing for some errors, and I’d have to be off by a lot, aggregators aren’t getting anywhere near 75% of the revenue from online news.
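
To make the arithmetic above easy to check or re-run with different guesses, here is a minimal sketch in Python.  The variable names are mine, and the inputs are the rough estimates from the text, not measured values.

```python
# Rough estimates taken from the paragraphs above (assumptions, not measurements).
pageviews_per_visit = 3       # publisher pageviews generated by one referred visit
clicks_per_home_view = 1 / 2  # Mishkin: home page traffic is twice the click-throughs
rpm_ratio = 3                 # publisher revenue per view / aggregator revenue per view
own_traffic_factor = 2        # about half of publisher traffic arrives by other routes

# Publisher revenue relative to aggregator revenue.
publisher_to_aggregator = (
    pageviews_per_visit * clicks_per_home_view * rpm_ratio * own_traffic_factor
)

# Aggregator share of the combined revenue.
aggregator_share = 1 / (1 + publisher_to_aggregator)

print(publisher_to_aggregator)     # 9.0
print(round(aggregator_share, 2))  # 0.1
```

Even if one of these factors is off by a factor of two, the aggregator share stays far below the 75% that a “vast majority” would require.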

There’s another fallacy buried in there:  that this is a zero-sum game between aggregators and publishers, and demand is constant.  You might conclude from the above numbers that aggregators are taking one-tenth of the revenue away from news publishers.  But demand for news is elastic.  It’s conceivable that aggregators are driving more traffic to publishers than they would get without them.  Not likely in my opinion, but the point is that if they’re taking one-tenth of the revenue, they’re probably creating some new revenue as well by increasing demand for news.

So I don’t buy Arnon’s argument.  But let’s not stop there.  Look at the balance sheet of aggregators.  They’re not getting rich, although perhaps a few fools went into this industry thinking it would be lucrative.  I certainly didn’t have any illusions.  Some companies have had lucrative buyouts, but then some handsome fees were also paid to purchase newspapers a few years ago.  Every industry has its bubble.  You can point to Google, but Google makes its money elsewhere, not aggregating news.  They’re barely running any ads.  Digg, as I mentioned earlier, isn’t doing well, although probably because its costs are out of control.

All that being said, I still agree in principle with his final three points.  However, reclaiming value from aggregators isn’t going to help publishers much.  They need subscribers and a pay wall.  Not an iron curtain, but a permeable pay wall along the lines of the Wall Street Journal.  There’s no save-my-business-model pot of gold out there in the hands of aggregators to help you pay for all that good journalism.


Gartner Hype Cycle Heading Towards Trough of Disillusionment

August 13, 2009

I was reading a post on the latest Gartner Hype Cycle for emerging technologies, and it occurred to me that Hype Cycles are heading towards the Trough of Disillusionment, and don’t actually have a Plateau of Productivity.


The WSJ’s Permeable Pay Wall, Part 2

August 12, 2009

I dug up some additional information on the WSJ’s pay wall, from an excellent interview with Nieman Journalism Lab back in April.  It explains some of the method behind my observations in an earlier post:

  • Politics, arts, opinion, and breaking news are all free
  • Very popular articles are free
  • Exclusives that will just be repeated elsewhere (“WSJ is reporting…”) are free

The WSJ’s Permeable Pay Wall

August 12, 2009

I was reading a post from Jeff Jarvis on Rupert’s pay wall, when it occurred to me that not everyone knows all of the details of how it works.  It’s not as hard a pay wall as the one Times Select put up a few years back.  Both walls cover only some of the content, but there are a few situations where the WSJ lets anyone in.

The WSJ content is either flagged as Subscriber Content, which has a small key next to the headline, or is open to anyone.  However, Subscriber Content is available to anyone who is referred from one of the following sites:

  • Google (news or search, anything Google.com, but not GMail)
  • MySpace
  • Digg
  • Marketwatch
  • Barrons (online.barrons.com)

I checked a host of other aggregators and search engines:  Bing, MSN, Yahoo, Slashdot, StumbleUpon, Twitter, Mixx, Newsvine, Facebook, Fark, Reddit, Drudge Report, and so on.  Nothing else hit.  MySpace, Marketwatch, and Barrons are easy to explain; they’re all News Corp. properties.  The New York Post and other News Corp. properties don’t get a pass.  Digg is probably the lone aggregator because it drives a lot of traffic, and was head-and-shoulders above similar sites back when they set the pay wall rules.  Those rules could probably use some revisiting.  Why Google but not Bing?  They don’t have a special relationship with Google, so I don’t see a reason to discriminate.
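
The behavior I observed can be sketched as a simple referrer check.  This is only a guess at the logic based on the list above; the domain list, the GMail exclusion, and the function names are my assumptions, not the WSJ’s actual configuration.

```python
from urllib.parse import urlparse

# Hypothetical allow-list reconstructed from the referrers that got through.
ALLOWED_REFERRER_DOMAINS = (
    "google.com", "myspace.com", "digg.com", "marketwatch.com", "barrons.com",
)
# GMail referrals did not get a pass, so exclude that host explicitly.
BLOCKED_HOSTS = ("mail.google.com",)


def referrer_gets_pass(referrer: str) -> bool:
    """Return True if the referring page is on an allowed domain."""
    host = (urlparse(referrer).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return False
    return any(
        host == domain or host.endswith("." + domain)
        for domain in ALLOWED_REFERRER_DOMAINS
    )


def can_read(is_subscriber_content: bool, is_subscriber: bool, referrer: str = "") -> bool:
    """Open content and subscribers always pass; otherwise check the referrer."""
    if not is_subscriber_content or is_subscriber:
        return True
    return referrer_gets_pass(referrer)
```

For example, `can_read(True, False, "http://news.google.com/")` is allowed, while the same request referred from `http://www.bing.com/` is not.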

Occasionally you will see something labeled Subscriber Content which has been opened to everyone.  For example, I found a popular WSJ article on Twitter that is labelled Subscriber Content which can be read in full even if you aren’t a subscriber.  How that happens I’m not sure.  Perhaps if a subscriber shares a link for some protected content, they might decide to make it free, either automatically or by alerting a human.  The Drudge Report frequently links to WSJ articles, but surveying a few days of links I couldn’t find examples of Subscriber Content being linked to.  Although I would wager that if Drudge did link to Subscriber Content, they’d open it up.

So the WSJ has a permeable pay wall.  Jeff’s point about not getting as much Googlejuice is a good one, but I doubt the WSJ is losing out on much.  They’ve probably had a high rank from the day Google launched, and enough links around the web that their rank won’t be falling anytime soon.  They certainly lose out on some traffic by not giving bloggers and newer social networking sites the same pass as Digg and MySpace.

How much are they losing?  Who knows.  Looking at referrer data from compete.com and alexa.com, they don’t look too different from their peers.  Their traffic has been steadily increasing.  In fact, they’ve done a better job at adding pageviews than the NYTimes, which is down based on Alexa stats, and flat based on compete.com.

What does that tell me?  There are other important drivers of web traffic than just whether you have a pay wall.  Hmm, maybe it has something to do with quality, and Rupert is on to something.


A Peek at the Kosmix Backend

June 26, 2009

As you probably know I’m always looking at how companies in our space do their stuff.  A few days ago Ted Dziuba provided some interesting information on the Kosmix backend, which I found through Josh Young’s nice “What I’m Reading” feed:

…Kosmix… wrote its own data store in C++.  It’s basically a clone of Google’s GFS…

After I edited out the snark and obscenities, that was all that’s left of his post.  The rest of it, calling them a search engine, would only rankle the Kosmix co-founders.  Dziuba is an interesting read, and he has a good sense of humor.  But he’s never going to make it as a pro linebacker.  He’s just too small and weak.  He needs to spend less time at the keyboard and more at the gym, and get his hands on some steroids.


Cheap EVRI Knockoff on the Huffington Post

June 25, 2009

Is it just me, or does the connections widget on the left column of this page look suspiciously like EVRI’s widget?  When I saw it, I thought it was EVRI’s work.  The colored circles, the topic name in a box on top of the circle, lines between them, the size ratio, the line color, all the same.  But there’s no Evri branding, and it doesn’t make a call to the Evri site to fetch data as the WashPo installation does.  So I don’t think it’s EVRI.  They’re using OpenCalais for local news, so it could be their tagging, or it could be human tagging with a simple database to power the widget.  Who knows, but it would be better if they were a bit more distinctive in the presentation.  It is confusingly similar to EVRI’s widget.

Judge for yourself.  Here’s a screen shot of the HuffPo widget:

huffpo-evri-widget

And here’s the Evri widget as it appears on their site:

evri-widget

Samuel Clay, one of the crew here at Daylife, says EVRI uses SVG/Raphael, and the HuffPo is just images, so it is a different technology and almost certainly not EVRI’s work.


Harvard BSchool Twitter Study and Bad Statistics

June 5, 2009

Harvard Business School put out some interesting statistics back on June 1.  They raise some interesting questions, but the statistics have some issues, so they should be taken with a grain of salt.

The first is their comparison with Wikipedia editors.  They state that:

…the pattern of contributions on Twitter is more concentrated among the few top users than is the case on Wikipedia, even though Wikipedia is clearly not a communications tool. This implies that Twitter’s resembles more of a one-way, one-to-many publishing service more than a two-way, peer-to-peer communication network.

However the comparison is based only on Wikipedia editors, not those that use Wikipedia.  Just as their study indicates many on Twitter merely follow, many Wikipedia users only read articles and do not contribute as editors.  A fair comparison would exclude some fraction of Twitter users.

The gender biases are also interesting.  However, they limit the gender study to “strongly gendered names”.  Furthermore, as a coworker of mine pointed out, women may be less likely to expose their real name online.  So their gender sampling method may be biased.  Would it invalidate their conclusions?  Perhaps not; that would require a sampling bias at least as large as the gender bias they measure (10% or so).  But they don’t indicate what fraction of users could be assigned a gender, nor did they investigate what fraction of the population can be gender-typed using their list.  The latter would be easy to do with census data.

Studies like this are never perfect, but I generally expect those that conduct studies to call out possible biases in their sampling, and report more fully on the data and their methodology.  I would discard their conclusions about Twitter as a one-way publishing platform as unsupported by their data, and would take their gender bias numbers as a rough estimate with potentially significant biases.


Newspapers Considering the ASCAP Model For Online Payments

June 4, 2009

This morning the WSJ reported that at a private meeting of newspaper executives last week, one of the models they considered for extracting revenue online was ASCAP, the American Society of Composers, Authors, and Publishers.  (Page A11, or online here).  That’s great, but there are a few technical considerations, and you can do a lot better than ASCAP.  So I thought I would weigh in with some concrete suggestions about how such a system might be implemented so that it is accurate and fair.

First, some background.  I was the Director of Research at MediaSentry, where we provided music companies with intelligence about online piracy, attempted to interdict or frustrate would-be downloaders, and collected evidence for civil actions.  So I know quite a bit about monitoring and protecting online content.  Also, through a composer I’m acquainted with, I’ve heard a good deal about the drawbacks of ASCAP. 

ASCAP pays composers according to a formula based on the amount of play-time for their songs.  That play-time includes play in venues such as bars, and just about anything except humming it in the shower.  On the other end of the cash flow, bars, radio stations, and television programs all pay into ASCAP based on estimates of the size of the audience.  The tricky part is figuring out how much each composer gets paid.  That requires a lot of sampling, since your neighborhood bar doesn’t report what songs they’re playing.  In practice it means listening to some radio stations and spending time in bars, or getting a few of them to self-report.  The sampling is such that small composers tend to get nothing, because they are below ASCAP’s threshold of statistical significance.

How do you build a system that is accurate and fair?  How would sampling work for online news, and how would you price membership for participants?  You just can’t get reliable pageview information.  You could ask for it, but it would have to be taken on good faith, and that’s no way to determine payments.  You can’t even get reliable traffic estimates for a site.  Alexa and Compete.com aren’t accurate enough, and won’t distinguish between a site’s own content and content they’ve taken from elsewhere.  There’s the Nielsen Ratings system, but read the criticisms on their Wikipedia entry before you seriously consider replicating their sampling method.

One practical system would be to put a callback on every article with an identifier for the site.  That callback reports to the newspapers’ version of ASCAP that the article has been viewed.  It’s basically the same as how Google Analytics works.  It can be defeated client-side, by simply blocking the request, but there is no incentive for consumers to do so:  it only impacts payments between other parties, and does not interfere with their web experience.  Sampling would not be required, as nearly complete information is available.
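
As a rough illustration of the callback idea, here is a minimal sketch of the URL an article page might request and a ledger that tallies the views.  The endpoint, the parameter names, and the `ViewLedger` class are all hypothetical; a real system would also need authentication and de-duplication.

```python
from collections import Counter
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint for the newspapers' clearinghouse.
CALLBACK_ENDPOINT = "https://clearinghouse.example/view"


def callback_url(site_id: str, article_id: str) -> str:
    """Build the URL a page would request when an article is viewed."""
    query = urlencode({"site": site_id, "article": article_id})
    return CALLBACK_ENDPOINT + "?" + query


class ViewLedger:
    """Server-side tally of article views, keyed by (site, article)."""

    def __init__(self) -> None:
        self.views = Counter()

    def record(self, url: str) -> None:
        """Parse one incoming callback URL and count the view."""
        params = parse_qs(urlparse(url).query)
        self.views[(params["site"][0], params["article"][0])] += 1


ledger = ViewLedger()
ledger.record(callback_url("example-daily", "story-123"))
ledger.record(callback_url("example-daily", "story-123"))
print(ledger.views[("example-daily", "story-123")])  # 2
```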

There are a few loop-holes.  How would you count full-text RSS feeds?  Triggering the same callback method for every feed request would overcount, and there is no mechanism for triggering callbacks in the reader.  There are many different types of RSS consumers.  Here I think you would make a different callback and discount RSS ingestion based on some agreed-upon factor, say 10%.  One would need to run a study to determine, given a certain number of RSS feed requests, how many result in reading or browsing an article.  What about other devices?  The Amazon Kindle?  Here again, you need a separate type of callback, with its own factor.  APIs?  Easy, just do a callback server-side whenever an article is fed out, but this is less easy to verify.
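
One way to picture the per-channel factors is as weights applied to raw callback counts.  In this sketch only the 10% RSS figure comes from the text; the channel names and the other weights are placeholders that a study would have to determine.

```python
# Hypothetical weights per delivery channel. Only the RSS value (10%) is the
# example from the text; the Kindle and API values are placeholders.
CHANNEL_WEIGHTS = {"web": 1.0, "rss": 0.10, "kindle": 0.5, "api": 1.0}


def weighted_views(callbacks_by_channel: dict) -> float:
    """Convert raw callback counts per channel into estimated article reads."""
    return sum(
        CHANNEL_WEIGHTS[channel] * count
        for channel, count in callbacks_by_channel.items()
    )


# 1,000 web views plus 5,000 RSS fetches discounted to 10%.
estimated_reads = weighted_views({"web": 1000, "rss": 5000})
```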

The newspaper version of ASCAP also needs verifiability.  This is relatively easy.  Sites can be crawled, and the presence of a client-side callback function verified.  In the same way, you can look at the source of any web page and determine whether they are using Google Analytics:  the javascript code is in plain view.  For APIs and RSS feeds, it’s not as easy to verify.  However, one can do a simple statistical analysis using a method similar to how lock-in amplifiers are used to extract signals in the presence of even very high noise (other users).  Make RSS or API calls yourself with some fixed modulation, at a level far below their overall usage.  That it can be so far below is the magic of the lock-in amp.  Confirm the presence of the same signal in the callbacks you receive.  It’s relatively easy to implement, and the statistics are bomb-proof.  Do it randomly, and relatively infrequently, to verify compliance.  For full-text RSS and API methods of delivery, you’ll need to warn participants to make the server-side callback within a second of the request, and not cache that request.  Yes, it’s being a bit anal, but just the presence of a verification mechanism will keep everyone in line.
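
Here is a simplified simulation of the lock-in idea, using an on/off (square-wave) probe pattern rather than a true sinusoidal reference.  The traffic model (Gaussian background of about 1000 callbacks per time slot), the probe level of 20, and all function names are assumptions chosen for illustration.

```python
import random


def reference_pattern(n_slots: int, seed: int) -> list:
    """Secret on/off schedule: 1 means we send probe requests in that slot."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n_slots)]


def lock_in_estimate(callback_counts: list, pattern: list) -> float:
    """Mean callbacks in 'on' slots minus mean in 'off' slots.

    Background traffic is uncorrelated with our schedule, so it averages out,
    leaving an estimate of the probe callbacks reported per 'on' slot.
    """
    on = [c for c, p in zip(callback_counts, pattern) if p == 1]
    off = [c for c, p in zip(callback_counts, pattern) if p == 0]
    return sum(on) / len(on) - sum(off) / len(off)


def simulate_callbacks(pattern, probes_per_slot, reports_probes, seed):
    """Simulated callbacks per slot: noisy background plus (maybe) our probes."""
    rng = random.Random(seed)
    counts = []
    for p in pattern:
        background = rng.gauss(1000, 100)  # other users: the "noise"
        ours = probes_per_slot * p if reports_probes else 0
        counts.append(background + ours)
    return counts


pattern = reference_pattern(20000, seed=1)
# A compliant site reports our probes; the estimate should land near 20.
honest = lock_in_estimate(simulate_callbacks(pattern, 20, True, seed=2), pattern)
# A site that skips the callback for feed requests should land near 0.
cheating = lock_in_estimate(simulate_callbacks(pattern, 20, False, seed=2), pattern)
```

The probe level here is 2% of the background and well below its slot-to-slot noise, yet averaging over many slots separates the two cases cleanly.  Fewer slots or a weaker probe would need a correspondingly looser threshold.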

With all of that readership data in hand, you can both accurately bill participants in the system, and accurately pay content owners.  It doesn’t have the sampling issues of ASCAP that tend to short-change smaller contributors.  You could even bill based on actual usage, instead of signing a contract upfront and calculating fees with an obscure formula, as ASCAP does.  
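
With complete counts, paying owners could be as simple as a pro-rata split of the collected fees.  The post does not specify a payment formula, so the proportional rule and the names below are assumptions.

```python
def allocate_payments(pool: float, views_by_owner: dict) -> dict:
    """Split a revenue pool in proportion to each owner's counted views."""
    total = sum(views_by_owner.values())
    if total == 0:
        return {owner: 0.0 for owner in views_by_owner}
    return {owner: pool * views / total for owner, views in views_by_owner.items()}


payments = allocate_payments(1000.0, {"big_daily": 900, "small_weekly": 100})
print(payments)  # {'big_daily': 900.0, 'small_weekly': 100.0}
```

Because every view is counted rather than sampled, the small contributor receives a share instead of falling below a significance threshold.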

In short, this system seems to have some promise, and might in fact work.  It’s accurate, easy to implement, and fair to all content owners and consumers.


Topsy Providing a Better Twitter Search

May 27, 2009

Topsy, a new Twitter search site, launched yesterday.  Searching for “Daylife” or “Obama” gives you nice results.  The ranking is purely based on the number of times the link has been tweeted, for whatever time window you select:  hour, day, week, month, all-time.  When grouping by URLs, they resolve compressed versions from bit.ly or tinyurl.  The resolved link, along with the number of tweets, and the text of a recent tweet, are displayed.  Along the right margin, you get a list of common users and the number of times they’ve contributed to the result set.  You can also search within a particular user.

If you click on a tweet group, you’ll get a list of posts, and a list of “What’s Related”.  The quality of the related tweet groups is a bit sketchy.  Tweets are short, so there’s not much to go on.  You could imagine just pulling out the most statistically overrepresented words from a tweet group.  For the baseline you’d want to calculate word frequency from a lot of tweets, not a corpus of standard written English.  Tweets have their own language.  Take the top few words, throw them into a full-text search engine, and group the results by their resolved URL.  I’m sure there are other ways, but that should be easy and give you good results.  And only index tweets that have a URL in them.
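
A minimal sketch of the overrepresented-words step might look like this.  The scoring (an add-one smoothed log ratio of word frequencies) and the tokenizer are my choices for illustration; Topsy’s actual method is not known to me.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase, drop URLs, and split into simple word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z0-9#@']+", text)


def overrepresented_words(group_tweets, baseline_tweets, top_n=3):
    """Rank words in a tweet group by how much more often they appear
    there than in a baseline sample of tweets."""
    group = Counter(w for t in group_tweets for w in tokenize(t))
    base = Counter(w for t in baseline_tweets for w in tokenize(t))
    group_total = sum(group.values())
    base_total = sum(base.values())
    vocab_size = len(set(group) | set(base))

    def score(word):
        # Add-one smoothing keeps unseen baseline words from dividing by zero.
        p_group = (group[word] + 1) / (group_total + vocab_size)
        p_base = (base[word] + 1) / (base_total + vocab_size)
        return math.log(p_group / p_base)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(group, key=lambda w: (-score(w), w))[:top_n]


group = ["obama healthcare speech", "the obama speech", "the healthcare speech"]
baseline = ["the cat", "the dog", "the weather", "lol the game", "obama"]
print(overrepresented_words(group, baseline, top_n=2))  # ['speech', 'healthcare']
```

The resulting words would then be fed to the full-text search described above.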

As I mentioned before in a post about real-time search, it’s not terribly complicated.  But it is nicely done.  The tricky part is getting a feed of tweets from Twitter.  Once you have it, you pull every URL.  For each one, do an HTTP HEAD request to get past bit.ly, tinyurl and the like.  That will save you some time and bandwidth over downloading the entire page.  Perhaps have a short list of URL-shortening services and use those to filter.  Some of them, like bit.ly, also provide an API where you can expand the shortened version without doing the HEAD request.
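
The HEAD-request step could be sketched with Python’s standard library as follows.  The list of shortener hosts and the function names are assumptions, and the sketch follows only a single redirect hop.

```python
import http.client
from urllib.parse import urljoin, urlparse

# Hypothetical short list of URL-shortening hosts used as a filter.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}


def is_short_url(url: str) -> bool:
    """True if the URL points at a known shortening service."""
    return (urlparse(url).hostname or "").lower() in SHORTENER_HOSTS


def expand_once(url: str, timeout: float = 5.0) -> str:
    """Send one HEAD request and return the redirect target, if any.

    Only headers are transferred, so no page body is downloaded.
    """
    parts = urlparse(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in (301, 302, 303, 307, 308) and location:
            # Location may be relative, so resolve it against the request URL.
            return urljoin(url, location)
        return url
    finally:
        conn.close()


def resolve(url: str) -> str:
    """Expand only URLs from known shorteners; pass everything else through."""
    return expand_once(url) if is_short_url(url) else url
```

Filtering on the shortener list first avoids a network round trip for the many links that are already in their final form.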

Once you’ve done that, it’s all a matter of setting up some indexing.  You have the URLs, timestamps, usernames, and tweet text.  Since you’re grouping by URLs, you might also consider only indexing tweets that have links; that will reduce the size of your indices.  This is not rocket science.  It’s about having access to the data.  I was baffled by Twitter’s $15MM purchase of Summize.  Both Summize and Topsy could have been built for far less.  I suppose in the case of Summize a premium was paid to get it quickly.  According to CrunchBase, Topsy raised $15MM.  Of that, $11MM was raised last December.  I wonder why they need all that money?  Right now they list 14 people on their web site, and their product feels like a $4MM product; what they have so far is not terribly complicated even when implemented for scale.  Perhaps they have big plans for the future, or anticipate big hardware bills and low revenue in the short run.
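
To show how little is needed for the grouping and ranking described above, here is an in-memory sketch.  The tuple layout and the function name are my assumptions; a real service would use a proper index rather than a Python dictionary.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rank_links(tweets, now, window):
    """Group tweets by resolved URL and rank by tweet count within a window.

    tweets: iterable of (resolved_url, timestamp, username, text);
    resolved_url is None for tweets without a link, which are skipped.
    Returns a list of (url, tweet_count, text_of_most_recent_tweet).
    """
    groups = defaultdict(list)
    for url, timestamp, username, text in tweets:
        if url is None:
            continue
        if now - window <= timestamp <= now:
            groups[url].append((timestamp, username, text))
    # Most-tweeted first; ties broken by URL for stable output.
    ranked = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(url, len(items), max(items)[2]) for url, items in ranked]


now = datetime(2009, 5, 27, 12, 0)
sample = [
    ("http://a.example/1", now - timedelta(minutes=10), "ann", "first"),
    ("http://a.example/1", now - timedelta(minutes=5), "bob", "second"),
    ("http://b.example/2", now - timedelta(minutes=2), "cat", "other"),
    (None, now - timedelta(minutes=1), "dan", "no link"),
    ("http://b.example/2", now - timedelta(days=3), "eve", "old"),
]
last_hour = rank_links(sample, now, timedelta(hours=1))
# [('http://a.example/1', 2, 'second'), ('http://b.example/2', 1, 'other')]
```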

So although I like Topsy very much, they are dependent on Twitter for their feed of tweets, and whatever current value you might give to their enterprise, most of that value is in their access to the complete tweet feed.  Twitter controls that, and can shut it off, so they don’t seem like a good target for acquisition for anyone except Twitter.  They could help defend their value with patents, so that for example Twitter could not copy and implement it themselves, but it’s not that novel, and one could invent around it.  I don’t see any patent applications on file at the USPTO.  Or as with Summize, they might hope to fetch another premium for getting a search product out quickly.  Twitter already did that once, and I doubt they’ll do it again, since they have search expertise in-house now.  Perhaps they want to get a big user base, and hope that Twitter won’t have the guts to shut them down:  negotiating with someone who’s holding a gun to your head, and hoping they don’t want to bloody their shirt.

You don’t raise $11MM without a plan, even with all the crazy hype over real-time search.  You have to think BlueRun Ventures and Ignition Partners know what they’re doing.  That money probably put the finishing touches on their current search offering, but will largely be going towards new products that would be better targets for acquisition.  Their investors have experience with mobile technology, perhaps they have a clever way to move this onto mobile devices.  But they’ll still be twice-removed from the real revenue — text messaging subscription fees.  

Who knows?  Anyhow that’s my take on Topsy:  great site, simple technology, uncertain business model.