A Base of Mickle Might

January 8, 2010

Daylife hosted a nice Freebase workshop in December, and I just posted a piece in our corporate blog.  I really love those folks, and have been pushing them internally and with the occasional client for oh, about a year.  We also put up a nice demo that shows off integration of Freebase and Daylife data. We’re going to be putting that demo to the test shortly to see how it performs.

Perhaps more significantly, anyone can now query Freebase using our public IDs, for example if you want info on this guy.  You can also do the query in the other direction.  Cleaning up and making the Daylife authority keys in Freebase are on my short list of things to do; right now they’re spotty and don’t fully reflect the mappings we have internally.

Other random things gleaned from the talk:  MQL is pronounced “mickle”.  So now I can talk about it and sound like an insider.  Robert Cook, their co-founder, has a penchant for physics metaphors, which is right after my own heart.  Their database is a custom job written in C.  Now that’s the kind of job a lot of hard-core software engineers would love.  One of our guys, Phil Schanely, an old associate from my consulting days, wrote his own schemaless JSON-based database in Java last year and open-sourced it.  It’s fairly simple, but very flexible and useful for certain tasks, and you can run server-side JavaScript.  One has a lot of options these days for storing and indexing data.

We also had a few presentations in the evening as part of a Semantic Web meetup.  That was our third.  Unfortunately we sublet some of our space to other companies, and one of them has grabbed a good deal more space in our front area, so the large (50+) Meetups are likely over.  Although we still may be able to fit in smaller ones.  But the extra income is nice too.


50% of Twitter Posts aren’t Useless

August 17, 2009

I saw an interesting headline on DrudgeReport about Twitter being 40% pointless babble.   But the real news is that 50% aren’t pointless babble, spam, or self-promotion.  That’s pretty good signal-to-noise.


The Fallacies of Arnon Mishkin

August 14, 2009
Arnon Mishkin wrote an interesting piece about the fallacy of the link economy (http://paidcontent.org/article/419-the-fallacy-of-the-link-economy/).  Here’s one of the few pertinent facts cited in his article:
The vast majority of the value gets captured by aggregators linking and scraping rather than by the news organizations that get linked and scraped. We did a study of traffic on several sites that aggregate purely a menu of news stories. In all cases, there was at least twice as much traffic on the home page as there were clicks going to the stories that were on it.
There are a few fallacies in there.  First is the notion of value, and that phrase, “vast majority”.  Value online means selling ads, and there are premium ads, and there are remnant ads.  Aggregators get low-value remnant ads, but publishers get premium ads in some cases.  A page view for a publisher is on average more valuable than a page view for an aggregator.  How much?  The number Jarvis and his group at CUNY came up with was $5-7 RPM for a small publisher, and for an aggregator you’ll get less.  In the case of Digg, say around $2 RPM based on numbers from Silicon Alley Insider (http://www.businessinsider.com/2008/12/diggs-miserable-business) at 400 million pageviews per month.  That’s probably on the high end for aggregators; most are getting less.  Let’s say publishers have three times the revenue per view of aggregators, which I think is conservative.
Second, there is his claim that “twice as much traffic on the home page as there were clicks going to the stories”.  This is misleading.  He’s comparing pageviews with visitors.  But a visitor generates at least one and often more pageviews.  Let’s say each visit leads to an average of 4 page views; that’s about average for news publishers.  Cranking through the numbers, that’s 4*(1/2)*3 = 6 times more revenue for publishers than for the aggregator.  So where is his “vast majority”?
There’s another fallacy buried in there:  that this is a zero-sum game between aggregators and publishers, and demand is constant.  You might conclude from the above numbers that aggregators are taking one-seventh of the revenue away from news publishers.  But demand for news is elastic.  It’s conceivable that aggregators are driving more traffic to publishers than they would get without them.  Not likely in my opinion, but the point is that if they’re taking one-seventh of the revenue, they’re probably creating some new revenue as well by increasing demand for news.
So Arnon’s facts just don’t support his claims, and his reasoning is flawed.  But let’s not stop here.  Look at the balance sheet of aggregators.  They’re not getting rich, and perhaps only a few fools went into this industry thinking it would be lucrative.  I certainly didn’t have any illusions about lucrative pay.  Some companies have had lucrative buyouts, but then some handsome fees were paid to purchase newspapers a few years ago.  Every industry has its bubble.  You can point to Google, but Google makes their money elsewhere, not aggregating news.  They’re barely running any ads.  Digg as I posted before isn’t doing well, although probably because their costs are out of control. (http://www.businessinsider.com/2008/12/diggs-miserable-business)
All that being said, I still agree with his final three points.  However reclaiming value from aggregators isn’t going to help them much.  They need subscribers and a pay wall.  Not an iron curtain, but a permeable pay wall along the lines of the Wall Street Journal.  There’s no pot-o-gold out there in the hands of aggregators to help you pay for all that good journalism.

Arnon Mishkin wrote an interesting piece about the fallacy of the link economy.  Jeff Jarvis already responded in detail, but he’s a bit too congenial.  I’m a numbers person, so I’m more blunt.  Here’s one of the few pertinent facts cited in Arnon’s article:

The vast majority of the value gets captured by aggregators linking and scraping rather than by the news organizations that get linked and scraped. We did a study of traffic on several sites that aggregate purely a menu of news stories. In all cases, there was at least twice as much traffic on the home page as there were clicks going to the stories that were on it.

There are a few errors in there.  First is the notion of value, and that phrase, “vast majority”.  Value online means selling ads, and there are premium ads, and there are remnant ads.  Aggregators mostly get low-value remnant ads, but publishers get premium ads in some cases.  A page view for a publisher is on average more valuable than a page view for an aggregator.  How much?  The number Jarvis and his group at CUNY came up with was $5-7 RPM for a small publisher.  For a major aggregator like Digg, it’s around $2 RPM based on numbers from Silicon Alley Insider, at 400 million pageviews per month.  That’s on the high end for aggregators; most are getting less.  So let’s say publishers have three times the revenue per view of aggregators, which I think is conservative.

Next, there is his claim that he saw “twice as much traffic on the home page as there were clicks going to the stories”.  This is misleading.  He’s comparing pageviews with visitors, and those aren’t equal.  A visitor generates at least one and often more pageviews.  Let’s say each visit leads to 3 page views; that’s about average for news publishers, although you might argue that traffic from aggregators is less likely to stick around.  Also, news outlets generate their own traffic; it doesn’t all come through aggregators.  For the NYTimes about half comes from other referers, only some of which are aggregators.  So there’s another factor of two.  Cranking through the numbers, that’s 3*(1/2)*3*2 = 9 times more revenue for publishers than for the aggregator.  So is that a “vast majority” of the value?  To me a majority is more than 50%; let’s peg a “vast” majority at somewhere in excess of 75%.  Even allowing for some errors, and I’d have to be off by a lot, aggregators aren’t getting anywhere near 75% of the revenue from online news.

There’s another fallacy buried in there:  that this is a zero-sum game between aggregators and publishers, and demand is constant.  You might conclude from the above numbers that aggregators are taking one-tenth of the revenue away from news publishers.  But demand for news is elastic.  It’s conceivable that aggregators are driving more traffic to publishers than they would get without them.  Not likely in my opinion, but the point is that if they’re taking one-tenth of the revenue, they’re probably creating some new revenue as well by increasing demand for news.
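
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the estimate above.  The four inputs are the rough figures from this post, not measured data, and the variable names are just labels chosen for illustration.

```python
# Back-of-the-envelope inputs taken from the estimate above (assumptions, not measurements).
pageviews_per_visit = 3     # page views generated by each referred visit
click_through = 1 / 2       # story clicks per aggregator home-page view
rpm_ratio = 3               # publisher revenue per view relative to an aggregator
own_traffic_factor = 2      # about half of publisher traffic comes from elsewhere

# Publisher revenue expressed as a multiple of aggregator revenue.
publisher_vs_aggregator = (
    pageviews_per_visit * click_through * rpm_ratio * own_traffic_factor
)

# Aggregator share of the combined revenue.
aggregator_share = 1 / (1 + publisher_vs_aggregator)

print(f"publisher revenue is {publisher_vs_aggregator:g}x aggregator revenue")
print(f"aggregator share of the total: {aggregator_share:.0%}")
```

Changing any one input shows how far off the estimate would have to be before aggregators come close to a “vast majority”.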

So I don’t buy Arnon’s argument.  But let’s not stop there.  Look at the balance sheet of aggregators.  They’re not getting rich, although perhaps a few fools went into this industry thinking it would be lucrative.  I certainly didn’t have any illusions.  Some companies have had lucrative buyouts, but then some handsome fees were also paid to purchase newspapers a few years ago.  Every industry has its bubble.  You can point to Google, but Google makes their money elsewhere, not aggregating news.  They’re barely running any ads.  Digg as I mentioned earlier isn’t doing well, although probably because their costs are out of control.

All that being said, I still agree in principle with his final three points.  However reclaiming value from aggregators isn’t going to help publishers much.  They need subscribers and a pay wall.  Not an iron curtain, but a permeable pay wall along the lines of the Wall Street Journal.  There’s no save-my-business-model pot of gold out there in the hands of aggregators to help you pay for all that good journalism.


Gartner Hype Cycle Heading Towards Trough of Disillusionment

August 13, 2009

I was reading a post on the latest Gartner Hype Cycle for emerging technologies, and it occurred to me that Hype Cycles are heading towards the  Trough of Disillusionment, and don’t actually have a Plateau of Productivity.


The WSJ’s Permeable Pay Wall, Part 2

August 12, 2009

I dug up some additional information on their pay wall, from an excellent interview with Nieman Journalism Lab back in April.  It explains some of their method behind my observations in an earlier post:

  • Politics, arts, opinion, and breaking news are all free
  • Very popular articles are free
  • Exclusives that will just be repeated elsewhere (“WSJ is reporting…”) are free

The WSJ’s Permeable Pay Wall

August 12, 2009

I was reading a post from Jeff Jarvis on Rupert’s pay wall, when it occurred to me that not everyone knows all of the details of how it works.  It’s not as hard of a pay wall as the Times Select put up a few years back.  Both walls cover only some of the content, but there are a few situations where the WSJ lets anyone in.

The WSJ content is either flagged as Subscriber Content, which has a small key next to the headline, or is open to anyone.  However Subscriber Content is available to anyone if you are referred from one of the following sites:

  • Google (news or search, anything Google.com, but not GMail)
  • MySpace
  • Digg
  • Marketwatch
  • Barrons (online.barrons.com)

I checked a host of other aggregators and search engines:  Bing, MSN, Yahoo, Slashdot, StumbleUpon, Twitter, Mixx, Newsvine, Facebook, Fark, Reddit, Drudge Report, and so on.  Nothing else hit.  MySpace, Marketwatch, and Barrons are easy to explain; they’re all News Corp. properties.  The New York Post, and other News Corp. properties, don’t get a pass.  Digg is probably the lone aggregator because it drives a lot of traffic, and was head-and-shoulders above similar sites back when they set the pay wall rules.  Those rules could probably use some revisiting.  Why Google but not Bing?  They don’t have a special relationship with Google, so I don’t see a reason to discriminate.
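
The behavior observed above can be sketched as a simple referrer check.  This is only a guess at the logic in Python, not the WSJ’s actual implementation: the domain list is built from the sites that worked in my test, the GMail hostname is an assumption, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allow-list built from the observations above; the real rules are not public.
ALLOWED_REFERRER_DOMAINS = (
    "google.com", "myspace.com", "digg.com", "marketwatch.com", "barrons.com",
)
# Assumed hostname for GMail, which did not get a pass.
BLOCKED_HOSTS = ("mail.google.com",)


def is_open_to_reader(is_subscriber_content, is_subscriber, referrer):
    """Return True if the article would be shown in full to this reader."""
    if not is_subscriber_content or is_subscriber:
        return True
    host = (urlparse(referrer or "").hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return False
    # Match the domain itself or any subdomain (news.google.com, online.barrons.com).
    return any(
        host == domain or host.endswith("." + domain)
        for domain in ALLOWED_REFERRER_DOMAINS
    )
```

For example, `is_open_to_reader(True, False, "http://news.google.com/")` lets the reader in, while the same call with a Bing referrer does not.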

Occasionally you will see something labeled Subscriber Content which has been opened to everyone.  For example, I found a popular WSJ article on Twitter that is labelled Subscriber Content which can be read in full even if you aren’t a subscriber.  How that happens I’m not sure.  Perhaps if a subscriber shares a link for some protected content, they might decide to make it free, either automatically or by alerting a human.  The Drudge Report frequently links to WSJ articles, but surveying a few days of links I couldn’t find examples of Subscriber Content being linked to.  Although I would wager that if Drudge did link to Subscriber Content, they’d open it up.

So the WSJ has a permeable pay wall.  Jeff’s point about not getting as much Googlejuice is a good one, but I doubt the WSJ is losing out on much.  They’ve probably had a high rank from the day Google launched, and enough links around the web that their rank won’t be falling anytime soon.  They certainly lose out on some traffic by not giving bloggers and newer social networking sites the same pass as Digg and MySpace.

How much are they losing?  Who knows.  Looking at referer data from compete.com and alexa.com, they don’t look too different from their peers.  Their traffic has been steadily increasing.  In fact, they’ve done a better job at adding pageviews than the NYTimes, which is down based on Alexa stats, and flat based on compete.com.

What does that tell me?  There are other important drivers of web traffic than just whether you have a pay wall.  Hmm, maybe it has something to do with quality, and Rupert is on to something.


A Peek at the Kosmix Backend

June 26, 2009

As you probably know I’m always looking at how companies in our space do their stuff.  A few days ago Ted Dziuba provided some interesting information on the Kosmix backend, which I found through Josh Young’s nice “What I’m Reading” feed:

…Kosmix… wrote its own data store in C++.  It’s basically a clone of Google’s GFS…

After I edited out the snark and obscenities, that was all that’s left of his post.  The rest of it, calling them a search engine, would only rankle the Kosmix co-founders.  Dziuba is an interesting read, and he has a good sense of humor.  But he’s never going to make it as a pro linebacker.  He’s just too small and weak.  He needs to spend less time at the keyboard and more at the gym, and get his hands on some steroids.


Cheap EVRI Knockoff on the Huffington Post

June 25, 2009

Is it just me, or does the connections widget on the left column of this page look suspiciously like EVRI’s widget?  When I saw it, I thought it was EVRI’s work.  The colored circles, the topic name in a box on top of the circle, lines between them, the size ratio, the line color, all the same.  But there’s no Evri branding, and it doesn’t make a call to the Evri site to fetch data as the WashPo installation does.  So I don’t think it’s EVRI.  They’re using OpenCalais for local news, so it could be their tagging, or it could be human tagging with a simple database to power the widget.  Who knows, but it would be better if they were a bit more distinctive in the presentation.  It is confusingly similar to EVRI’s widget.

Judge for yourself.  Here’s a screen shot of the HuffPo widget:

huffpo-evri-widget

And here’s the Evri widget as it appears on their site:

evri-widget

Samuel Clay, one of the crew here at Daylife, says EVRI uses SVG/Raphael, and the HuffPo is just images, so it is a different technology and almost certainly not EVRI’s work.


Harvard BSchool Twitter Study and Bad Statistics

June 5, 2009

Harvard Business School put out some interesting statistics back on June 1.  They raise some interesting questions, but the statistics have some issues, so they should be taken with a grain of salt.

The first is their comparison with Wikipedia editors.  They state that:

…the pattern of contributions on Twitter is more concentrated among the few top users than is the case on Wikipedia, even though Wikipedia is clearly not a communications tool. This implies that Twitter’s resembles more of a one-way, one-to-many publishing service more than a two-way, peer-to-peer communication network.

However the comparison is based only on Wikipedia editors, not those that use Wikipedia.  Just as their study indicates many on Twitter merely follow, many Wikipedia users only read articles and do not contribute as editors.  A fair comparison would exclude some fraction of Twitter users.

The gender biases are also interesting.  However, they limit the gender study to “strongly gendered names”.  Furthermore, as a coworker of mine pointed out, women may be less likely to expose their real name online.  So their gender sampling method may be biased.  Would it invalidate their conclusions?  Perhaps not; that would require a bias of at least the order of the gender bias they measure (10% or so).  But they don’t indicate what fraction of users could be assigned a gender, nor did they investigate what fraction of the population can be gender-typed using their list.  The latter would be easy to do with census data.

Studies like this are never perfect, but I generally expect those that conduct studies to call out possible biases in their sampling, and report more fully on the data and their methodology.  I would discard their conclusions about Twitter as a one-way publishing platform as unsupported by their data, and would take their gender bias numbers as a rough estimate with potentially significant biases.


Newspapers Considering the ASCAP Model For Online Payments

June 4, 2009

This morning the WSJ reported that at a private meeting of newspaper executives last week, one of the models they considered for extracting revenue online was ASCAP, the American Society of Composers, Authors, and Publishers.  (Page A11, or online here).  That’s great, but there are a few technical considerations, and you can do a lot better than ASCAP.  So I thought I would weigh in with some concrete suggestions about how such a system might be implemented so that it is accurate and fair.

First, some background.  I was the Director of Research at MediaSentry, where we provided music companies with intelligence about online piracy, attempted to interdict or frustrate would-be downloaders, and collected evidence for civil actions.  So I know quite a bit about monitoring and protecting online content.  Also, through a composer I’m acquainted with, I’ve heard a good deal about the drawbacks of ASCAP. 

ASCAP pays composers according to a formula based on the amount of play-time for their songs.  That play-time includes things such as bars, and just about anything except humming it in the shower.  On the other end of the cash flow, bars, radio stations, television programs, all pay into ASCAP based on estimates of the size of the audience.  The tricky part is figuring out how much each composer gets paid. That requires a lot of sampling, since your neighborhood bar doesn’t report what songs they’re playing.  That just requires listening to some radio stations and spending time in bars, or getting a few of them to self-report.  The sampling is such that small composers tend to get nothing, because they are below ASCAP’s threshold of statistical significance.  

How do you build a system that is accurate and fair?  How would sampling work for online news, and how would you price membership for participants?  You just can’t get reliable pageview information.  You could ask for it, but it would have to be taken on good faith, and that’s no way to determine payments.  You can’t even get reliable traffic estimates for a site.  Alexa and Compete.com aren’t accurate enough, and won’t distinguish between a site’s own content and content they’ve taken from elsewhere.  There’s the Nielsen Ratings system, but read the criticisms on their Wikipedia entry before you seriously consider replicating their sampling method.  

One practical system would be to put a callback on every article with an identifier for the site.  That callback reports to the newspaper’s version of ASCAP that the article has been viewed.  It’s basically the same as how Google Analytics works.  It can be defeated client-side, by simply blocking the request, but there is no incentive for consumers to do so, since it only impacts payments between other parties and does not interfere with their web experience.  Sampling would not be required, as nearly complete information is available.
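
As a rough illustration of the callback idea, here is a minimal Python sketch of both ends: building the callback URL that a page would request, and tallying it at the collector.  The endpoint, the parameter names, and the site and article identifiers are all hypothetical; a real deployment would embed the request in the page, much as analytics scripts do.

```python
from collections import Counter
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical collection endpoint run by the newspapers' clearing house.
COLLECTOR = "https://collector.example.org/view"


def callback_url(site_id, article_id):
    """Build the URL a page would request when an article is viewed."""
    return COLLECTOR + "?" + urlencode({"site": site_id, "article": article_id})


def record_view(tally, url):
    """Collector side: count one view for the (site, article) pair in the URL."""
    params = parse_qs(urlparse(url).query)
    tally[(params["site"][0], params["article"][0])] += 1


tally = Counter()
for _ in range(3):
    record_view(tally, callback_url("example-aggregator", "article-12345"))
record_view(tally, callback_url("example-blog", "article-12345"))
print(tally)
```

Because every view produces its own record, the tally is a count rather than a sample.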

There are a few loopholes.  How would you count full-text RSS feeds?  Triggering the same callback method for every feed request would overcount, and there is no mechanism for triggering callbacks in the reader.  There are many different types of RSS consumers.  Here I think you would make a different callback and discount RSS ingestion based on some agreed-upon factor, say 10%.  One would need to run a study to determine, given a certain number of RSS feed requests, how many result in reading or browsing an article.   What about other devices?  The Amazon Kindle?  Here again, you need a separate type of callback, with its own factor.  APIs?  Easy, just do a callback server-side whenever an article is fed out, but this is less easy to verify.  
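
The per-channel discounting can be sketched in a few lines of Python.  Only the 10% RSS figure comes from the text above, and even that is a placeholder until someone runs the study; the other weights and the channel names are assumptions made up for the example.

```python
# Weight applied to each raw callback, by delivery channel.
# Only the RSS figure is from the text; the rest are illustrative assumptions.
CHANNEL_WEIGHTS = {
    "web": 1.0,     # client-side callback, one per page view
    "rss": 0.10,    # full-text feed requests, discounted by the agreed factor
    "kindle": 0.5,  # hypothetical factor for device delivery
    "api": 1.0,     # server-side callback whenever an article is fed out
}


def effective_views(raw_counts):
    """Convert raw callback counts per channel into comparable article views."""
    return sum(CHANNEL_WEIGHTS[channel] * n for channel, n in raw_counts.items())


print(effective_views({"web": 1000, "rss": 5000}))
```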

The newspaper version of ASCAP also needs verifiability.  This is relatively easy.   Sites can be crawled, and the presence of a client-side callback function verified.  In the same way, you can look at the source of any web page and determine whether they are using Google Analytics: the JavaScript code is in plain view.  For APIs and RSS feeds, it’s not as easy to verify.  However one can do a simple statistical analysis using a method similar to how lock-in amplifiers are used to extract signals in the presence of even very high noise (other users).  Make RSS or API calls yourself with some fixed modulation, at a level far below their overall usage.  That it can be so far below is the magic of the lock-in amp.  Confirm the presence of the same signal in the callbacks you receive.  It’s relatively easy to implement, and the statistics are bomb-proof.  Do it randomly, and relatively infrequently, to verify compliance.  For full-text RSS and API methods of delivery, you’ll need to warn participants to make the server-side callback within a second of the request, and not cache that request.  Yes, it’s being a bit anal, but just the presence of a verification mechanism will keep everyone in line. 
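
Here is a small Python simulation of the lock-in idea, as a sketch under stated assumptions rather than a production design.  The background traffic is modeled as Gaussian noise around 1000 callbacks per time bin, the probe is 5 extra requests per bin switched on and off in a square wave, and every number and function name is invented for the illustration.

```python
import random


def reference(n_bins, period):
    """Square-wave reference: +1 in the first half of each period, -1 in the second."""
    return [1 if (t % period) < period // 2 else -1 for t in range(n_bins)]


def lock_in_estimate(callbacks, ref):
    """Estimate the probe size per 'on' bin by correlating callbacks with the reference."""
    return 2 * sum(c * r for c, r in zip(callbacks, ref)) / len(ref)


def simulate(n_bins, period, probe, compliant, rng):
    """Callbacks received per bin: noisy background plus our probe if it is reported."""
    ref = reference(n_bins, period)
    counts = []
    for r in ref:
        background = rng.gauss(1000, 30)  # other users' traffic, i.e. the noise
        ours = probe if (r > 0 and compliant) else 0
        counts.append(background + ours)
    return counts, ref


rng = random.Random(42)
honest, ref = simulate(20000, 20, 5, True, rng)
cheater, _ = simulate(20000, 20, 5, False, rng)

honest_estimate = lock_in_estimate(honest, ref)
cheater_estimate = lock_in_estimate(cheater, ref)
print(f"compliant site:     {honest_estimate:.2f} probe callbacks per 'on' bin")
print(f"non-compliant site: {cheater_estimate:.2f} probe callbacks per 'on' bin")
```

The probe is half a percent of the background and well below the bin-to-bin noise, yet averaging over many bins separates the site that reports it from the site that does not.  The uncertainty shrinks as the square root of the number of bins, which is why the probe rate can stay so low.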

With all of that readership data in hand, you can both accurately bill participants in the system, and accurately pay content owners.  It doesn’t have the sampling issues of ASCAP that tend to short-change smaller contributors.  You could even bill based on actual usage, instead of signing a contract upfront and calculating fees with an obscure formula, as ASCAP does.  
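
To make the payment side concrete, here is a minimal Python sketch that splits a payment pool in proportion to recorded views.  The pool size, the owner names, and the strictly proportional rule are assumptions for illustration; a real system would presumably negotiate its own formula.

```python
def split_pool(pool, views_by_owner):
    """Divide a payment pool among content owners in proportion to their views."""
    total = sum(views_by_owner.values())
    if total == 0:
        return {owner: 0.0 for owner in views_by_owner}
    return {owner: pool * views / total for owner, views in views_by_owner.items()}


# Hypothetical month: one large owner and one very small one.
payouts = split_pool(1000.0, {"big-daily": 9990, "small-weekly": 10})
print(payouts)
```

With complete counts there is no significance threshold to fall under, so the small owner is paid for its ten views instead of receiving nothing.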

In short, this system seems to have some promise, and might in fact work.  It’s accurate, easy to implement, and fair to all content owners and consumers.