The WSJ’s Permeable Pay Wall

August 12, 2009

I was reading a post from Jeff Jarvis on Rupert’s pay wall, when it occurred to me that not everyone knows all of the details of how it works.  It’s not as hard a pay wall as the one Times Select put up a few years back.  Both walls cover only some of the content, but there are a few situations where the WSJ lets anyone in.

WSJ content is either flagged as Subscriber Content, which has a small key next to the headline, or is open to anyone.  However, Subscriber Content is available to anyone who is referred from one of the following sites:

  • Google (news or search, anything Google.com, but not Gmail)
  • MySpace
  • Digg
  • Marketwatch
  • Barrons (online.barrons.com)

I checked a host of other aggregators and search engines:  Bing, MSN, Yahoo, Slashdot, StumbleUpon, Twitter, Mixx, Newsvine, Facebook, Fark, Reddit, Drudge Report, and so on.  Nothing else hit.  MySpace, Marketwatch, and Barrons are easy to explain: they’re all News Corp. properties.  The New York Post and other News Corp. properties, however, don’t get a pass.  Digg is probably the lone aggregator because it drives a lot of traffic, and was head-and-shoulders above similar sites back when they set the pay wall rules.  Those rules could probably use some revisiting.  Why Google but not Bing?  They don’t have a special relationship with Google, so I don’t see a reason to discriminate.
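For the curious, here is a rough sketch of the kind of referrer check that would produce the behavior I observed.  This is only my guess at the rule, not the WSJ’s actual code.  The domain list comes from my own testing above, and the function and variable names are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allowlist, inferred only from the referrers that worked in my tests.
ALLOWED_DOMAINS = ("google.com", "myspace.com", "digg.com",
                   "marketwatch.com", "barrons.com")
# Gmail lives under google.com but did not open the wall.
EXCLUDED_HOSTS = ("mail.google.com",)


def referrer_opens_wall(referrer):
    """Return True if a request with this Referer value would get the full article."""
    host = (urlparse(referrer).hostname or "").lower()
    if host in EXCLUDED_HOSTS:
        return False
    # Match the domain itself or any subdomain of it (news.google.com, online.barrons.com).
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

A check like this is trivially easy to widen, which is part of why the short list of favored sites looks more like an oversight than a policy.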

Occasionally you will see something labeled Subscriber Content which has been opened to everyone.  For example, I found a popular WSJ article on Twitter that is labelled Subscriber Content which can be read in full even if you aren’t a subscriber.  How that happens I’m not sure.  Perhaps if a subscriber shares a link for some protected content, they might decide to make it free, either automatically or by alerting a human.  The Drudge Report frequently links to WSJ articles, but surveying a few days of links I couldn’t find examples of Subscriber Content being linked to.  Although I would wager that if Drudge did link to Subscriber Content, they’d open it up.

So the WSJ has a permeable pay wall.  Jeff’s point about not getting as much Googlejuice is a good one, but I doubt the WSJ is losing out on much.  They’ve probably had a high rank from the day Google launched, and enough links around the web that their rank won’t be falling anytime soon.  They certainly lose out on some traffic by not giving bloggers and newer social networking sites the same pass as Digg and MySpace.

How much are they losing?  Who knows.  Looking at referrer data from compete.com and alexa.com, they don’t look too different from their peers.  Their traffic has been steadily increasing.  In fact, they’ve done a better job at adding pageviews than the NYTimes, which is down based on Alexa stats, and flat based on compete.com.

What does that tell me?  There are other important drivers of web traffic than just whether you have a pay wall.  Hmm, maybe it has something to do with quality, and Rupert is on to something.


Google to Help Readers Dig Into Stories

May 22, 2009

A close reading of the Financial Times interview with Google CEO Eric Schmidt revealed an interesting remark (emphasis mine):

we are very interested in trying to develop online news versions that somehow address the immediate needs of people and for which advertising works better. Without commenting specifically about products it seems to me that the newspaper that I read online should remember what I read. It should allow me to go deeper into the stories. It’s that kind of a discussion that we’re having.

Google will be helping readers go deeper into stories.  That’s a crowded space, so move over everyone and make way for the 800lb gorilla.  We’ll see what they come up with.  I suspect, based on things I’ve heard, that if they do such a thing they’d give it away for free, although that would probably mean sharing advertising revenue.  

It’s also probably a naive goal.  Most readers don’t like to go deeper into stories.  Most of them don’t get through the full article.  There’s nothing wrong with that.  I would venture that if more than 20% of readers made it to the end of an article, it should have been longer.  You had their attention, and you probably had more material you could have included.  No, I think most readers, instead of going deeper, are interested in going sideways.  That’s one reason I find Wikipedia so addictive: there are so many paths that let you skip sideways to a loosely related topic.  But perhaps I’m over-analyzing Schmidt’s arguments.  I sure would like to dig deeper into them.

Even if nothing comes of this, or it’s stillborn, they’re chipping away at their image as an enemy of publishers.


Twitter and the Hype Around Real-Time Search

May 18, 2009

I came across an article on CNN about new search engines that aspire to supplement Google.

And sites like Twitter are trying to capitalize on the warp-speed pace of online news today by offering real-time searches of online chatter — something Google’s computers have yet to replicate…  If you search Google news, the results will be recent, but not live. That’s where Twitter’s search comes in. It searches the site’s micro-blog posts by the second, allowing users to see what’s buzzing on the Web at any instant. 

Wow, Twitter/Summize can sort matches by date and return the most recent ones.  First of all, Google’s computers can replicate that trivially, and they already offer real-time search of my Gmail and chats.  Sorting by date, most recent first, is far simpler than looking at relevance and removing duplicates, which is what Google News and their web search are doing.  It’s actually trivial to implement.  The only hard part is scaling it up to Twitter size, and that’s only hard in the sense that it’s a lot of work, since it’s a well-beaten path.  Summize did it with a very small crew.  

Google does offer the ability to sort by date.  However, it takes some time for them to acquire content.  Twitter is a bit easier: everyone posts messages to a central system, and the total amount of text you need to index is fairly small.  Back in November they reported about 15 to 20 tweets per second.  Traffic via compete.com has gone up by a factor of 6 since then, so I figure they’re doing around 10 million tweets per day.  Call it 100 bytes per tweet, and it’s around a gigabyte per day that needs to be indexed.  The inverted index is simple: for every word you keep a date-sorted list of occurrences, along with position to handle quoted phrases.  
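To show how little is involved, here is a minimal sketch of that kind of index.  It is my own toy version, not anything Twitter or Summize has published.  The class and method names are invented, the tokenizer is a bare whitespace split, and it assumes tweets are added in time order so that each postings list stays date-sorted for free.

```python
from collections import defaultdict


class RecencyIndex:
    """Toy inverted index: word -> date-ordered list of (timestamp, tweet_id, position)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, tweet_id, timestamp, text):
        # Assumes tweets arrive in time order, so appending keeps each list date-sorted.
        for pos, word in enumerate(text.lower().split()):
            self.postings[word].append((timestamp, tweet_id, pos))

    def search(self, phrase, limit=10):
        """Return ids of tweets containing the exact phrase, most recent first."""
        words = phrase.lower().split()
        if not words:
            return []
        # For each word after the first, the set of (tweet_id, position) pairs where it occurs.
        where = [
            {(tid, pos) for _, tid, pos in self.postings.get(w, [])}
            for w in words[1:]
        ]
        hits, seen = [], set()
        # Walk the first word's postings newest-first and check the following positions.
        for _, tid, pos in reversed(self.postings.get(words[0], [])):
            if tid in seen:
                continue
            if all((tid, pos + i + 1) in s for i, s in enumerate(where)):
                seen.add(tid)
                hits.append(tid)
                if len(hits) == limit:
                    break
        return hits
```

Because results come straight off the tail of a postings list, there is no relevance scoring or duplicate removal to do, which is the whole point.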

They offer a few other options, like user and location, which you’ll also have to index.  You could fit a week’s worth of traffic into memory on a relatively cheap box.  Relatively old data can be kept on disk.  You could store an index of all tweets on one machine, so managing data replication is simple — everyone gets a copy of everything.   At some point you’ll have to partition the data, which you can do by date, or randomly and have them execute in parallel.  
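The random-partition option can be sketched in a few lines.  Again this is an illustration under my own assumptions: the shard assignment by id is a stand-in for random placement, and each shard is assumed to return its matches as (timestamp, tweet_id) pairs already sorted newest-first.

```python
import heapq
from itertools import islice


def assign_shard(tweet_id, n_shards):
    # Stand-in for random placement: spread tweets evenly by id.
    return tweet_id % n_shards


def merge_shard_results(shard_results, limit=10):
    """Merge per-shard result lists (each newest-first) into one newest-first list.

    heapq.merge with reverse=True expects every input sorted largest to smallest.
    """
    merged = heapq.merge(*shard_results, reverse=True)
    return list(islice(merged, limit))
```

Each shard does the same cheap lookup on its slice of the data, and the front end only has to interleave a handful of already-sorted lists.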

It looks like Twitter has trimmed their search index aggressively.  At the moment, I see no search results prior to April 29.   On April 29 however I start to see things again.  Note that by the time you read this, both results may return nothing.   Searching for “Daylife” in the Daylife feed also stops 18 days ago.  So they seem to have 20 days in their index.  I’ll bet that works out to around 32GB for their index, and they’ve stuffed that much memory onto a single box which they’ve replicated.  Or perhaps a couple of boxes with less memory and the data spread randomly across them.  For older Tweets, they haven’t bothered with a slower disk-based search option, or with the hardware costs necessary to stuff it all into memory.  

It makes sense.  They’ve kept it simple to focus on the most important feature: quickly searching only the most recent tweets.  But real-time search is nothing tricky, nor is it anything new.  The only new thing here is the popularity of Twitter.


Google to Designers: Data Talks, Bullshit Walks

March 25, 2009

So some design folks have been quitting Google.  Here’s an interesting complaint from Doug Bowman’s blog:

When a company is filled with engineers, it turns to engineering to solve problems. Reduce each decision to a simple logic problem. Remove all subjectivity and just look at the data. Data in your favor? OK, launch it. Data shows negative effects? Back to the drawing board.

Ahh, designers, I love them and the pretty, pretty, prolix little widgets with crappy data they send around to each other.  I suppose someone on the other side of the argument might say:

When a company is filled with designers, it turns to design to solve problems. Reduce each decision to a simple gut check. Remove all objectivity, gloss over data problems, and just go with how you feel. Looks good? Ok, launch it. Looks bad? Back to the drawing board.

It’s a common argument.  Both are caricatures of course, and the above statements are true of only a very small minority.  I remember once, a company (nameless) I consulted with was hiring a new designer, so the CEO asked me to sit in on a meeting with the guy he wanted to hire, to see what I thought of him.  At some point in the conversation, I mentioned that my four favorite sites were Wikipedia, Craigslist, Drudge Report, and Google Search, all of which were examples of excellent and simple design (that’s right, excellent design), and that Wikipedia had a way of sucking me in with all of the great, connected data.  I could lose hours there.  He didn’t like those sites, said they were poorly designed, and that Wikipedia was not particularly engaging.  I was going to point out Alexa and Compete.com traffic stats, the objective data so to speak, but I started to get another David Banner moment and was about to tear the conference room down to the studs.  So I took a deep breath and shut up, and wrote him off as a nice, well-intentioned, but misguided designer, and not the sort that would ever be deterred in his belief.  I was also outnumbered.  

I’m glad to see that Google puts data first, and holds designers to a high standard.  They have the resources to be rigorous.  Are they too strict, as Bowman suggests?  Perhaps, but perhaps not.  There are some interesting stories out there, but Google is a big company, and a few anecdotes from a handful of partisans are not sufficient; I need more data points.  Data talks, bullshit walks. 


Please, Someone Sue Google News in an American Court

March 2, 2009

There is an interesting article in the New York Times today that discusses excerpting by online news sites.  I’ll avoid quoting it heavily; if you want to know what it says, follow the link.

Two of the companies mentioned, All Headline News and the Huffington Post, are giving the industry a bad name.  The former is being sued by the Associated Press, which claims that AHN does no original reporting and merely rewrites content provided by the AP.  Perhaps in response, AHN is now looking for citizen journalists nationwide.  It all seems questionable to me.  How do we, or indeed AHN, know that these citizen journalists are doing original reporting, and not rewriting others’ content?  What exactly are the credentials of AHN and to what standards are their contributors held?  Will they use citizen journalism as a license to steal, a job where a poor man’s version of Jayson Blair would thrive instead of being censured?  As for the Huffington Post, I’ve already said their excerpting policy seems excessive.

Both of these companies seem parasitic, although the Huffington Post otherwise offers good blogs.  I suppose it’s up to the content owners to determine whether whatever utility they provide is worth the blood they suck.  It worries me, because Daylife and plenty of other useful companies could get caught in the crossfire.  Not every symbiont is a parasite, so don’t kill all of the insects, just the mosquitoes.

This is one reason I deeply regret that the GateHouse Media vs. New York Times Corp case didn’t go to court.  Building up good case law requires good cases, and All Headline News is not a good case, nor would the Huffington Post be one.  Nor would the defendants in either case have sufficient funds to retain a top-notch legal team, whereas The New York Times Corp did.  The only other deep pocket that might provide a good case?  Google News.  Please, someone sue them in an American court over excerpting.

Don’t get me wrong, I like Google News, and I don’t want them to lose.  I want good case law.


NewsShow: Google Free News Banner

February 4, 2009

Google News just launched a new widget, NewsShow, along with a nice Wizard, that lets you embed headlines into a web page.  It’s fairly simple, in Google’s style.  You select some high level topics and a search phrase, and it will slowly flip through the results.  Not a product per se, no advertisements, but it’s a nice free service, and reliable Google quality.  If it were any other company, probably no one would take notice, but since it’s Google, it is interesting.  Also, no blogs, or an option to include them, which I still don’t understand.  There seems to be a wall between what they consider to be blogs and what they consider to be news, and it reflects things other than whether it’s a blog.  For example, some blogs hosted by large outlets are filed under news and are not available in their blog search index.


Google and Search Relevance

January 9, 2009

Daniel Tunkelang gave a nice talk at Google yesterday on search relevance.  He posted the slides online.  I wasn’t at the talk, but know Daniel from some previous run-ins.  

I particularly enjoyed his comparison between Google and McDonald’s.  I drew a connection between those two companies over at Jeff Jarvis’s blog a while ago for a slightly different reason.  Jeff is a sharp guy, although he may have gone off the deep end with his new book, What Would Google Do?  I’ll have to buy it and find out.  Those things don’t sell themselves, so one expects to choke down some hyperbole, but alluding to a similarity between Google and Jesus crosses over into burlesque.  It’s just a company.  A clever one, for sure, but history is replete with clever companies and the small revolutions they caused.

The comparison between Google and McDonald’s on the other hand has some merit.  Both are innovative companies that ought to be admired for some of what they’ve done, but neither can be accused of increasing the public’s appetite for fine food or HCIR (haute cuisine information retrieval).  Twenty or thirty years ago McDonald’s was using census data and aerial surveys to determine the best location for franchises: demographics, traffic patterns, and suburban expansion.  Then there’s writing the book on fast food and franchises. 

Daniel has a few criticisms of Google’s search engine.  Consider his example on slide 23, where he searches for “IR” and demonstrates the poor results.  Just about any Googler will know that ” ‘Information Retrieval’ ” would be a better search phrase, and if they don’t they’ll learn it soon enough.  So I think this first example is not quite fair.  The example on slide 31, a search for “Steve Pollitt”, is a better criticism.  He might have added a term for research or publications, but the results would not be much improved.  Or he might do exactly what he did: go to Rexa.info where he can get an experience tailored to academic researchers.  It would be wise for Rexa to allow Google to spider their content, as LinkedIn and ACM do.  I wonder whether the best solution is when Rexa or a site like them appears high in Google’s results list, and the entire experience can be tailored to the domain, for example the sidebar that lists co-authors.

So for a one-size-fits-all search engine, as Google’s web search aims to be, I’m not convinced faceted navigation is a good option.  In the case of Steve Pollitt, I don’t trust algorithms to accurately distinguish individuals.  Witness spock.com.  Faceting along other lines might wall me off from good information.  It’s a rare name, and I may find a personal blog where he rants about the impact of the internet on democratization in developing countries or some such thing, something I could easily skip by if I were to let myself be guided by algorithmically derived facets or even human tagging.  If the results are too noisy, Google is fast enough that adding some constraining terms is a quick process.  Products, such as the ceiling fans and shoes that Daniel demonstrates later in the talk, are simple objects with obvious facets to aid in navigation, and generally I want to navigate among options, so there it’s a no-brainer.  Not so with most of my Google searches.  So I’m happy with its simplicity.  I enter words, it gives me back things that have the words, weighted by a variety of factors including PageRank.

I think however that methods for searching news can benefit from what Endeca has done with some of their clients.  It’s a bit of a different beast, since the value of a news article or image declines rapidly with age, and the volume is quite small.  But when searching old news, admittedly a rare usage pattern, the benefit would be substantial.  I was recently searching Google News for old items on Bernard Madoff.  All I got was an ugly list.  It’s an area where PageRank is apt not to help much, unlike with Google’s web search product, so I would have loved the sort of help that faceted navigation could offer.

One of his quotes did however ring a bell with me.  On slide 9 he quotes a Fortune 500 CEO who said that “Search on the Internet is solved”.  It reminded me of a quote from Lord Kelvin in 1900, “There is nothing new to be discovered in physics now.”  That was just when things started to get interesting.  Back then, physics at least appeared to be stable and reaching maturity.  With Maxwell’s equations, one gets a feeling of completeness, that there is nothing more except a reduction to practice.  Perhaps Kelvin can be excused.  So it is interesting that only a few years of looking at the same search page might make one think that search is a solved problem.  More revolutions will come.


Google Blog Search: Not just search anymore

October 2, 2008

Today Google launched a revamped version of its Blog Search, and for the first time, it’s not just search. They’re surfacing top blog clusters.  There are some interesting reviews here and here.  I have a different take on what they’re doing and why, and I think we’ll start seeing these clusters in a few other places around their web site, like their web search page.

Right now, they have a few high-level categories, and some simple clustering to surface major topics in the blogging world.  The clusters, like this one, are sized well, give you a nice chart, and have the usual blog noise, nothing too surprising.  Clustering isn’t trivial, but they already do a fine job with news, so I’m glad to see them port some of this over to the blog section.  Before this launched, it seemed like blogs weren’t getting much love, and not fully benefiting from the skills of their Google News team.  It’s using a URL pattern for cluster ids similar to the one Google News uses (“ncl” for news and “bcid” for blogs, i.e. news cluster id and blog cluster id, and the numbering scheme seems the same).  Blogs are still off in their own sandbox, and it seems that it would not be too hard to link major blog clusters to major news clusters.  I suspect that, relatively soon, we’ll start seeing exactly that:  blog clusters on their News section, and on their main search page.  You can look at blog posts from within a news article cluster, but it seems to have no relation to what’s going on with their Blog search page.

Could they have included blog search results on their main web search page now, and more heavily in their news site?  I suppose, but blog clustering makes it easier.  Much of what Google does incorporates page rank, and blogs and news aren’t as amenable to that sort of treatment, and splogs can gum things up.  Clustering however is a good way to surface things that are new and significant, you just have to pick good representatives from the cluster when you decide on a title and who to link to.  You also have the problem that, in relying on clustering, you will always be lagging others, since you have to wait until the momentum has already developed.  

Right now, if you use Google to search the web for Gwen Ifill, the first result says “News results for gwen ifill”.  But if you search for Geraldo Rivera, it doesn’t have a news slug, and just links to Wikipedia.  Perfect, Geraldo isn’t big in the news right now, but Gwen Ifill is.  How did Google know that?  Clustering!  Now that they have clusters for blogs, they can do the same with them.  But they need to be cautious with their web search, so I suspect they’ll let blog clustering run for a while before incorporating it.  Before that happens, we’ll probably see them rolled out to Google News, perhaps associated with news clusters.  

Significantly, actually searching for blogs does not leverage any of their clustering work.  At the time I’m writing this, the top blog cluster is one on Gwen Ifill, but searching for “Gwen Ifill” doesn’t show the cluster.  So the underlying search seems to be unchanged.  Cluster size however can help a great deal when sorting search results.  Why aren’t they leveraging clusters with their search?  There’s probably a technical hurdle there, or perhaps they just need time to back-process all of the blogs back to 2005.
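As a rough illustration of what leveraging clusters in search could look like, here is a sketch that boosts each result by the size of the cluster it belongs to.  None of this reflects how Google actually ranks anything.  The field names, the cluster-size lookup, and the log weighting are all my own assumptions.

```python
import math


def rank_with_clusters(results, cluster_sizes):
    """Sort results so posts in large clusters float up.

    results: list of dicts with a text-match "score" and an optional "cluster_id".
    cluster_sizes: dict mapping cluster_id -> number of posts in that cluster.
    """
    def boosted(result):
        # Posts outside any known cluster are treated as a cluster of one.
        size = cluster_sizes.get(result.get("cluster_id"), 1)
        # Log damping keeps one giant cluster from swamping the text score entirely.
        return result["score"] * math.log(1 + size)

    return sorted(results, key=boosted, reverse=True)
```

With something like this, a search for “Gwen Ifill” would put posts from the big Ifill cluster ahead of a stray post that merely matches the words slightly better.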

They also have an API that lets you tell them about a new blog post.  Great.  It’s not like they really need it, they’re Google after all.  But it’s a friendly thing to do, and it makes bloggers feel like they’re a bit more in control and that Google is working with them.


Google InQuotes and Silobreaker Quote Attribution

September 25, 2008

This is something of a followup to yesterday’s post about Google’s InQuotes feature.  I was reading a post from Mark Forscher, who laments lack of access through an API, or the ability to plug in any person.  It occurred to me that at Daylife we opted to package the quotes somewhat differently because it is being served through an API.  You will note that Google displays a fairly good chunk of the excerpt, for example this one:

“At a time of crisis, when leadership is needed, Senator Obama has not provided it,” McCain said Sunday in a speech to the National Guard Association in Baltimore. “We saw the same lack of leadership on Iraq.”

Whereas Daylife would report this as:

“At a time of crisis, when leadership is needed, Senator Obama has not provided it… We saw the same lack of leadership on Iraq.”

A very long time ago, if you went to the Daylife web site, you would have seen something more like the former.  It requires more real estate to display, but has a nice feature that I have sorely missed.  It covers your butt if your attribution is inaccurate.  It also provides a bit more context, in this case telling you it was a speech to the National Guard Association in Baltimore.  You could also do an easy but not terribly accurate job of searching attributed quotes by indexing the quote and non-quote text separately, although both Daylife and Google opted for a more sophisticated approach.

You will note that for the Google application, if you look for quotes on “immigration” or “bush”, the words always appear within the quote, and not the non-quote text that they display.  That’s some nice attention to detail for a feature like this.  They’re only indexing the quote, but displaying a broader swath of text.  So they index the quote, and separately store the quote with the context, or perhaps index its position within the full article body.  Similarly if you search for “Depp” for quotes by Johnny Depp, you correctly get nothing.  They’re also probably doing some simple sentence chunking to select better endpoints.  That, for those of you that haven’t tried it, includes more than just looking for periods, exclamation points, and question marks.
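Here is a small sketch of the index-the-quote, display-the-context idea.  It is my reading of the behavior described above and not Google’s implementation.  The class, its fields, and the crude tokenizer are all invented for the example.

```python
import re
from collections import defaultdict


class QuoteIndex:
    """Index only the words inside the quote, but store and return the wider context."""

    def __init__(self):
        self.records = []              # each record keeps speaker, quote, and context
        self.index = defaultdict(set)  # word -> ids of records whose QUOTE contains it

    def add(self, speaker, quote, context):
        rec_id = len(self.records)
        self.records.append({"speaker": speaker, "quote": quote, "context": context})
        # Only the quote text is tokenized; speaker and context are never indexed.
        for word in re.findall(r"[a-z0-9']+", quote.lower()):
            self.index[word].add(rec_id)
        return rec_id

    def search(self, term):
        """Return the display context for every quote containing the term."""
        ids = self.index.get(term.lower(), set())
        return [self.records[i]["context"] for i in sorted(ids)]
```

Searching for a word that appears only in the surrounding text, or only in the speaker’s name, returns nothing, which matches the “Depp” behavior.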

For the Daylife API, all you get is the quote, with ellipses joining fragments where appropriate.  That requires an algorithm for bridging quote fragments, not a trivial task for a quote-heavy document, and complicates attribution somewhat.  I’m curious though how others feel about quote-only vs. quote-plus-context.  Certainly it depends on the application, but since the Daylife API is intended to support a wide array of applications, it might argue for supporting both.  On the other hand, one should avoid complicating an API needlessly.  
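The simplest possible version of that bridging step looks something like the sketch below.  It is not the Daylife algorithm.  It naively assumes every double-quoted fragment in the passage belongs to the same speaker, which is exactly the assumption that breaks down in a quote-heavy document.  The function name is made up.

```python
import re


def bridge_fragments(passage):
    """Join the double-quoted fragments of a passage with ellipses.

    Naive assumption: all quoted fragments in the passage come from one speaker.
    """
    fragments = re.findall(r'"([^"]+)"', passage)
    if not fragments:
        return ""
    # Drop the trailing comma or period that closes each non-final fragment.
    head = [f.strip().rstrip(",.") for f in fragments[:-1]]
    return "\u2026 ".join(head + [fragments[-1].strip()])
```

Run on the McCain passage above, this yields the quote-only form with a single ellipsis joining the two fragments.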

Another company, Silobreaker, also does quote extraction and attribution.  Their quotes module for Barack Obama is shown below, and has a few problems.  Too much of the surrounding text is included, and isn’t selected as neatly as Google.  Two of the four weren’t actually said by Obama.  They’ve changed the bolded quote block to double quotes, but elsewhere the documents use single quotes, making it confusing to understand the original text.  Perhaps they didn’t intend it as quotes by Obama, and indeed it says “Quotes for Barack Obama”.  He will be glad to hear of their support.  That however is not what the placement of the module calls out for, and what most users would expect.  Lowering expectations in such a way not only means adding language like that, but involves the entire context of the presentation.  So I would put Silobreaker in a distant third for quality of quote attribution.

 

 

Silobreaker.com quotes for Barack Obama


Google Labs InQuotes Feature

September 24, 2008

The Google News group released a nice new InQuotes feature today.  It’s a nice interface.  I wish we had done it here at Daylife, since our API would let you do exactly what they are doing.  

Quote extraction itself is not too difficult a problem, although attribution can be tricky.  If you can live with low recall, you can look for “‘blah blah’, said Sam Peckinpah”, and simple patterns like that.  The English language however admits an enormous number of ways to attribute a quote to someone, or to a pronoun or name fragment representing someone.  The other thing I’ll say about quote attribution: never get them wrong.  Or almost never.  That’s quote extraction and attribution in a nutshell.  
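A sketch of the low-recall approach: two regular expressions that catch only the most obvious attributions.  This is an illustration, not what Daylife or Google runs.  The patterns, the crude definition of a name (two or more capitalized words), and the function name are my own, and it assumes straight double quotes.

```python
import re

# Crude stand-in for a person's name: two or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"

PATTERNS = [
    # "blah blah," said Sam Peckinpah
    re.compile(r'"(?P<quote>[^"]+)"\s*,?\s*said (?P<speaker>' + NAME + ")"),
    # "blah blah," Sam Peckinpah said
    re.compile(r'"(?P<quote>[^"]+)"\s*,?\s*(?P<speaker>' + NAME + ") said"),
]


def attribute_quotes(text):
    """Return (speaker, quote) pairs in document order for the patterns above."""
    found = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group("speaker"), m.group("quote").rstrip(",")))
    return [(speaker, quote) for _, speaker, quote in sorted(found)]
```

Anything attributed to a pronoun (“he said”), a bare surname, or a name that doesn’t fit the crude pattern is simply skipped, which is where the recall goes.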

Nice interface aside, the interesting thing here is that Google News did it.  It’s the fun sort of thing I’d expect from a smaller, more nimble company.  Like us.  You can’t take yourself too seriously if you want to put stuff like this out, and you have to weather complaints about excluding candidates.