DotSpots second thoughts

September 10, 2008

The earlier post on DotSpots was probably on the harsh side.  It’s a good idea.  Some thoughts on improving it:

1) Allow users to pivot on statistically significant terms.  Say I see an annotation that involves George Bush and Waterboarding; are there other annotations that match it?  Can I see recent and/or popular ones for GWB, or for Waterboarding, or the combination?  This can be accomplished with a fuzzy match on the annotated paragraph.  Pass the paragraph into a full text search index, and write your own ranking function that includes the recency of the retrieved paragraph and its activity (number of posts), normalized by site activity when the retrieved paragraph was first annotated (you’re going to grow, and that should be factored in).  Collect the results, and find the terms they share.  That will help you pull out Bush and Waterboarding as terms that are commonly associated with annotations, and you can throw up “Bush”, “Waterboarding”, and “Bush AND Waterboarding” as options for viewing other annotations and paragraphs.  That would be a nice feature.
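A rough sketch of that ranking and term-pivot step, in Python.  Everything here is an assumption for illustration: the Hit fields, the 30-day half-life, and the way the three factors are multiplied together are my own choices, not anything DotSpots has described.  The text_score is whatever relevance number your search index hands back.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """One paragraph returned by the full text index (hypothetical shape)."""
    text: str
    text_score: float      # relevance score from the search index
    age_days: float        # days since the paragraph was first annotated
    posts: int             # number of annotations on the paragraph
    site_posts_then: int   # site-wide posts in the period it was first annotated


def rank_score(hit, half_life_days=30.0):
    """Combine text relevance, recency, and growth-normalized activity."""
    # Recency halves every half_life_days.
    recency = 0.5 ** (hit.age_days / half_life_days)
    # Dividing by site activity keeps early paragraphs comparable to later ones.
    activity = hit.posts / max(hit.site_posts_then, 1)
    return hit.text_score * recency * (1.0 + activity)


def pivot_terms(hits, stopwords, top_n=2):
    """Return the terms that appear in the most retrieved paragraphs."""
    counts = Counter()
    for hit in hits:
        # Count each term once per paragraph (document frequency).
        counts.update(set(re.findall(r"[a-z]+", hit.text.lower())) - stopwords)
    return [term for term, _ in counts.most_common(top_n)]
```

In practice you would sort the hits by rank_score, keep the top few dozen, and run pivot_terms over those.  A real stopword list and some entity recognition would do better than bare word counts.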

2) The goal is nice, but they need a solution for surfacing good annotations, and displaying a broad range of them.  A simple voting scheme should be good.

3) They could easily find themselves dominated by a particular perspective.  I don’t think there’s a good solution, unless you start tracking the origin of each annotation and the voting patterns, and can figure out, e.g., what someone on gopnow.com will find a useful annotation.  But that’s tricky, and might just confuse people.  I think you just have to accept that you might get stuck in a niche.  I’d encourage them to stay above the debate and not play into partisanship as they did at TechCrunch, unless they want to target a particular niche.

Otherwise, it’s a good idea.  Very early phase, so not bad for a first peek.


DotSpots Presentation at TechCrunch50

September 9, 2008

A tight group of developers presented DotSpots at TechCrunch50.  Watch the video

Note that for what follows, I’m only going off of the presentation.  So how does it work?  Mostly by mashing together buzzwords.  The idea is to allow users (wisdom of the crowds) to annotate paragraphs, and have those annotations linked (semantically) to other publications with a similar paragraph.

Once the article is diced into paragraphs, they probably maintain a full text search index or hash-based index on their end that allows comments to be matched to similar or identical paragraphs.  On the surface, very easy, although there are some fine points.  If you’re just matching identical paragraphs, you can tie annotations to a hash of the paragraph and index them based on the hash.  This requires the paragraphs to be exactly the same, although you can accommodate some differences by, for example, stripping white space, converting to lower case, and so on.  That should work for articles syndicated across various outlets.  You can do fuzzier matching, using say the Levenshtein edit-distance function, or just a simple tf-idf calculation.  It can get tricky as paragraph bodies grow larger.  How similar do they need to be?  A couple of characters?  A word or two?  I would just use a cleaned version of the paragraph and require an exact match.  That will be fine for most syndicated content.  For retrieving very similar paragraphs, I would use one of the open source search engines and pass in an OR’d list of terms, having it find the most similar paragraphs; then you can check whether they meet whatever criteria you want.  If you want exact matches, I would clean the paragraphs and index them under a hash value.  The similarity measure could be a simple tf-idf, or use n-grams, or even a multi-spectral method.  I would just go with tf-idf (unigrams); there is no clear benefit in going to multi-term groups.  A separately weighted similarity measure for entities might help.  None of this, however, would fall into the category of semantic analysis.
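To make the unigram tf-idf option concrete, here is a minimal Python sketch using only the standard library.  The function names and the smoothed idf formula are my own assumptions; a real search engine will have its own weighting, and I have no idea what DotSpots actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into unigram tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(paragraphs):
    """Build one sparse tf-idf vector (a dict) per paragraph."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    # Smoothed idf, so terms found in every paragraph keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

You would still need to pick a cutoff for "similar enough", which is exactly the tricky part mentioned above.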

In fuzzing out paragraph matches, you could also pass each paragraph through a sentence chunker and calculate a similarity measure for sentences from each paragraph.  The sentence chunking is a slight twist on the methods I mentioned above.  It would accommodate editors that remove or add a sentence here or there, or alter where paragraphs break.  All of this can happen when a publication prints an article syndicated through the AP.  I would not call that semantic analysis either.  The benefits would be marginal.
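Here is roughly what the sentence-level comparison could look like in Python.  The regex splitter is a deliberately naive stand-in for a real sentence chunker (it will trip on abbreviations such as "George W. Bush"), and scoring by the Jaccard overlap of cleaned sentences is my own assumption, picked because it is easy to reason about.

```python
import re


def sentences(paragraph):
    """Split on ., ! or ? followed by whitespace, then clean each sentence."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    # Lower-case and drop punctuation so trivial edits do not break a match.
    return [" ".join(re.findall(r"[a-z0-9]+", s.lower())) for s in parts if s.strip()]


def sentence_overlap(a, b):
    """Jaccard overlap of the cleaned sentences in two paragraphs."""
    set_a, set_b = set(sentences(a)), set(sentences(b))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

A paragraph that lost one of its three sentences in editing still scores 2/3 against the original, where a whole-paragraph hash would simply miss.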

You can do trickier things, like linking annotations based on semantic similarity.  Semantic similarity of two paragraphs is a measure of how much information they have in common.  Thus “George Bush vetoed the Waterboarding Ban” and “The ban on Waterboarding was vetoed by President Bush” have a high similarity measure.  Most interesting assertions in the news will use named entities (e.g. George Bush, President Bush).  If you want the two statements I just gave to get tied together, recognizing various name forms (President Bush, George W. Bush), and disambiguating between those with the same name (Mike Wallace the journalist, and Mike Wallace the NASCAR driver), is an important and fairly involved first step.  From there you have to start parsing sentences and looking at word synonyms, and extract and abstractly represent the information contained in the paragraph.  But if I attach an annotation to a paragraph, who’s to say what particular assertion or aspect of the paragraph I’m referring to?  Why not annotate particular assertions?

It doesn’t look like DotSpots is doing any semantic analysis, based on their demo, and if so that would be a good call on their part.  As they stated, they’re just annotating paragraphs.  They probably chose paragraphs because they’re very easy to identify and are more specific than just a sentence, and it will make the UI less cluttered.  Paragraphs have some whitespace at the end, a great spot to put that little link for DotSpots annotations.  I would do an exact match on a cleaned version of the paragraph, and index each paragraph based on the hash.  For retrieval, I would look up the hash of the paragraph and possibly of each potential sub-paragraph based on sentence-chunking results, if that played well.  It will be fast.  Semantic analysis will be noisy, expensive, and not much better if at all.
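The exact-match approach is simple enough to sketch in a few lines of Python.  The class and function names are hypothetical, the cleaning step is just the lower-casing and whitespace stripping mentioned earlier, and SHA-1 is an arbitrary choice of hash; this is my guess at a workable design, not a description of what DotSpots runs.

```python
import hashlib
from collections import defaultdict


def clean(paragraph):
    """Lower-case the paragraph and collapse all runs of whitespace."""
    return " ".join(paragraph.lower().split())


def paragraph_key(paragraph):
    """Hash of the cleaned paragraph, used as the index key."""
    return hashlib.sha1(clean(paragraph).encode("utf-8")).hexdigest()


class AnnotationIndex:
    """In-memory map from paragraph hash to the annotations attached to it."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def annotate(self, paragraph, note):
        self._by_key[paragraph_key(paragraph)].append(note)

    def lookup(self, paragraph):
        # .get avoids creating empty entries for paragraphs never annotated.
        return list(self._by_key.get(paragraph_key(paragraph), []))
```

An annotation left on one outlet's copy of a syndicated paragraph then shows up on any other copy that differs only in case or spacing, and each lookup is a single hash probe.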

I suspect there’s nothing semantic about their business.  I probably shouldn’t care too much; the term has been so thoroughly debased that it’s far too late to stop the process.  I find it ironic that supposed practitioners of semantics and so forth have difficulty with the semantics of “semantics”.

As for other aspects of the business, I see a few issues.  Issue one, his problem of wanting to comment on all issues of an article is a problem with syndication and the AP’s model of doing business.  Read about it from Jeff Jarvis.  We need to fix the underlying problem.  I’d like to see sites publish their own original content, and link elsewhere when needed.  Echoing the same article to several hundred outlets and stripping off the name of the author and original publisher is not the best scheme for the internet.

Issue two, not everyone will like comments globalized in this fashion.  His example of annotating an article on GOPUSA.com seems great: he can push the pro-Obama video onto their site.  I think GOPUSA.com would not like their site hijacked by liberals.  I think they would prefer to get comments from their own readership, biased though they be.  That’s fine with me; the internet is a big place.  How do you figure out which annotations to surface?  Voting?  Some universal reputation system?  Yuk.  Every viewpoint and perspective deserves a place to congregate without forced intermixing.

Issue three, a whole column is a lot to give up, with little control over what appears.  

Issue four, they claim the publisher has complete control, but apparently they don’t understand the completeness of control some publishers exert over user interactions, for example by removing offensive posts, blocking users, establishing filters, requiring varying degrees of registration, and fine control over layout.  

Issue five, the fellow giving the demonstration has a clear political agenda, an obvious bias towards Obama, and wants his company to contribute towards world peace.  Playing to the emotions of the audience with the notion of pushing a liberal opinion onto a conservative web site is a fine rhetorical technique I suppose, but it makes me question their objectives and whether they clearly understand the online social landscape.  Coming from Bizrate and ShopZilla, it might seem that the expressions of political opinion that would appear in such annotations can be easily aggregated across all sites, but the social news landscape is somewhat messier than ratings on basement dehumidifiers.

—(edit)—

The above is a bit harsh.  I posted some second thoughts and constructive ideas.