DotSpots Presentation at TechCrunch50

A tight group of developers presented DotSpots at TechCrunch50.  Watch the video

Note that for what follows, I’m only going off of the presentation.  So how does it work?  Mostly by mashing together buzzwords.  The idea is to allow users (wisdom of the crowds) to annotate paragraphs, and have those annotations linked (semantically) to other publications with a similar paragraph.

Once the article is diced into paragraphs, they probably maintain a full-text search index or a hash-based index on their end that allows comments to be matched to similar or identical paragraphs.  On the surface this is very easy, although there are some fine points.  If you’re just matching identical paragraphs, you can tie annotations to a hash of the paragraph and index them by that hash.  This requires the paragraphs to be exactly the same, although you can accommodate some differences by, for example, stripping whitespace, converting to lower case, and so on.  That should work for articles syndicated across various outlets.  You can do fuzzier matching, using say the Levenshtein edit-distance function, or just a simple tf-idf calculation.  It can get tricky as paragraph bodies grow larger.  How similar do they need to be?  Within a couple of characters?  A word or two?  I would just use a cleaned version of the paragraph and require an exact match.  That will be fine for most syndicated content.  For retrieving very similar paragraphs, I would use one of the open source search engines and pass in an OR’d list of terms, having it find the most similar paragraphs; then you can check whether they meet whatever criteria you want.  If you want exact matches, I would clean the paragraphs and index them under a hash value.  The similarity measure could be a simple tf-idf, or use n-grams, or even a multi-spectral method.  I would just go with tf-idf on unigrams; there is no clear benefit in going to multi-term groups.  A separately weighted similarity measure for entities might help.  None of this, however, would fall into the category of semantic analysis.
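To make the tf-idf option concrete, here is a minimal sketch in Python of unigram tf-idf with cosine similarity.  This is only my illustration, not anything DotSpots has described: the function names, the tokenizer, and the smoothed idf formula are all assumptions, and a real system would more likely lean on an existing search engine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into unigram tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(paragraphs):
    """Return one sparse tf-idf vector (a dict of term -> weight) per paragraph."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # count each term once per paragraph
    # Smoothed idf (an assumed choice) so no term ends up with zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

You would compute the vectors for a candidate set of paragraphs, then accept a match only when the cosine score clears some threshold.  Where to put that threshold is exactly the "how similar do they need to be?" question, and I don't think there is a principled answer.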

In fuzzing out paragraph matches, you could also pass each paragraph through a sentence chunker and calculate a similarity measure for sentences from each paragraph.  The sentence chunking is a slight twist on the methods I mentioned above.  It would accommodate editors who remove or add a sentence here or there, or alter where paragraphs break.  All of this can happen when a publication prints an article syndicated through the AP.  I would not call that semantic analysis either.  The benefits would be marginal.
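A rough sketch of the sentence-level idea, again in Python and again purely illustrative.  The regex splitter is a naive stand-in for a real sentence chunker (it will trip over abbreviations like “Mr.”), and scoring by Jaccard overlap of cleaned sentences is my assumption, one of several measures you could pick.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def normalize(sentence):
    """Lowercase and drop punctuation so trivial edits don't block a match."""
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))


def sentence_overlap(paragraph_a, paragraph_b):
    """Jaccard overlap of the cleaned sentences in two paragraphs (0.0 to 1.0)."""
    a = {normalize(s) for s in split_sentences(paragraph_a)}
    b = {normalize(s) for s in split_sentences(paragraph_b)}
    a.discard("")
    b.discard("")
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

If an editor drops one sentence out of three, the score comes out at 2/3 rather than zero, which is the whole point of chunking below the paragraph level.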

You can do trickier things, like linking annotations based on semantic similarity.  Semantic similarity of two paragraphs is a measure of how much information they have in common.  Thus “George Bush vetoed the Waterboarding Ban” and “The ban on Waterboarding was vetoed by President Bush” have a high similarity measure.  Most interesting assertions in the news will use named entities (e.g. George Bush, President Bush).  If you want the two statements I just gave to get tied together, recognizing various name forms (President Bush, George W. Bush), and disambiguating between those with the same name (Mike Wallace the journalist, and Mike Wallace the NASCAR driver), is an important and fairly involved first step.  From there you have to start parsing sentences and looking at word synonyms, extracting and abstractly representing the information contained in the paragraph.  But if I attach an annotation to a paragraph, who’s to say what particular assertion or aspect of the paragraph I’m referring to?  Why not annotate particular assertions?

It doesn’t look like DotSpots is doing any semantic analysis, based on their demo, and if so that is a good call on their part.  As they stated, they’re just annotating paragraphs.  They probably chose paragraphs because they’re very easy to identify and are more specific than just a sentence, and it will make the UI less cluttered.  Paragraphs have some whitespace at the end, a great spot to put that little link for DotSpots annotations.  I would do an exact match on a cleaned version of the paragraph and index each paragraph based on the hash.  For retrieval, I would look up the hash of the paragraph, and possibly the hash of each potential sub-paragraph produced by sentence chunking, if that played well.  It will be fast.  Semantic analysis will be noisy, expensive, and not much better, if at all.
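Here is roughly what I mean by hashing a cleaned paragraph, as a small Python sketch.  To be clear, this is my guess at a workable scheme and not DotSpots’ implementation: the cleaning rules, the choice of SHA-1, and the `AnnotationIndex` class are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def clean(paragraph):
    """Collapse runs of whitespace to single spaces, trim, and lowercase."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def paragraph_key(paragraph):
    """Hash of the cleaned paragraph, used as the index key."""
    return hashlib.sha1(clean(paragraph).encode("utf-8")).hexdigest()


class AnnotationIndex:
    """In-memory stand-in for whatever store would hold the annotations."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def annotate(self, paragraph, note):
        """Attach a note to a paragraph, keyed by its cleaned hash."""
        self._by_key[paragraph_key(paragraph)].append(note)

    def lookup(self, paragraph):
        """Return every note attached to any paragraph that cleans to the same text."""
        return list(self._by_key.get(paragraph_key(paragraph), []))
```

An annotation left on one outlet’s copy of a syndicated paragraph then shows up on every other copy that differs only in case or whitespace, and each lookup is a single key fetch.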

I suspect there’s nothing semantic about their business.  I probably shouldn’t care too much; the term has been so thoroughly debased that it’s far too late to stop the process.  I find it ironic that supposed practitioners of semantics and so forth have difficulty with the semantics of “semantics”.

As for other aspects of the business, I see a few issues.  Issue one, his problem of wanting to comment on all issues of an article is a problem with syndication and the AP’s model of doing business.  Read about it from Jeff Jarvis.  We need to fix the underlying problem.  I’d like to see sites publish their own original content, and link elsewhere when needed.  Echoing the same article to several hundred outlets and stripping off the name of the author and original publisher is not the best scheme for the internet.

Issue two, not everyone will like comments globalized in this fashion.  His example of annotating an article on GOPUSA.com seems great: he can push the pro-Obama video onto their site.  I think GOPUSA.com would not like their site hijacked by liberals.  I think they would prefer to get comments from their own readership, biased though they be.  That’s fine with me, the internet is a big place.  How do you figure out which annotations to surface?  Voting?  Some universal reputation system?  Yuk.  Every viewpoint and perspective deserves a place to congregate without forced intermixing.

Issue three, a whole column is a lot to give up, with little control over what appears.  

Issue four, they claim the publisher has complete control, but apparently they don’t understand the completeness of control some publishers exert over user interactions, for example by removing offensive posts, blocking users, establishing filters, requiring varying degrees of registration, and exercising fine control over layout.

Issue five, the fellow giving the demonstration has a clear political agenda, an obvious bias towards Obama, and wants his company to contribute towards world peace.  Playing to the emotions of the audience with the notion of pushing a liberal opinion onto a conservative web site is a fine rhetorical technique I suppose, but it makes me question their objectives and whether they clearly understand the online social landscape.  Coming from Bizrate and ShopZilla, it might seem the expressions of political opinion that would appear in such annotations can be easily aggregated across all sites, but the social news landscape is somewhat messier than ratings on basement dehumidifiers.

—(edit)—

The above is a bit harsh; I posted some second thoughts and constructive ideas.

3 Responses to “DotSpots Presentation at TechCrunch50”

  1. DotSpots second thoughts « NP-Harder Says:

    [...] second thoughts The earlier post on DotSpots was probably on the harsh side.  Its a good idea.  Some thoughts on [...]

  2. Dan Andersen Says:

    I actually liked dotspots idea. Very different than the ordinary sites out there at TC50.
    The ability of the presenter – I agree – he was tense and could have been a much better relaxed presentation. The guy needs public speaking courses. However, his ability to present should not be mixed with the product. The product in my opinion is solid. I am a blogger and cannot wait to use this tool. Can you imagine, adding my comments on CNN, BBC, etc.
    Also talking about Obama, McCain, etc. in your post above, I have to say that I cannot actually wait to use this tool to place my comments ON these politician sites. On sites that require true voice of the people. Imagine? Type comments onto other people’s sites :) I mean that’s not your regular site and is much different than the rest of the TC50s… games, dead people site, charity, social network, etc. all been there and done that… none totally original.
    The presenter though DOES need help :)

  3. sandrar Says:

    Hi! I was surfing and found your blog post… nice! I love your blog. :) Cheers! Sandra. R.
