DotSpots second thoughts

September 10, 2008

The earlier post on DotSpots was probably on the harsh side.  It’s a good idea.  Some thoughts on improving it:

1) Allow users to pivot on statistically significant terms.  Say I see an annotation that involves George Bush and Waterboarding.  Are there other annotations that match this?  Can I see recent and/or popular ones for GWB, or for Waterboarding, or the combination?  This can be accomplished with a fuzzy match on the annotated paragraph.  Pass the paragraph into a full text search index, and write your own ranking function that includes the recency of the retrieved paragraph and its activity (number of posts), normalized by site activity when the retrieved paragraph was first annotated (you’re going to grow, and that should be factored in).  Collect the results, and find matching terms.  That will help you pull out Bush and Waterboarding as terms that are commonly associated with annotating, and you can throw up “Bush”, “Waterboarding”, and “Bush AND Waterboarding” as options for viewing other annotations and paragraphs.  That would be a nice feature.

2) The goal is nice, but they need a solution for surfacing good annotations, and displaying a broad range of them.  A simple voting scheme should be good.

3) They could easily find themselves dominated by a particular perspective.  I don’t think there’s a good solution, unless you start tracking the origin of the annotation and voting patterns, and can figure out, e.g., what someone on gopnow.com will find a useful annotation.  But that’s tricky, and might just confuse people.  I think you just have to accept that you might get stuck in a niche.  I’d encourage them not to play into partisanship as they did at TechCrunch, and to stay above the debate, unless they want to target a particular niche.
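The ranking and pivot idea in 1) could be sketched roughly like this.  It is only a sketch: the Hit fields, the 30-day half-life, the way the factors are multiplied together, and the plain substring match for terms are all my own assumptions, and the text-match score is assumed to come from whatever full text index you use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """One annotated paragraph returned by the full text index (hypothetical shape)."""
    text: str             # the retrieved paragraph
    match_score: float    # text-match score reported by the index
    age_days: float       # days since the paragraph was first annotated
    posts: int            # number of annotations on the paragraph
    site_posts_then: int  # site-wide posts per day when it was first annotated


def rank_score(hit, half_life_days=30.0):
    """Combine text match, recency, and activity normalized by site activity."""
    recency = 0.5 ** (hit.age_days / half_life_days)    # halves every half-life
    activity = hit.posts / max(hit.site_posts_then, 1)  # corrects for site growth
    return hit.match_score * recency * (1.0 + activity)


def pivot_terms(hits, candidate_terms, top_n=10, min_share=0.5):
    """Return candidate terms that appear in at least min_share of the top hits."""
    top = sorted(hits, key=rank_score, reverse=True)[:top_n]
    if not top:
        return []
    counts = Counter()
    for hit in top:
        lowered = hit.text.lower()
        for term in candidate_terms:
            if term.lower() in lowered:
                counts[term] += 1
    return [term for term, n in counts.most_common() if n / len(top) >= min_share]
```

The candidate terms would come from the paragraph being viewed.  Whatever survives the cut is what you offer as pivots, alone or joined with AND.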

Otherwise, it’s a good idea.  Very early phase, so not bad for a first peek.


Kosmix part II

September 10, 2008

We discussed in the earlier post how Kosmix might identify related topics.  Another of their nice features, and the one that in my opinion really makes them interesting, is that the third party modules included on the page depend on the topic.  Users also have the ability to indicate whether a particular module, for example photos from Daylife, is relevant to the topic.  (Skip to the end, where it says SUMMARY, if you don’t want all the boring details.)

They already have a database of topics and their Wikipedia category.  They also quite likely have the capability to assign modules on a per-topic basis, given that they are allowing users to tweak them.   

How do they decide what modules to include?  One could imagine setting up several initial module mixtures for political figures, movies, companies, and so on, and making a preliminary assignment to each topic.  Since most of their topics are mapped to Wikipedia entries, that would be an easy way to bootstrap module assignment, since using Wikipedia’s categories reduces the size of the problem greatly.  Wikipedia’s categories aren’t great, since they’re not intended to be a taxonomy or hierarchy, but they could save some work.

You could also base it on attributes of the Wikipedia entry.  Even if the categories in Wikipedia don’t exactly match up to what you want, the entries have a uniformity that makes other types of classification easy.  People, for example, almost always have a birth date specified in the first paragraph, which is easy to pick out.  It would also be easy to hand-craft rules to identify movies.  A few dozen of these, and you’re probably looking good.
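To give a feel for what a hand-crafted rule might look like, here is a rough sketch that classifies an entry from its first paragraph.  The patterns are my own guesses at common Wikipedia phrasing ("(born July 6, 1946)", "is a 1994 American crime film"), not anything Kosmix is known to use, and real entries would need many more rules and exceptions.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
# Matches "July 6, 1946" or "6 July 1946".
DATE = r"(?:" + MONTHS + r"\s+\d{1,2},\s+\d{4}|\d{1,2}\s+" + MONTHS + r"\s+\d{4})"

# Rules are tried in order; the first one that matches wins.
RULES = [
    # A date right after an opening parenthesis, optionally preceded by "born".
    ("person", re.compile(r"\((?:born\s+)?" + DATE)),
    # "is a 1994 American crime film"
    ("movie", re.compile(r"\bis an? \d{4} (?:[\w-]+ ){0,5}film\b")),
    # "is an American multinational technology corporation"
    ("company", re.compile(r"\bis an? (?:[\w-]+ ){0,5}(?:company|corporation)\b")),
]


def classify(first_paragraph):
    """Return a coarse topic class for an entry, or "other" if no rule fires."""
    for label, pattern in RULES:
        if pattern.search(first_paragraph):
            return label
    return "other"
```

Anything that falls through to "other" just gets the generic module mix.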

There are other ways of doing this of course, well into the region of diminishing returns.  Since, as per the previous post, they likely have a set of documents associated with the topic to facilitate exposing related topics, you might consider using a machine learning technique to identify various types of topics.  This could be trained using high-reliability assignments from Wikipedia, or through more manual means.  For example, if you were able to identify 500 entries that you knew were musical bands, you could use that information to train a simple classifier and make additional assignments.  But I would skip all of this machine-learning stuff.  A more direct question is, does the module have anything interesting for the topic?

As discussed before, their index might be relatively small and focused on certain sites.  Say IMDB is in their index.  If a topic hits with IMDB and scores high, identify it as a movie or actor, or just have a separate profile of modules for anything that hits with IMDB.  But most of the modules aren’t sites per se, but other search engines.  You wouldn’t build up a search index for MeeHive.  Assigning categories based on your index doesn’t seem like a great idea either, too noisy.  So I don’t think they’re doing any of this.

Here is a simple method.  Look at the results coming out of the third party searches, and use that to determine whether to include the module.  If you search YouTube for “george bush flagitious” and come up with nothing, or with low-quality video (old, few views, low ratings), don’t include the module.  The HTML being served up by their site already has the results baked into it; they aren’t scripts or widgets that populate separately.  So you can throw 50 modules onto the George W. Bush page, give each of them 1 second to respond, and display whatever gets back to you in time and seems to be of high quality.  Cache the page for a few minutes, so that popular ones load quickly.  It also gives you a way of populating modules intelligently even if someone searches for “george bush flagitious”.
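A minimal sketch of that fan-out, using threads from Python’s standard library.  The fetcher interface (a callable that returns a list of results and a quality estimate between 0 and 1) and the 0.5 quality cutoff are assumptions made up for illustration; caching is left out.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def gather_modules(fetchers, query, timeout_s=1.0, min_quality=0.5):
    """Call every module at once and keep the ones that answer in time.

    fetchers maps a module name to a callable taking the query and
    returning (results, quality); that interface is hypothetical.
    """
    pool = ThreadPoolExecutor(max_workers=max(len(fetchers), 1))
    futures = {pool.submit(fn, query): name for name, fn in fetchers.items()}
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block the page on slow modules

    page = {}
    for future in done:
        if future.exception() is not None:
            continue  # a failing module is simply left off the page
        results, quality = future.result()
        if results and quality >= min_quality:
            page[futures[future]] = results
    return page
```

Slow modules keep running in the background here, and their answers are thrown away.  A real version would probably save late answers for the cache.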

That sounds easy, but it isn’t really.  Not all services will report a relevance score.  YouTube is easy, since they’ll report number of views, ratings, and age, but for other modules figuring out whether they are returning good stuff might be difficult.  You can at least scan for the terms you put in.  If you can’t get a handle on relevance, you can push them further down on the page.  The George Bush topic page has a module for How-To videos at the bottom.  Not a huge deal, maybe someday users will help you clean it up.  You also don’t want the third party services to start hating you.  The cache will help there, but you might also want to record how often the results seem relevant, and use that to determine whether to even call the module in the first place.  If How-To videos for Bush seem lousy, stop asking for them, but give them another chance in a day or two.
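The bookkeeping for "stop asking, then give them another chance" could look something like this.  The thresholds (at least 5 calls, under 30% relevant, a one-day pause) are placeholders I picked, and how you decide that a single response was relevant is left to the caller.

```python
import time


class ModuleHealth:
    """Track how often each (module, topic) pair returns relevant results."""

    def __init__(self, min_rate=0.3, min_calls=5, retry_after_s=86400.0,
                 clock=time.time):
        self.min_rate = min_rate            # pause below this share of relevant calls
        self.min_calls = min_calls          # do not judge on too few calls
        self.retry_after_s = retry_after_s  # length of the pause in seconds
        self.clock = clock                  # injectable so it can be tested
        self.stats = {}                     # (module, topic) -> (relevant, total)
        self.paused_until = {}              # (module, topic) -> timestamp

    def record(self, module, topic, relevant):
        """Record one call; pause the pair if it has been consistently poor."""
        key = (module, topic)
        hits, total = self.stats.get(key, (0, 0))
        hits, total = hits + int(relevant), total + 1
        self.stats[key] = (hits, total)
        if total >= self.min_calls and hits / total < self.min_rate:
            self.paused_until[key] = self.clock() + self.retry_after_s
            self.stats[key] = (0, 0)  # start fresh once the pause is over

    def should_call(self, module, topic):
        """True unless the pair is currently paused."""
        return self.clock() >= self.paused_until.get((module, topic), 0.0)
```

Check should_call before sending the request, and call record once you have looked at what came back.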

One other important data point: the Microsoft topic page has a couple of financial news outlets on it.  These outlets will also surely serve up hits for George Bush, and just about anything else, but they seem to be isolated to company topics.  So I’m fairly confident they are classifying topic pages and using that as a guide for module selection.  There’s also the issue of ordering the modules.  The Microsoft topic page again is very nice; it has Wikinvest at the top, along with other financial news services.  The Britney Spears page has last.fm near the top.  That’s probably not all on-the-fly based on what the modules are returning.

Their RightHealth site, which is also nice, is another example where the modules are a bit more selective.  I actually like it more than their main Kosmix site; it’s cleaner and the modules are all appropriate.  But then it’s a smaller set of topics, and so easy to curate.

How do you handle searches?  If you search for “Microsoft Google” you don’t get all of those nice financial news sources.  It looks like they have a set of generic modules to fall back on if it’s a search as opposed to a topic.  That I think is an area for improvement.  At the least you could do string matches against your topics, and pull in modules that are indicated for either of them.
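The string-match idea is about as simple as it sounds.  A sketch, where the topic-to-module table and the module names in it are made up for illustration, and matching is on whole words separated by spaces (punctuation is not handled):

```python
def modules_for_query(query, topic_modules, fallback):
    """Pull in modules for every known topic whose name appears in the query.

    topic_modules maps a topic name to its ordered module list; fallback is
    the generic module list used when no topic name is found.
    """
    padded = " " + " ".join(query.lower().split()) + " "
    chosen = []
    for topic, modules in topic_modules.items():
        if " " + topic.lower() + " " in padded:
            for module in modules:
                if module not in chosen:  # keep first-seen order, no duplicates
                    chosen.append(module)
    return chosen or list(fallback)


# Hypothetical table; a real one would come from the topic database.
TOPIC_MODULES = {
    "Microsoft": ["Wikinvest", "Financial news"],
    "Google": ["Wikinvest", "Google Blog"],
    "Britney Spears": ["last.fm"],
}
```

With that table, a search for “Microsoft Google” picks up the modules indicated for both companies instead of the generic set.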

SUMMARY

So what’s my best guess here?  What would I do?  Set up one or two dozen classifications for topics, and make preliminary assignments using hand-crafted rules referencing Wikipedia categories and entries.  I wouldn’t spend too much time doing it, or try anything tricky.  Perhaps just identify people, companies, and “other” as a broad backstop where more fine-grained assignments aren’t necessary.  Have topic-specific module settings and ordering, set initially by classification and global orderings.

Develop an estimate of relevance for each of the third-party modules, and monitor it for every call.  The topic classifications seed what modules are used initially, and their order.  When a topic page is requested, call out to all of the modules, give them a second or two to respond, and prune ones that seem to have low relevance.  Stop asking them for a while if relevance is consistently low.  Monitor the modules closely to see if they’re low performers across the board, and if so reassess how you’re setting their bar for relevance.  From there, user input drives it.  For searches, scan the query string for topic names and use that as a guide for including modules.

There are a few too many modules for my tastes on most topic pages, and the ones towards the bottom of the page can get kind of weird, like how-to’s for George Bush, but it’s also a fairly new site and no doubt will improve.


DailyMe site review

September 10, 2008

DailyMe is another new news site I thought I would chime in on.  Wow, these are coming up like weeds.  

They serve up syndicated content, and add some tools to customize article selection and delivery.  I would put them in the same category as Newsvine, although that company is more community-oriented.  In the business there are those of us that display content, and those of us that just show excerpts and link to it.  Daylife is in the latter category when it comes to articles, but in the former when it comes to images.

DailyMe has a very nice interface for customizing article selection.  They’ve broken the news into a few dozen categories, organized in a hierarchy two levels deep, and you can pull from any level in the hierarchy with the option of adding keywords.  It’s one of the best ways of doing this that I’ve seen.  It’s not hard to do on the backend, so I won’t even bother discussing it; the site is mostly about the UI and not data analysis.  The meme-it feature, which allows you to indicate whether the article is insightful/enlightening/tragic/weird/humorous/uplifting, is kind of lame.  Those are not among the bag of adjectives I select from when describing news articles, but then my vocabulary is notoriously skewed towards the negative.  But they’re trying something new.  The feature is already there, so I would give it a month or two, then cut it if it doesn’t get much usage (and I doubt it will).  Whether or not the nice content selection interface is enough, I’m not sure.  It’s only a marginal improvement on what’s out there now, and marginal improvements sometimes aren’t enough to overcome adoption barriers when others are already established in the marketplace.

What’s more interesting is their Publisher Services, which they’ll be demonstrating at ONA08.  That’s more of an open market than attracting viewers.  From their description, “licensing the same technology and content”, it seems like white-labeling their site with the ability to add your own.  Perhaps a good play towards small news outlets, like your hometown rag or radio station, that want to display full articles.  It’s better than building a third-rate portal.  So they’re in a somewhat different market than Daylife, although there is a small amount of overlap.  When I learn more, I’ll post it.  I wish however this whole syndication model would go away, and publishers would put up their own articles and link to others when they need to.  But I’ll let more competent folk rant about that.