It’s late on a Friday, and I’ve lately been mesmerized by the near collapse of the financial sector, but I thought I would toss out a quick review of AdaptiveBlue and a guess at what they’re doing under the hood. They are another entry into the increasingly crowded market of automatically marking up text with links.
The most interesting product is their SmartLinks. Automatically marking up blog entries, or decorating links with pop-up information and further links, is not particularly easy. Most of the products I see do some sort of entity extraction or topic identification. That can run the gamut from numbingly complicated to numbingly simple. On the complicated end, you can use computationally expensive statistical techniques that attempt to disambiguate different people with the same name and to identify nicknames and shortened versions. On the simple end, you can build up a list of names, say from a reference like Wikipedia, build a giant regular expression, and do string matching. Massive regular expressions can run quickly, say 50 thousand names in a hundred milliseconds or so. In either case, it’s a noisy process.
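To make the simple end concrete, here is a minimal sketch of the giant-regular-expression idea in Python. The name list, the function names, and the sample sentence are all made up for illustration, and how fast it runs at 50 thousand names will depend on the regex engine you use.

```python
import re

def build_name_pattern(names):
    """Compile one big alternation out of a list of known names."""
    # Try longer names first so "Lehman Brothers" wins over plain "Lehman".
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Word boundaries keep us from matching inside longer words.
    return re.compile(r"\b(?:" + alternation + r")\b")

def find_names(pattern, text):
    """Return (offset, matched name) pairs in document order."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]

# Hypothetical name list; a real one might be harvested from Wikipedia titles.
names = ["Warren Buffett", "Lehman Brothers", "Lehman"]
pattern = build_name_pattern(names)
matches = find_names(
    pattern, "Lehman Brothers filed; Warren Buffett did not buy Lehman."
)
```

Note that this is pure string matching: it has no idea whether "Lehman" is the bank or somebody's uncle, which is exactly where the noise comes from.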
AdaptiveBlue has a yet simpler method for decorating these links, and simple is good. They don’t try to identify, for example, which words represent a book or a person. They look at the links a human has already added, and at the target of each link. If you link the name of a company to a finance site, they generate a pop-up with simple information and options for navigation. Say someone is linking to stockpkr.com. I prefer Yahoo Finance, so I can go there from the pop-up instead. One disadvantage is that it requires another click to get there, but for certain narrow sites I can see an advantage. Movie reviews, music, or financial sites all seem like good candidates.
How do they do it? Take links to financial sites as an example. First identify a set of sites that financial bloggers like to link to. Maybe you come up with a dozen. When you see a link to one of those sites, there are a couple of strategies for working out what it points to. Grabbing the anchor text won’t always be illuminating, since you might just link the words “this lousy, wretched book” without mentioning its name. However, in the case of books, the ISBN can be extracted from the target URL for the two sites they support. Given the ISBN, you can pull in related information like an image of the cover and a description from Amazon, and generate links out to other sites. For books, the links out need not be based on the ISBN, since you now know the title and author with complete certainty from Amazon. If the ISBN isn’t in the URL, there’s a chance you can still scrape it from the destination page, or set up other extraction algorithms, although I’m not sure if they’re going this far. As a last resort, you can use the anchor text, but as mentioned that won’t always be reliable.
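As a rough sketch of the ISBN step, the Python below pulls a 10-character ISBN out of a link target and checks it. I don’t know which URL layouts AdaptiveBlue actually parses, so the path prefixes, the books.example.com host, and the function names are my assumptions, not their implementation. The checksum test is the standard ISBN-10 rule.

```python
import re
from urllib.parse import urlparse

# Assumed URL layout: the ISBN-10 follows a short path prefix such as /dp/.
ISBN10_IN_PATH = re.compile(r"/(?:dp|gp/product|ASIN)/(\d{9}[\dXx])(?:/|$)")

def is_valid_isbn10(candidate):
    """Standard ISBN-10 check: the weighted digit sum is divisible by 11."""
    total = 0
    for position, char in enumerate(candidate):
        value = 10 if char in "Xx" else int(char)
        total += (10 - position) * value
    return total % 11 == 0

def isbn_from_url(url):
    """Return the ISBN-10 embedded in the URL path, or None if there isn't one."""
    match = ISBN10_IN_PATH.search(urlparse(url).path)
    if match is None:
        return None
    candidate = match.group(1).upper()
    # The checksum filters out ten-digit strings that merely look like ISBNs.
    return candidate if is_valid_isbn10(candidate) else None

# Hypothetical link target, using a well-known valid ISBN-10.
found = isbn_from_url("https://books.example.com/dp/0306406152/ref=x")
```

With a verified ISBN in hand, the cover image, description, and links out to other sites are a lookup away, and the anchor text never has to be trusted.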
This isn’t terribly earth-shattering. They work through a few domains like music, finance, and books, and hand-craft some retrieval rules: ISBNs for books, ticker symbols for stocks, with some more general methods as a backstop if they prove to be of sufficiently high fidelity. There are many companies out there in the business of marking up or otherwise decorating content with links and additional information. Many of them opt for natural language processing techniques. You can, for example, try to identify all of the book titles in a blog post, and highlight and link them. AdaptiveBlue’s approach is simpler and cleaner, and will give you pretty good precision and recall. The writer is in control of what gets linked, and the target provides great context. Is it generally applicable? No. Can you drop it onto any blog on any topic and get good results? No. It targets a few niches, and handles those niches well. The development effort is relatively small. That will carry the day over complicated general-purpose techniques.
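The hand-crafted rules could be organized as a small table keyed by the link’s host, with anchor text as the backstop. This is only my guess at the shape of such a system; the hosts, the `s` query parameter, and every name in the sketch are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs

def ticker_from_query(parsed):
    """Assumed layout: the ticker symbol travels in an `s` query parameter."""
    values = parse_qs(parsed.query).get("s", [])
    if values and re.fullmatch(r"[A-Z]{1,5}", values[0].upper()):
        return values[0].upper()
    return None

def isbn_from_path(parsed):
    """Assumed layout: a 10-character ISBN appears as its own path segment."""
    match = re.search(r"/(\d{9}[\dXx])(?:/|$)", parsed.path)
    return match.group(1).upper() if match else None

# One hand-written rule per supported site (hypothetical hosts).
RULES = {
    "finance.example.com": ("stock", ticker_from_query),
    "books.example.com": ("book", isbn_from_path),
}

def classify_link(href, anchor_text):
    """Return (category, key, source), or None for sites outside the niches."""
    parsed = urlparse(href)
    rule = RULES.get(parsed.netloc.lower())
    if rule is None:
        return None  # unsupported site: leave the link undecorated
    category, extract = rule
    key = extract(parsed)
    if key is not None:
        return (category, key, "url")
    # Last resort: fall back to the (less reliable) anchor text.
    return (category, anchor_text.strip(), "anchor")

result = classify_link("http://finance.example.com/q?s=goog", "Google")
```

Adding a niche means adding one row and one small extractor, which is why the development effort stays small.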
I’m not so hot on the Firefox plugin. Related pages from Sphere and Google are pretty simple to build in: both have URLs that will fetch related content automatically, though the results aren’t particularly helpful, and related sites are usually even less helpful. Saving and sharing aren’t things I typically do, and many sites have plugins to help with things of this nature.
Posted by Ken Ellis