Calais Notes From The NY Semantic Web Meetup

The Calais Meetup talk today by Tom Tague of Thomson Reuters provided a wealth of information not generally available.  The slides should be available soon, but most of the interesting information was conveyed verbally, so I have made notes below.  Tom is an excellent evangelist, and amazingly open about their long-term objectives and some of their inner workings.

I’ll assume you’re familiar with the service from my previous post.  The notes are more or less in the order the topics came up.

  • They currently provide a dozen high level news categories (sports, business, etc.), using a simple bag-of-words machine learning classifier.  In mid-2009, they will be expanding this to 300.
  • Named entity recognition (NER) is generally accomplished using lexicons and hand-crafted rules.  They have a proprietary language for expressing these rules.  Thomson Reuters has at their disposal a massive amount of data (“oompty-zoomp petabytes” is the official size), and the Calais team also pulls in external lists from sources like dbPedia.
  • NER for companies is accomplished using a list of 17MM aliases. These aliases seem to be provided by the URI endpoints for the companies.  If an extracted name does not resolve unambiguously, then a selection is made based on the presence of other entities and terms within the document.  The service indicates whether an extracted entity was disambiguated, and this only happens if reliability is high.
  • They hope to eventually provide dials to adjust analysis depth and time, and perhaps allow for multi-pass processing.
  • All linked data is generated on-the-fly.  For example, company information may change, and as soon as it is reflected in their linked database, it will be reflected in their service. 
  • The Calais team consists of about 25 NLP experts and 25 developers.
  • Many of the components, including their metadata engine and disambiguation, are running in an EC2 cloud.  They use Hadoop heavily.  I have no idea what he’s talking about here, but if you’re a developer or architect I thought it might be interesting information.
  • They have 9000 registered developers, and 90% of them have used their API keys to process at least one document.  That’s a very respectable rate for follow-through.
  • They are currently processing 1MM transactions per day.
  • Tag clouds are the mullets of Web 2.0.  Nothing to do with Calais, but an amusing metaphor.
  • Mail&Guardian is using their service to extract people and places, and uses this to build a content index and topics.  They also have created a map for countries.
  • HealthcareITNews uses Calais to extract info and formulate searches into their catalog of content.
  • In 2009 they plan on adding NER for German, Chinese, Spanish, Hebrew, and possibly Portuguese.  They of course launched French yesterday.  There may have been another language or two in the list, perhaps Italian, but I’m not a stenographer.  This is a significant undertaking.
  • They are considering person disambiguation for 2009 for certain domains, for example sports, politics, or science.
  • Another possible item for 2009 is an opt-in for a SPARQL endpoint.  They retain all of the metadata they generate, and stuff it in a triple-store.  This, clearly, would be a powerful feature.  
  • Other things they are considering for 2009:  user-managed lexicons; disambiguation for other named entity types; expanding endpoints for entities and events; exposing an IDE for users.  I wasn’t quite sure what he meant by IDE in this context.
  • They’re perfectly happy to have you use their service to train or tune other NLP systems, and admit that even if they were not happy, there is nothing they could do to prevent it.
  • They encourage the use of GUIDs globally, and as a way to syndicate metadata.  For example, instead of bundling the metadata yourself, use the Thomson Reuters GUID as a handle and let them store it.
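
The company disambiguation step in the notes above (look a name up in an alias list, then use other terms in the document to choose among candidates, and only commit when reliability is high) can be roughed out as follows.  This is only my own minimal sketch, not how Calais actually does it: the alias table, the context terms, the identifiers, the resolve function, and the min_margin threshold are all made up for illustration.

```python
import re

# Hypothetical alias table: lower-cased surface form -> candidate company ids.
# (The real service reportedly uses a list of about 17MM aliases.)
ALIASES = {
    "apple": ["apple-inc", "apple-records"],
    "thomson reuters": ["thomson-reuters"],
}

# Hypothetical context terms that hint at each candidate.
CONTEXT = {
    "apple-inc": {"iphone", "mac", "cupertino", "software"},
    "apple-records": {"beatles", "album", "label", "music"},
    "thomson-reuters": {"news", "financial", "data"},
}


def resolve(name, document, min_margin=2):
    """Return a company id for `name`, or None if no confident choice exists.

    min_margin is an assumed stand-in for "reliability is high": the best
    candidate must beat the runner-up by at least this many context terms.
    """
    candidates = ALIASES.get(name.lower(), [])
    if not candidates:
        return None  # name is not in the alias list at all
    if len(candidates) == 1:
        return candidates[0]  # resolves unambiguously, no context needed

    # Score each candidate by how many of its context terms occur in the text.
    words = set(re.findall(r"[a-z0-9]+", document.lower()))
    scored = sorted(
        ((len(CONTEXT[c] & words), c) for c in candidates), reverse=True
    )
    (best_score, best), (second_score, _) = scored[0], scored[1]

    # Only commit when the evidence clearly favours one candidate.
    if best_score - second_score >= min_margin:
        return best
    return None


text = "The new iPhone and Mac software were shown in Cupertino."
print(resolve("Apple", text))
```

With no supporting context (say, a document that mentions only "Apple"), the sketch returns None rather than guessing, which mirrors the note that disambiguation only happens when reliability is high.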

Tom also discussed some of the motivations behind Calais.  I’m going to do a bit of interpretation.  Thomson Reuters has a large collection of subscription data services.  They eventually want to link to these services.  Widespread use of Calais makes it easier for customers to reach these subscription data services, ultimately increasing Thomson Reuters’ ability to extract revenue from them.

As a final word, this could power a lot of great applications, and their roadmap looks promising.  A lot of people have rolled their own systems over the years to do something similar but not as well, and this could have saved them a lot of work.  Depending on Thomson Reuters is not ideal (I’d rather have the source, of course), but the fact that they’re ultimately in it for revenue from other subscription data services provides some comfort.  Passion, largess, and good business:  judge for yourself which is the most reliable motive.

--

Good article on Calais from ReadWriteWeb

Slides are available online.

3 Responses to “Calais Notes From The NY Semantic Web Meetup”

  1. New Calais Version, More Linked Data « NP-Harder Says:

    [...] Calais Notes From The NY Semantic Web Meetup « NP-Harder Says: January 15, 2009 at 11:33 pm [...]

  2. Alex Genadinik Says:

    Hi,

    I mentioned your Open Calais notes in my blog at semanticalley.com

    Thanks for posting the notes.
    - Alex

  3. Questions for Open Calais? « Network(ed)News Says:

    [...] which may be best known to journalists through its association with Jeff Jarvis, took a stab at answering the “why free?” question: Thomson Reuters has a large collection of subscription data services. They eventually want to link [...]
