Web Taxonomies: File under “D” for Dead.

August 22, 2008

We had an interesting discussion about taxonomies today: when they are useful, and whether they are useful at all.  Some of our competitors have invested heavily in them, and sometimes we run across a client that wants them.

I suppose first I should define what I mean by taxonomy when it comes to news.  It refers to placing articles or topics in a hierarchical structure.  For example, senators are a subgroup of politicians, who are in turn a subgroup of people.  An article on the war in Iraq might be filed under Conflicts, which itself is filed under International Affairs.  Each topic has at most one parent, and may have many children.
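To make that definition concrete, here is a minimal sketch of a single-parent hierarchy.  The PARENT table and both helper functions are made up for illustration, using the examples above; nobody's real taxonomy is being quoted.

```python
# Hypothetical single-parent taxonomy: each topic maps to its one parent,
# or to None if it is a top-level topic.
PARENT = {
    "People": None,
    "Politicians": "People",
    "Senators": "Politicians",
    "International Affairs": None,
    "Conflicts": "International Affairs",
}


def path_to_root(topic, parent=PARENT):
    """Return the chain from a topic up to its top-level ancestor."""
    path = [topic]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def children(topic, parent=PARENT):
    """Return every topic whose single parent is the given topic."""
    return sorted(child for child, p in parent.items() if p == topic)


print(path_to_root("Senators"))
print(children("International Affairs"))
```

Because every topic stores exactly one parent, there is exactly one path from any topic to the top, which is what makes the structure a tree and not a looser web of relationships.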

Most news sites and print publications have some structure by which they categorize news.  The New York Times has a dozen or so sections, for Technology, Business, Sports, and so on.  But in common parlance most would not call this a taxonomy, and neither shall we.  It’s a high-level classification, and once you dive into, say, the Technology section, there are few if any subgroups.  Their Arts section, for example, is broken down into Books, Movies, Music, Television, and Theater.  All in all, there are only 25 groupings on their web site, of which 13 are top-level groups.  I suspect it’s a reflection of how a print paper might be organized to make it easier to read.  You want Business news?  Pull out the Business section, and you get a dozen or so pages.  From there, perhaps there are a few subsections that take up a page or two.

What most people mean by taxonomy is best demonstrated by Inform.  If you go to their US Politics page, you have a couple of top-level categories, each of which has ten or so sub-categories.  If you go into one of these subcategories, you have options for navigating further down the tree into additional subtopics.  For example, you can go from US Politics, to Great Plains States, to Kansas.  In this case, the hierarchy is three deep.  If each parent had ten child nodes, and everything went three deep, that would give them about a thousand topics.  I don’t know the exact count, but the point is it’s a lot, far more than the New York Times.
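The back-of-the-envelope count can be checked with a few lines.  This is only a sketch under the same assumptions as the estimate (a uniform ten children per parent, three levels); the function names are mine, and the real site is certainly not this regular.

```python
def topics_at_level(branching, level):
    """Number of topics at one level of a full tree (level 1 = top-level topics)."""
    return branching ** level


def total_topics(branching, depth):
    """Number of topics across all levels, 1 through depth."""
    return sum(topics_at_level(branching, level) for level in range(1, depth + 1))


print(topics_at_level(10, 3))  # topics at the deepest level
print(total_topics(10, 3))     # topics at all three levels combined
```

Under those assumptions the deepest level alone holds 1,000 topics, and counting the two levels above it brings the total to 1,110, so "about a thousand" is the right order of magnitude either way.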

Is it useful?  Say I want news on the NBA.  In the case of Inform, I mouse over Sports and click on Basketball.  Then I select the NBA subtopic.  Now I’m at the NBA page.  Or I enter “NBA” into the search box, and it takes me to the NBA topic page.  From there I can refine by different divisions.  It’s an OK experience.

Say, however, that I don’t have a taxonomy stored away in some database.  I enter NBA, and go to an NBA topic page.  Then I calculate on-the-fly what other topics happen to be related.  That fetches topics for players, teams, and perhaps, for the past week, the Beijing Olympics.  That calculation is based solely on what other topic assignments are made to the articles assigned to the NBA topic.  It will surface many of the same relationships that a taxonomy would, plus a lot of other things that it would not, like the Olympics, where the US team and a number of NBA stars are competing.  Score one for the computer.
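Here is roughly what that calculation could look like, as a minimal sketch.  The ARTICLES data and the related_topics function are invented for illustration; a real system would pull topic assignments from its article store, probably restrict them to a recent time window, and likely weight the counts instead of using raw co-occurrence.

```python
from collections import Counter

# Hypothetical articles, each reduced to the flat set of topics assigned to it.
ARTICLES = [
    {"NBA", "Kobe Bryant", "Beijing Olympics"},
    {"NBA", "LeBron James", "Beijing Olympics"},
    {"NBA", "Los Angeles Lakers", "Kobe Bryant"},
    {"US Politics", "Kansas"},
]


def related_topics(topic, articles, limit=5):
    """Rank the other topics by how often they share an article with `topic`."""
    counts = Counter()
    for topics in articles:
        if topic in topics:
            counts.update(topics - {topic})
    # Most frequent first; ties are broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


for name, shared in related_topics("NBA", ARTICLES):
    print(name, shared)
```

Nothing here says the Olympics "belong" under the NBA.  The relationship shows up only because this week's articles happen to carry both topics, and it will fade on its own once they stop.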

These on-the-fly relationships have some advantages, such as making associations based on what’s happening now in the news.  You can of course combine that with a taxonomy, but having two ways to navigate would probably be confusing: a rigid taxonomy next to a more fluid relationship engine.  On-the-fly relationships are also easier to maintain.  Constructing and maintaining taxonomies is something that humans do.  I would rather have the humans create and maintain topics, and let algorithms determine which are related.

Taxonomies are dead.  Many years ago, when web sites were less dynamic and high-traffic sites couldn’t perform complicated calculations every time someone looked at a page, they made some sense: you could define all of the relationships ahead of time.  So Moore’s Law killed them.  As the most prominent example, look at the Yahoo! Directory.  When was the last time you used it?  The human-edited taxonomy was their first offering, but most people now go to their Search feature.  Look also at Wikipedia.  There is a fairly light taxonomy for most topics.  If you go to the George W. Bush page, there are categories at the bottom that constitute a hierarchy.  Now tell me, do you ever use them?  I never have.

I have seen a few sites where a taxonomic structure might appear to be useful.  One of them, powered by Kosmix, is RightHealth.  It is a health-focused web site that for any topic has a list of related drugs and organizations.  If you go to the Diabetes page, on the right are related topics broken down by type, for example Drugs, Health Associations, and Diseases.  That could be accomplished by a taxonomy of entities (Products -> drugs), but it would be a lot easier if you just used tags, and tagged certain entities as drugs.  You can then pull in related topics tagged as drugs on-the-fly.  That is not a taxonomy; there is no hierarchy.  It’s just algorithms and tagging.
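A minimal sketch of the tagging approach follows.  TOPIC_TAGS and group_related_by_tag are hypothetical names with made-up data, and the list passed in is assumed to be whatever related topics the site has already computed for the page; I have no knowledge of how RightHealth actually does it.

```python
# Hypothetical flat tagging: each topic carries a type tag, with no parent or child.
TOPIC_TAGS = {
    "Insulin": "Drugs",
    "Metformin": "Drugs",
    "American Diabetes Association": "Health Associations",
    "Obesity": "Diseases",
}


def group_related_by_tag(related, tags=TOPIC_TAGS):
    """Bucket a list of related topics by their type tag, keeping input order."""
    groups = {}
    for topic in related:
        groups.setdefault(tags.get(topic, "Other"), []).append(topic)
    return groups


# Related topics for a Diabetes page, assumed to be computed elsewhere.
related_to_diabetes = ["Insulin", "Obesity", "American Diabetes Association", "Metformin"]
for tag, topics in group_related_by_tag(related_to_diabetes).items():
    print(tag, topics)
```

The sidebar groupings come from a single lookup per topic.  Adding a new drug means tagging one entity, not deciding where it sits in a tree.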

So taxonomies are dead.  Show me an example where they might be used, and I’ll show you a better way with on-the-fly relationships and a flat topic or tagging structure.  Counter-examples and arguments are welcome.