On Friday I gave a short presentation at Daylife on some new article ranking algorithms that we’re developing. I started with some background and tried to explain the venerable PageRank algorithm a la Google, 1998. To do that, I wanted to explain eigenvalues. I suspect most of the audience at some point were taught about eigenvalues, but that few if any ever used them, and for me at least understanding only comes through use, or through the sort of reflection one does not have time for in college.
Another bit of reflection, which with my lengthy train commute I now have time for, is on the naming of things, like “Eigenvalue”. Computer Science is particularly rife with long names for simple things. Take for example the Levenshtein distance, with appologies to Mssr. Levenshtein. What a trivial thing, it hardly deserves a name. It should have been given a descriptive moniker.
It appears to me that much of the educational process is about establishing hollow, ragged shells around knowledge, and this is not such a bad thing. If you aren’t completely up on Green’s Theorem, you’ve heard about it, and can at least circumscribe the subject and thereby understand it to some degree, if you’ll excuse the pun. That it has been assigned a unique and non-descriptive name lets you discuss it without fully understanding it, and obtain all the details if you need to. In the idiom of computer science, it has been reduced to a class, whose inner workings you do not understand, but you know a few of the method profiles. If you ever need the details, you can find them. This I suppose is how the mind prevents itself from getting tangled in details.
Back to eigenvalues. I wanted to present PageRank and the eigenvalue problem simply, and engage the audience, so I came up with the following demonstration. You will need three volunteers, and 24 pennies.
1) Give each volunteer 8 pennies, and name them “A”, “B”, and “C”.
2) Have each volunteer exchange money given the following rules:
A: 25% to A (self), 25% to B, 50% to C
B: 100% to A
C: 50% to A, 50% to B
3) Have the volunteers count their pennies. You should get:
A: 14 cents
B: 6 cents
C: 4 cents
4) OK, so not much interesting there. Pennies keep shuffling around. But where will the money spend most of its time? Lets try a new starting point. Have A give C two of their pennies. Now you are starting with:
A: 12 cents
B: 6 cents
C: 6 cents
5) Apply the same transfer rules as in step 2.
6) Have each volunteer count their pennies. You should get:
A: 12 cents
B: 6 cents
C: 6 cents
The voluneers have the same number of pennies as they started with! Wow! (Fake excitement here if you have to.) This is a very important result, it describes a stable point in the solution, and also where the money will spend most of its time. Although we’re not going to prove the latter.
It so happens that the money distribution in step 6 is the eigenvector, and its eigenvalue is one. This is also the PageRank problem, with
cents -> page views
volunteers -> web pages
transfer rules -> page links
The transfer rules are represented by a 3×3 matrix, where the first column represents A’s transfer rules, and the first row represent’s the transfers to A, and so on. You can think of this matrix as an operator that causes everyone to follow a link:
[ 0.25 1 0.5
0.25 0 0.5
0.5 0 0 ]
This is also a great way to visualize some of the finer points about applying PageRank. What happens if someone takes money, but doesn’t give it out to anyone? This is called a dangling node in the graph. What happens if you have two distinct groups of people, and no members of one group sends money to anyone in the other group? This is a block-diagonal matrix, and the eigenvector can be an arbitrary mixture of the eivenvectors of each block, technically an eigenspace.
One answer is to take a small fraction of everyone’s money, a tax if you will, and redistribute it evenly to everyone. This eliminates dangling nodes, the analgy with PageRank being that upon hitting each node, you have an equal chance of going to any web page. For the block diagonal transfer matrix, this provides a mechanism where the money in each gradually equalizes, and each block is weighted according to the number of members in the block. A democratic mechanism, since we lack any other way to rank their importance.
Is PageRank the most accurate way to estimate page rank? Of course not, and literature is full of other, no doubt better, estimates. The great thing about PageRank though is its simplicity, and that it reduces to a very common problem — finding an eigenvector. There are very fast algorithms for even large matricies. So it satisfies one of my tenants, keep it simple. Applying something as simple as PageRank on a large scale will present plenty of problems.
It reminds me of a quote attributed to Brian Kernighan, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” Not strictly true, but its a simple rule that is approximately correct, something that in this business usually wins.
RSS