Networks that Cluster

Note: This was also posted on my own blog: http://aysz88.livejournal.com/27137.html. It may look better there.

In the February 16, 2007 issue of Science, Brendan J. Frey and Delbert Dueck apply “message-passing” networks to a problem that doesn’t seem particularly natural in that context: how does one cluster large numbers of objects, and then find the “center” object of each cluster? For example, can you create an algorithm that clusters images of faces, genes, airports, or even “sentences in [their] manuscript”?

(The article is linked at http://www.sciencemag.org/cgi/content/abstract/315/5814/972. I think all computers within the Cornell domain can access it for free; I’m not sure whether it’s free for those outside. The actual journal article is linked on the left sidebar. There are also links to a “perspective” on the article by Marc Mézard below the abstract - those links won’t take you to the actual article in question.)

Abstractly, the problem involves having a set of nodes and wanting to find a set of centers (“exemplars”) such that the errors are minimized. (Any way of measuring error seems to work just as well.) In their article, they explain that the previously popular method first selects a set of possible exemplars and then refines that set in each iteration, but that method does not always converge to the same set of exemplars - selecting a different initial set affects the results. Their algorithm, they claim, vastly improves on the previous one.

Their new algorithm, which they call “affinity propagation”, ends up being extremely powerful - a single run of their algorithm (while matching faces) beats out the best result from 10,000 randomly-initialized runs of the old method (see their Figure 2). They describe this as “two orders of magnitude” better. In their Figure 3, they graph how a single run of their algorithm on DNA also produces results far better than 10,000 runs of the old algorithm.

We’ve just started working with the Internet as a network - one could perhaps apply this to characterize whole sections of the Internet, if one could define a way to quantify the similarity of pages. Or, more practically, a search engine could cluster its search results into groups of similar results - a search on “Washington”, for example, might be automatically clustered into pages on the president, the state, or the city. An advantage here is that the algorithm doesn’t even need to know how many clusters to use - it can dynamically determine this using an abstract “how much does a node want to be a center” value.

Here is how the algorithm works - an attempt to express it at a high level without using the “guardian angels pointing” analogy of Mézard’s “perspective” article.

Their method takes two kinds of inputs:

  • Some way of expressing the “similarity” between two data points. (Similarity can be the negative of a “distance”, for example, and one could make the similarity different in each direction, but to simplify things I’ll assume they’re the same both ways.)
  • The value you submit for the similarity of a node to itself, which affects how much that node wants to be an exemplar - this is the node’s “preference” towards becoming an exemplar. High preferences produce lots of exemplars - low preferences produce fewer. (See the sketch after this list.)
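
To make that concrete, here is a minimal sketch of the similarity input in Python with NumPy. This is my own illustration, not the authors’ code - the function name and details are my choices - though the paper does suggest negative squared distance for points in space, and the median similarity as a default preference.

    import numpy as np

    def build_similarities(points, preference=None):
        # Similarity matrix for an (N, d) array of points:
        # S[a, b] = -(squared distance between points a and b).
        # The diagonal S[a, a] is each node's 'preference' towards
        # becoming an exemplar.
        diff = points[:, None, :] - points[None, :, :]
        S = -np.sum(diff ** 2, axis=-1)
        if preference is None:
            # The paper suggests the median similarity as a default,
            # which yields a moderate number of clusters.
            preference = np.median(S[~np.eye(len(points), dtype=bool)])
        np.fill_diagonal(S, preference)
        return S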

Now in each iteration, the nodes pass messages to each other:

  • Each node a figures out how much it wants to choose each and every other node b as its exemplar - a number is calculated for each b and that number is sent as a “responsibility” message from every node a to every other node b. (If you’re reading the article, note that the responsibility message from a to b is how much b is responsible for being the exemplar of a, not the other way around.)
  • Each node a also figures out how “available” it is to be an exemplar for each other point b, considering how much other nodes want it to be an exemplar. Another number, an “availability” message, is sent from each a to each b. The maximum availability between distinct nodes is zero, and it goes down from there.

(Be aware that responsibilities can eventually go negative, and availabilities between distinct nodes are always zero or negative, so the plus signs can be misleading in some of these contexts.)

To find the exemplar of a, find the b that maximizes (responsibility of b towards being the exemplar of a + availability of b towards a).

At the beginning, each node’s availability is zero. Now we can start an iteration.
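
In code (continuing the sketch above, with matrix names of my own choosing), the algorithm’s state is just two N-by-N message matrices, and the exemplar rule is a row-wise argmax:

    S = build_similarities(points)  # points: a hypothetical (N, d) array
    N = len(S)
    R = np.zeros((N, N))  # R[a, b]: responsibility of b towards a
    A = np.zeros((N, N))  # A[a, b]: availability of b towards a

    def exemplar_of(a):
        # The exemplar of a is the b maximizing responsibility + availability.
        return np.argmax(R[a] + A[a])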

In an iteration…

At the start of an iteration, all responsibilities are updated. To calculate the responsibility from a to b, a first considers how much it wants other nodes to be its exemplar instead - that is, the maximum of (availability plus similarity) over every node other than b. Call this number x. The similarity between a and b, minus x, is the responsibility from a to b (the responsibility b has towards being the exemplar of a).

To sum that paragraph up, the responsibility for b towards a = (the similarity between a and b) - (the best alternative (availability + similarity) for a).

The responsibility that a node has to itself is calculated as its preference minus the largest (availability + similarity) it has for any other node. The meaning of “self-responsibility” is how much a node wants to be an exemplar, but taking into account how well it could be assigned to another exemplar. The equation for “self-responsibility” is the same as for responsibility.
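
Continuing the sketch, here is one way to vectorize that update (again, my own code, not the authors’): for each node, find the best and second-best (availability + similarity) over all candidates; the second-best is the “best alternative” when scoring the best candidate itself. The same formula covers the diagonal, so self-responsibilities fall out of this for free.

    def update_responsibilities(S, R, A):
        AS = A + S                        # a fresh array, safe to modify
        idx = np.arange(len(S))
        best = np.argmax(AS, axis=1)      # best candidate exemplar per node
        best_val = AS[idx, best]
        AS[idx, best] = -np.inf           # mask it to find the runner-up
        second_val = np.max(AS, axis=1)
        R_new = S - best_val[:, None]     # subtract the best alternative...
        # ...except for the best candidate itself, whose best alternative
        # is the runner-up.
        R_new[idx, best] = S[idx, best] - second_val
        return R_new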

Now all the availabilities are updated.

The availability for b to be the exemplar of a is the minimum of:

  • 0, or
  • b’s self-responsibility (how much it wants to be an exemplar) plus the sum of the positive responsibilities b has towards every node other than a and b itself. (Negative responsibilities towards faraway nodes do not detract from how well b would serve as an exemplar.)

Since the maximum availability between distinct nodes is 0, it appears that all exemplars would end up with that maximum availability of 0.

A node’s self-availability is the sum of the positive responsibilities it has towards all other nodes.
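
In the same sketch notation, both availability rules can be computed at once: sum each candidate’s self-responsibility with its positive responsibilities towards everyone, subtract out each recipient’s own contribution, and cap everything except the diagonal at zero.

    def update_availabilities(S, R, A):
        Rp = np.maximum(R, 0)               # only positive responsibilities count
        np.fill_diagonal(Rp, R.diagonal())  # but keep self-responsibility as-is
        # Column b sums b's self-responsibility plus its positive
        # responsibilities towards everyone; subtracting row a's entry
        # excludes a itself from its own message.
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new.diagonal().copy()  # sum of positive responsibilities
        A_new = np.minimum(A_new, 0)          # cap between-node availabilities at 0
        np.fill_diagonal(A_new, self_avail)
        return A_new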

The authors of the paper also use a “damping factor” to avoid oscillations - each updated message is blended with its previous value, half-and-half in the paper’s case. They terminated their algorithm once the results had stayed steady for 10 iterations.
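
Putting the pieces together, a sketch of the full loop might look like this - the 0.5 damping and the 10-steady-iterations stopping rule come from the paper, while the iteration cap and the rest of the details here are my own guesses:

    def affinity_propagation(S, damping=0.5, stable_iters=10, max_iters=1000):
        N = len(S)
        R, A = np.zeros((N, N)), np.zeros((N, N))
        last, steady = None, 0
        for _ in range(max_iters):
            # Nudge each message back towards its previous value.
            R = damping * R + (1 - damping) * update_responsibilities(S, R, A)
            A = damping * A + (1 - damping) * update_availabilities(S, R, A)
            exemplars = np.argmax(R + A, axis=1)
            steady = steady + 1 if np.array_equal(exemplars, last) else 0
            last = exemplars
            if steady >= stable_iters:
                break
        return last  # each node's chosen exemplar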
