The Small World Web

In the paper The Small World Web by Lada A. Adamic, the author attempts to demonstrate that the largest strongly connected component of the link graph representing the web satisfies the small world phenomenon. That is, she shows that sites in the largest component are separated by an average of 4.2 hops, which is roughly the same as in a random graph of the same size, while the clustering coefficient is 0.081, which is larger than the clustering coefficient of a random graph of the same size. For those unfamiliar, the clustering coefficient is defined as follows: at each vertex v, if v has k_v neighbors, then there can be at most k_v*(k_v-1) directed edges between them. If C_v is the fraction of those edges actually present, then the clustering coefficient is just the average of C_v over all vertices. According to the original paper of Watts and Strogatz, a graph satisfies the small world phenomenon if the average number of hops between sites is approximately equal to that of a random graph of the same size, while the clustering coefficient is larger than that of a random graph.
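
In symbols (just restating the definition above, with e_v standing for the number of directed edges actually present among v's neighbors and n for the number of vertices):

C_v = \frac{e_v}{k_v (k_v - 1)}, \qquad C = \frac{1}{n} \sum_{v} C_v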

It’s important to note that the author only considers the web at the site level as opposed to the page level. This means that two web pages hosted on the same site are treated as a single node, which makes sense as a simplification since pages within a site are usually strongly connected anyway. In addition to arguing that the web is a small world graph at the site level, the author goes on to propose a new search algorithm based on hubs/authorities on small world graphs. The idea is as follows: first, query the search words and gather a list of sites containing pages relevant to those words. Next, find the strongly connected components of the returned sites, and for each component calculate the average number of hops to estimate which site is its center (i.e., the site whose distance to any other site in the component is roughly the average number of hops). With the centers in hand, we can build spanning trees rooted at them and calculate the clustering coefficient of each component. The search engine then returns the centers to the user in a ranked fashion, along with relevant data such as the component sizes, clustering coefficients, average numbers of hops, and spanning trees.
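
To make that pipeline concrete, here is a minimal sketch in Python (not the paper's own code; the directed graph site_graph over the returned sites is assumed to come from the query step, and networkx is used for the graph utilities):

import networkx as nx

def small_world_search(site_graph: nx.DiGraph):
    # For each strongly connected component of the returned sites, estimate
    # a center, a spanning tree rooted there, and the clustering coefficient,
    # then rank the components before returning the centers.
    results = []
    for nodes in nx.strongly_connected_components(site_graph):
        if len(nodes) < 2:
            continue
        comp = site_graph.subgraph(nodes)

        def avg_hops(v):
            # Average number of hops from v to the rest of the component;
            # the site minimizing this serves as the component's "center".
            dist = nx.single_source_shortest_path_length(comp, v)
            return sum(dist.values()) / (len(comp) - 1)

        center = min(comp, key=avg_hops)
        results.append({
            "center": center,
            "size": len(comp),
            "avg_hops": avg_hops(center),
            # networkx's directed clustering coefficient, standing in for
            # the average-of-C_v definition above.
            "clustering": nx.average_clustering(comp),
            "spanning_tree": nx.bfs_tree(comp, center),
        })
    # Rank, e.g. by component size, before returning the centers.
    return sorted(results, key=lambda r: r["size"], reverse=True)

Even this toy version hints at a cost problem: finding a center requires a shortest-path computation from every site in its component.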

This type of search has the benefit that even if the center returned to you is not very useful, a useful page is probably only a few clicks away, since the components are small world graphs. The presented search seems to build off of the hubs/authorities idea, as the centers returned will most likely be hubs, while the pages they link to will be authorities. The author goes on to make some arguments for why the search will return good results, but I do not believe it is very practical. In most searches these days, Wikipedia would very probably be included in the results, and since it links to many other sites which likely have paths back to it, you may encounter very large components. Calculating the components, the necessary quantities, and the spanning trees would then take considerable computing power and probably couldn't be done quickly enough. Also, given Wikipedia's popularity, it would probably be the center of a very large component, so it would be returned first. This is no more helpful than just querying Wikipedia by itself.
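
To put a rough figure on that cost (the numbers here are my own illustration, not from the paper): locating a center as described takes a breadth-first search from every site in the component, i.e. O(n(n+m)) time for a component with n sites and m links. If Wikipedia anchors a component of even n = 10^5 sites averaging 10 outgoing links each, that is on the order of 10^11 node and edge visits for a single query, far beyond what interactive search can afford.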

Another issue with the paper is that it was written around the year 2000, and the web has evolved a great deal since then. It may no longer satisfy such a nice small world phenomenon, and with Wikipedia dominating, I can't imagine this search would be very effective.
