The Clever System

“Mining the Web’s Link Structure”

Chakrabarti, S.; Dom, B.E.; Kumar, S.R.; Raghavan, P.; Rajagopalan, S.; Tomkins, A.; Gibson, D.; Kleinberg, J.
Computer Magazine
Volume: 32  Issue: 8  Aug 1999
Pages: 60-67

http://ieeexplore.ieee.org/iel5/2/16967/00781636.pdf?tp=&arnumber=781636&isnumber=16967

In 1999, when this article was written, search engines were far less effective than they are today. A search would often return thousands of sites, many of them only loosely relevant. Search engines at that time matched keywords against the body text of a page only, so they sometimes missed the sites with the most information on the topic. One example from the paper is that a search on “Japanese Automobile Manufacturers” wouldn’t necessarily find the homepages of Honda or Toyota, because those exact words don’t necessarily appear in the body of these sites. Similarly, a search on “British Rock Bands” wouldn’t necessarily bring up The Rolling Stones’ homepage. To find the relevant sites for queries like these, the researchers who wrote this paper (including our own Jon Kleinberg!) developed the Clever system.

Clever is a search engine that works by analyzing hyperlinks to find two kinds of pages: authorities, which are sites that are frequently linked to, and hubs, which provide collections of links to different authorities. The system uses the HITS (Hyperlink-Induced Topic Search) algorithm. To explain this algorithm, it is best to think of the Web as a very large directed graph, where nodes are web pages and edges are hyperlinks. The Web had about 300 million pages at the time (compared to about 10 billion now), with each page linking to many others. The user enters search keywords, which narrows the 300 million pages to about 200 that contain the keywords in their body text. This initial set may or may not contain the ideal search results, so the program follows the hyperlinks on each of these initial pages to see which sites they link to. If enough of the initial pages link to the same sites, Clever recognizes those sites (nodes with many incoming edges) as authorities. Clever then places the authorities with the most links to them at the top of the “search results” page. Directed graphs, very similar to the graphs of social networks that we studied in class, are really the fundamental driving force behind the Clever search engine and others like it, such as Google.
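To make the hub and authority idea concrete, here is a minimal Python sketch of HITS-style scoring on a tiny hand-made link graph. It is an illustration, not Clever's actual implementation: rather than simply counting incoming links, it uses the commonly described iterative form of HITS, in which a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function names, the iteration count, and the page names in the example are all assumptions made up for this sketch.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit length so they stay bounded across iterations."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    links maps each page to the list of pages it links to.
    The iteration count is an arbitrary choice for this sketch.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: add up the hub scores of pages that link here.
        new_auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]

        # Hub update: add up the authority scores of the pages linked to.
        new_hub = {
            page: sum(new_auth[target] for target in links.get(page, ()))
            for page in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical link graph: two list-style pages and a blog linking to carmakers.
example_links = {
    "car-list-a": ["honda", "toyota"],
    "car-list-b": ["honda", "toyota", "nissan"],
    "some-blog": ["honda"],
}

hub_scores, auth_scores = hits(example_links)
ranking = sorted(auth_scores, key=auth_scores.get, reverse=True)
print(ranking[:3])
```

In this toy graph the page with the most incoming links ends up with the highest authority score, and the list page that points to the most authorities ends up with the highest hub score, which matches the intuition described above.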
