GoogleBot

In class we touched on how web pages get indexed and how search engines keep track of new pages, with sections on Crawling, Searching the Web, and Ranking Web Pages. However, we did not quite cover how the most popular search engine today actually does it. Hence, I thought it would be a good idea to write an article on how Google manages to be so successful at this.

According to an article by the University of California, Berkeley, millions of pages are added to the Web every day (about 7.3 million, to be more specific). So how is it possible to keep updating a search engine's index to take the newly created pages into account? Google does this through what is called Googlebot. As the Wikipedia article on Googlebot describes, it is an automated script (also known as a web crawler) that browses the Web in search of newly created pages and pages that have been extended or updated. The frequency at which it revisits a page varies: blogs, forums, and news sites are crawled most often, while largely static pages are crawled least often. Googlebot consists of two types of bots, Deepbot and Freshbot, which have related but distinct tasks. Deepbot follows links in search of new pages, while Freshbot looks for updates to pages that are already indexed. Google does this by requesting and fetching pages from many computers at the same time. Once a new link is found, Google adds it to a queue; pages are retrieved from that queue and added to the index database, which is in turn organized alphabetically. The ranking of the indexed pages is then done through PageRank, which is described in another article on this blog, and the results are matched to relevant queries.
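To make that fetch-queue-index loop a bit more concrete, here is a minimal sketch in Python of the kind of cycle described above. The seed URLs, the in-memory word index, and the crude regex link extractor are illustrative assumptions of mine, not Google's actual pipeline, which of course runs across many machines at once.

import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # queue of URLs waiting to be fetched
    seen = set(seed_urls)         # URLs already queued, to avoid duplicates
    index = {}                    # word -> set of URLs containing that word
    pages_fetched = 0

    while frontier and pages_fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue              # skip pages that cannot be fetched
        pages_fetched += 1

        # Record every word on the page in the index (very crude tokenization).
        for word in re.findall(r"[a-z]+", html.lower()):
            index.setdefault(word, set()).add(url)

        # Extract absolute links and queue the ones we have not seen yet.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    return index

# Example use: crawl(["https://example.com/"]) returns a small word -> URLs index.

Even this toy version shows why the queue matters: it is what lets the crawler separate discovering a link from actually fetching and indexing it.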

However simple these processes may sound, Google actually runs into numerous problems while indexing new pages, because spammers take advantage of them to get pages full of advertisements indexed. The methods vary, from plain Add URL spam (submitting many pages that point to promotional content) to cloaking, in which a page serves the crawler different content than it serves human visitors so that it gets matched to queries it is not really relevant to. Googlebot must therefore be able to deal with such pages so that they are neither returned for a query nor ranked among the first results. The algorithm is quite intricate and must be updated continuously, since spammers come up with new tricks on a daily basis (it is sometimes described as a war, with spammers trying to defeat the system and search engines trying to deliver a good service).
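One rough way to picture how a crawler could spot cloaking is to fetch the same URL twice, once announcing a bot-style User-Agent and once a browser-style one, and then compare the two responses. The sketch below does exactly that; the User-Agent strings, the similarity measure, and the threshold are my own illustrative assumptions, not how Google actually detects cloaking.

import difflib
import urllib.request

BOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"  # bot-style User-Agent
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"    # browser-style User-Agent

def fetch_as(url, user_agent):
    # Fetch the page while announcing the given User-Agent string.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return urllib.request.urlopen(request, timeout=5).read().decode("utf-8", "ignore")

def looks_cloaked(url, threshold=0.8):
    bot_view = fetch_as(url, BOT_UA)
    browser_view = fetch_as(url, BROWSER_UA)
    # A ratio near 1.0 means both views are almost identical; a low ratio
    # suggests the page serves different content to the bot than to a browser.
    similarity = difflib.SequenceMatcher(None, bot_view, browser_view).ratio()
    return similarity < threshold

# Example use: looks_cloaked("https://example.com/") returns True if the two views differ a lot.

A check like this is easy for spammers to work around, which is part of why the "war" mentioned above never really ends.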

This article covers some of the topics discussed in lecture while giving a more specific example of how crawling the Web happens. In lecture we discussed the general idea behind it; here I try to give an idea of how one particular search engine actually does it, and thus add to the explanation of web crawlers. It also touches on another kind of network, the one between spammers and the search engine: a constant interaction that, although not a friendly one, provides an example (like the ones from the beginning of the course) of the interaction between two competing groups.
