The Deep Web

In class, we talked about the Web as a network. We also mentioned that much information on the Web is not indexed by search engines and is part of the internet’s “dark matter.” This content is also known as the deep Web, invisible Web, hidden Web, or Deepnet. The deep Web consists of:

  • Dynamic and scripted content - pages created on the fly in response to a request, or pages reachable only through JavaScript or Flash links
  • Unlinked content - pages that no other page links to (pages with no in-links)
  • Excluded content - pages that prohibit search engines from crawling them (for example, through a robots.txt file)
  • Non-HTML content - content other than Web pages is often considered part of the deep Web
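
To make the unlinked and excluded categories concrete, here is a minimal sketch of a crawler that follows links across a toy site. The page names, the link structure, the robots.txt rules, and the "toy-crawler" agent name are all invented for illustration; the only real API used is Python's standard urllib.robotparser. It is a simplification of how a search engine discovers pages, not a description of any particular engine.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical toy site: each page maps to the pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/private/report.html"],
    "/about.html": ["/index.html"],
    "/private/report.html": [],
    "/orphan.html": ["/index.html"],  # no other page links here
}

# Hypothetical exclusion rules for the toy site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(start, links, robots_txt, agent="toy-crawler"):
    """Breadth-first crawl that follows links and honors robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    seen = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Skip pages already visited and pages the site excludes.
        if page in seen or not rules.can_fetch(agent, page):
            continue
        seen.add(page)
        queue.extend(links.get(page, []))
    return seen


indexed = crawl("/index.html", LINKS, ROBOTS_TXT)
deep = sorted(set(LINKS) - indexed)
print("indexed:", sorted(indexed))
print("never reached:", deep)
```

In this toy setup the crawler never indexes /orphan.html (nothing links to it) or /private/report.html (robots.txt excludes it), even though both pages exist on the site.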

Popular search engines like Google, Yahoo, and MSN index the surface Web and some of the deep Web. These search engines have become more sophisticated at handling content other than standard HTML pages (for example, they now index PDFs, Word documents, and other file types), but they still leave much of the deep Web untapped.

Much of the deep Web's content is stored in databases and can be queried through Web front ends (such as searching nytimes.com for archived news articles), but each site must be searched individually. Some Web sites that can search across these deep Web sites are popping up. They work by coalescing search results from many databases. One example is Pipl, a people search engine that brings together results from MySpace, Flickr, Amazon, the Securities and Exchange Commission, and other resources to present a comprehensive list of results for a given person.
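
The idea of coalescing results from many databases can be sketched in a few lines. This is only an illustration under stated assumptions, not how Pipl or any real service works: the three "databases" are invented in-memory dictionaries standing in for per-site search forms, and the names make_search, search_all, and SOURCES are hypothetical.

```python
# Hypothetical stand-ins for per-site search forms. A real service would
# query each site over the network, each through its own interface.
PHOTO_SITE = {"ada lovelace": ["profile: ada_l"]}
BOOKSTORE = {
    "ada lovelace": ["wishlist: Ada L."],
    "alan turing": ["review by A. Turing"],
}
FILINGS = {"alan turing": ["filing on record"]}


def make_search(database):
    """Wrap one database in a search function that takes a person's name."""
    def search(name):
        return database.get(name.lower(), [])
    return search


SOURCES = {
    "photo_site": make_search(PHOTO_SITE),
    "bookstore": make_search(BOOKSTORE),
    "filings": make_search(FILINGS),
}


def search_all(name, sources):
    """Query every source and merge the hits into one list."""
    merged = []
    for source_name, search in sources.items():
        for hit in search(name):
            merged.append({"source": source_name, "hit": hit})
    return merged


results = search_all("Ada Lovelace", SOURCES)
for result in results:
    print(result["source"], "->", result["hit"])
```

The user issues one query, and the merging step hides the fact that each underlying database had to be searched separately.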

Posted in Topics: Technology, social studies
