Saturday, March 28, 2009

How do Search Engines Work?

Search engines do not really search the World Wide Web directly. Each one searches a database of web pages that it has harvested and cached. When you use a search engine, you are always searching a somewhat stale copy of the real web page. When you click on links provided in a search engine's search results, you retrieve the current version of the page.
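To picture the difference, here is a tiny Python sketch (the function names are my own, purely illustrative): the copy stored at crawl time is what the engine searches, while clicking a result fetches whatever the page says today.

    from urllib.request import urlopen

    cache = {}  # the engine's stale copy, captured at crawl time

    def snapshot(url):
        """Store the page as it looks right now; searches run against this copy."""
        cache[url] = urlopen(url, timeout=10).read()

    def click_result(url):
        """Following a result link bypasses the cache and fetches the live page."""
        return urlopen(url, timeout=10).read()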

Search engine databases are selected and built by computer robot programs called spiders. These "crawl" the web, finding pages for potential inclusion by following the links in the pages they already have in their database. Spiders cannot imagine new pages into existence or type terms into the search boxes they find on the web; they can only follow links.
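Here is a minimal sketch of that crawling loop in Python, assuming standard-library networking and a simple breadth-first frontier; a real spider adds politeness delays, robots.txt checks, and deduplication far beyond this.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=10):
        """Breadth-first crawl: fetch known pages, harvest their links."""
        frontier = deque(seed_urls)
        cached = {}  # url -> page text: the database being built
        while frontier and len(cached) < max_pages:
            url = frontier.popleft()
            if url in cached:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue  # unreachable pages are simply skipped
            cached[url] = html
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)  # resolve relative links
                if absolute.startswith(("http://", "https://")):
                    frontier.append(absolute)
        return cached

Notice that every page in the crawl is reached through a link from a page already in hand, which is exactly why unlinked pages stay invisible to spiders.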

If a web page is never linked from any other page, search engine spiders cannot find it. The only way a brand new page can get into a search engine is for other pages to link to it, or for a human to submit its URL for inclusion. All major search engines offer ways to do this.

After spiders find pages, they pass them on to another computer program for "indexing." This program identifies the text, links, and other content on the page and stores it in the search engine database's files so that the database can be searched by keyword, along with whatever more advanced approaches are offered. The page will then turn up whenever your search matches its content.
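The data structure behind this indexing step is an inverted index: a map from each keyword to the set of pages that contain it. Here is a bare-bones sketch, assuming pages like those gathered by the crawler above and ignoring ranking, stemming, and phrase search.

    import re
    from collections import defaultdict

    def build_index(cached_pages):
        """Map each keyword to the set of page URLs that contain it."""
        index = defaultdict(set)
        for url, html in cached_pages.items():
            text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(url)
        return index

    def search(index, query):
        """Return the pages that contain every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(index.get(terms[0], set()))
        for term in terms[1:]:
            results &= index.get(term, set())
        return results

A call like search(index, "library catalog") then returns only the URLs whose cached text contains both words, which is the keyword matching described above in miniature.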

Many web pages are excluded from most search engines by policy. The contents of most of the searchable databases mounted on the web, such as library catalogs and article databases, are excluded because search engine spiders cannot reach them: there are no links leading into the individual records, and spiders cannot fill in the search forms that produce them. All this material is referred to as the "Invisible Web": the part of the web you don't see in search engine results.
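One common mechanism behind the policy exclusions is a site's robots.txt file, which well-behaved spiders consult before fetching anything. Python's standard library can perform the same check; the URLs below are placeholders, purely for illustration.

    from urllib.robotparser import RobotFileParser

    # A polite spider checks a site's robots.txt before fetching its pages.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MySpider", "https://example.com/private/report.html"):
        print("allowed to crawl")
    else:
        print("excluded by policy")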