Search engines all work in approximately the same way: a software agent, known as a bot, spider, or crawler, visits a page, gathers its content, and stores it in the search engine's data repository. Once the information is in the repository, it is indexed. Crawling and indexing are continuous, ongoing processes. Each of the major search engines maintains multiple crawlers that work constantly to refresh its index. The spiders find new pages by a variety of methods, typically including XML sitemaps, URLs already in the index, links discovered on pages while indexing, and URLs submitted for inclusion by users. How frequently they visit a specific site, and how deeply they spider it on each visit, vary.
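The discovery process described above can be pictured as a breadth-first walk over links, starting from a set of known URLs. The following is only a minimal sketch under stated assumptions, not how any real search engine is implemented: the `FAKE_WEB` dictionary, the `crawl` function, and the `max_depth` parameter are all hypothetical names, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch pages over HTTP instead.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/d"],
    "http://example.com/d": [],
}


def crawl(seeds, max_depth):
    """Visit pages breadth-first from the seed URLs.

    Seeds stand in for sitemap entries, URLs already in the index, and
    user-submitted URLs. max_depth limits how deeply the site is spidered.
    Returns the URLs in the order they were visited.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    frontier = deque((url, 0) for url in unique_seeds)
    seen = set(unique_seeds)
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is stored, but its links are not followed
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited
```

Calling `crawl(["http://example.com/"], 1)` visits the home page and the two pages it links to, while a larger `max_depth` reaches pages further from the seed. This mirrors the point that crawl depth differs from visit to visit.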
When a user visits the search engine and runs a search, the search engine extracts from its index a list of pages that are relevant to the query and displays that list to the user. The order of the output on the search results page is determined by each search engine's own criteria. The ranking methodology used by each engine is the result of that search engine's secret algorithm.
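To make the lookup step concrete, here is a toy inverted index: a mapping from each word to the set of pages that contain it. This is a sketch for illustration only. The names `build_index` and `search` are hypothetical, and because real ranking algorithms are secret, the results are simply returned in alphabetical order as a placeholder for ranking.

```python
from collections import defaultdict


def build_index(pages):
    """Build a word -> set-of-URLs mapping from a {url: text} dictionary."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query.

    Sorting alphabetically is a placeholder; it is NOT a real ranking method.
    """
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)
```

For example, with an index built from two pages whose text is "joomla seo tips" and "joomla templates", a search for "joomla" matches both pages, and a search for "joomla seo" matches only the first.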
The search engine's crawler is primarily interested in certain types of information on the page, particularly the URL, the text, and the links on the page. Formatting is not indexed. Images and other media are indexed by most search engines, but to varying degrees of depth. Some types of media, such as Flash or attached files, are rarely indexed, though there are exceptions.
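As a rough illustration of what a crawler keeps from a page, the sketch below uses Python's standard `html.parser` module to collect the visible text and the link targets while discarding the formatting tags. The `PageExtractor` class and `extract` function are hypothetical names, and the approach is an assumption made for illustration, not a description of how Googlebot or any other crawler actually parses pages.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text and link targets; ignore formatting markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (text, links) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

Given a page containing a heading, a paragraph with bold text, and one link, `extract` returns the plain words and the link's `href`. The bold tag and the stylesheet rules leave no trace in the output, which is the sense in which formatting is not indexed.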
Note
Seeing what the spider sees
If you have a Google Webmaster account, you can see a web page exactly as the Googlebot (the name of the Google crawler) sees it. To do this, log in to Google Webmaster Tools (http://www.google.com/webmasters/) and click on a site profile. In the navigation menu on the left, select the Diagnostics menu and then select the option Fetch as Googlebot. Type the URL of the page you want to see and, after a delay, the system will produce the results. You can see a web page, as shown in the following screenshot, followed by the Googlebot's view of the same page: