The first type of spider is one that actually “crawls” the web looking for websites and pages. This program starts at a website, loads the pages, and follows the hyperlinks on each page. In this way, the theory goes, everything on the web will eventually be found, as the spider crawls from one website to another. Search engines run anywhere from dozens to hundreds of copies of their web-crawling spider programs simultaneously, on multiple servers.
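If you’re curious what that looks like in practice, here is a small Python sketch of my own (not any search engine’s actual code) showing the first step a crawler performs: loading a single page and collecting the hyperlinks it would follow. The URL at the bottom is just a placeholder.

```python
# A minimal sketch of a crawler's first step: load one page and collect
# the hyperlinks on it. This is an illustration, not real search engine code.
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(url):
    """Load one page and return the hyperlinks found on it."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Placeholder URL -- substitute any page you want to examine.
    for link in extract_links("http://www.example.com/"):
        print(link)
```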
When a "crawler" visits your home page, it loads the page’s contents into a database.With some spider programs, that’s all they do – load the home page, so that another spider can actually index it. Once your site has been found, the text of your page is loaded into the search engine’s index, which is a massive database of web pages. The last time I checked, both Google and FAST claimed to have over 2 *billion* pages indexed. Google has over 3 billion documents in its database.
Some search engines don’t do any more than load the home page, but this is becoming more rare. A search engine that only indexes your home page, or a few pages linked from it, is doing what’s known as a “shallow crawl.” Most search engines nowadays do a “deep crawl,” which means that they follow the hyperlinks on your home page, loading the pages they find and successively getting deeper into your site. Some of them limit the number of pages they’ll index from a given site; others try to index everything.
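Here’s a rough Python sketch, again just my own illustration, of how a deep crawl differs from a shallow one: the crawler follows links breadth-first, but stops at a chosen depth and a chosen number of pages per site. It reuses the extract_links() helper from the sketch above, and the limits are made-up numbers.

```python
# A sketch of a depth-limited "deep crawl" of a single site.
from collections import deque
from urllib.parse import urljoin, urlparse

# extract_links() is the helper sketched earlier in this chapter.


def crawl_site(start_url, max_depth=3, max_pages=100):
    """Follow links within one site, up to max_depth and max_pages."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])      # (url, depth from the home page)
    indexed = []

    while queue and len(indexed) < max_pages:
        url, depth = queue.popleft()
        try:
            links = extract_links(url)   # load the page, collect its links
        except OSError:
            continue                     # page unreachable; skip it
        indexed.append(url)
        if depth < max_depth:            # max_depth=0 would be a "shallow crawl"
            for link in links:
                absolute = urljoin(url, link)
                # Stay on the same site and don't revisit pages.
                if urlparse(absolute).netloc == site and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
    return indexed
```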
There are other types of spiders as well. “404 spotters” are used by search engines to help avoid referring searchers to pages that no longer exist online. These spiders go through the search engine’s index page by page (or site by site), trying to load each page. If a page can’t be found, the web server returns a “404 error,” which indicates that the page or site isn’t currently available online. When the spider doesn’t find a page, that page is deleted from the index – though some spiders will check back later to verify that the page really is offline before dropping it. This is why it’s important to use a good web hosting provider. If your server is offline at the wrong time, your site may be dropped from a search engine’s index, and it can take several weeks before it’s indexed again.
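For the curious, here’s a small sketch of the kind of check a “404 spotter” performs. The URL is hypothetical, and as noted above, a real search engine would typically re-check a page before dropping it from the index.

```python
# A sketch of a "404 spotter": request a page that is already in the index
# and see whether the server still returns it.
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def page_still_exists(url):
    """Return True if the page loads, False if the server answers 404."""
    try:
        with urlopen(url):
            return True
    except HTTPError as error:
        if error.code == 404:    # "Not Found" -- candidate for removal
            return False
        return True              # other errors: leave the page indexed
    except URLError:
        return True              # server unreachable: check again later


# Example: a hypothetical index entry being re-checked.
print(page_still_exists("http://www.example.com/old-page.html"))
```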
Another term you may hear is “spider food.” This is shorthand for anything that’s placed on a web page (mostly hyperlinks) that is intended to attract a spider’s attention. Sometimes, these are invisible links that a web surfer would not find, and are intended to direct the spider to keyword-rich “doorway” or “hallway” pages specifically designed to fool the search engines.
I don’t use these kinds of tricks, but you should be aware of them. I'll explain throughout this book why you should avoid using dirty tricks to fool the search engines. The first thing a spider is supposed to do when it visits your site is look for a file called “robots.txt”. This file contains instructions for the spider on which parts of the web site to index, and which to ignore. It is the primary way to control what a search engine spider sees on your site. Chapter 8 provides detailed instructions on how to control spiders, and why you’d want to do this.
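To show what a well-behaved spider does with that file, here’s a short Python sketch using the standard robotparser module. The robots.txt rules and URLs in it are made up purely for illustration; Chapter 8 covers what you should actually put in the file.

```python
# A sketch of how a well-behaved spider reads robots.txt before indexing.
from urllib.robotparser import RobotFileParser

# A made-up robots.txt file, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The spider asks before fetching each page:
print(parser.can_fetch("*", "http://www.example.com/index.html"))      # True
print(parser.can_fetch("*", "http://www.example.com/private/x.html"))  # False
```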
All spiders are supposed to follow certain rules, and the major search engines do follow these rules for the most part. One rule is that spiders should load only one page a minute – this rule came about when early spiders visited early web servers and tried to load entire web sites all at once. The result of this was that access for the site’s real visitors often slowed to a crawl. Since the web was much smaller then, a spider might visit your site several times a day! With modern, high-speed web servers, and spiders only visiting infrequently, this rule doesn’t make as much sense as it once did, but it’s
still followed.
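If you were writing a polite spider yourself, the rule boils down to something like the sketch below. The URLs are placeholders, and the one-minute pause is simply the guideline described above; real spiders choose their own intervals.

```python
# A sketch of the politeness rule: pause between requests so the crawler
# doesn't monopolize the web server it is visiting.
import time
from urllib.request import urlopen

# Hypothetical list of pages queued for indexing.
urls_to_fetch = [
    "http://www.example.com/",
    "http://www.example.com/page2.html",
]

for url in urls_to_fetch:
    try:
        page = urlopen(url).read()
        # ... hand the page off to the indexer here ...
    except OSError:
        pass                      # unreachable page; move on
    time.sleep(60)                # wait a minute before the next request
```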