I, Robot

April 11th, 2009

They come with names such as SLURP (Yahoo!) and Googlebot, and they’re the key to your site being properly indexed by the various search engines. They are, of course, robots, spiders, crawlers…whatever your preferred term for the software programs that constantly scour the dynamic, ever-changing web, following and indexing one link after another.

Like your “ordinary” site visitors (each visitor is too important to actually be considered ordinary), the crawler sends a request to the web server for each page it encounters. The biggest difference is that the crawler “reads” the page as text only, which helps explain the importance of everything from site architecture to the way the code for each page is written. As you’d expect, the easier it is for the crawler to evaluate the actual content, the better your chances of ranking well in the Search Engine Results Pages (which is why you shouldn’t load up your page with a couple hundred lines of JavaScript before actually getting to the HTML and content, for example).
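To get a rough feel for what “text only” means, here is a minimal Python sketch that strips a page down to its readable text and skips script and style blocks. It is only an illustration: the TextOnly class, the text_only helper and the sample page are made up for this post, and real crawlers are far more sophisticated than this.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects readable text and ignores the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def text_only(html):
    """Return the readable text of an HTML string, joined by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, invented for this example.
SAMPLE_PAGE = (
    "<html><head><title>Widgets</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Blue widgets</h1><p>Hand-made in Ohio.</p></body></html>"
)

print(text_only(SAMPLE_PAGE))
```

Everything inside the script tag disappears from the result, which is the point: the less clutter between the crawler and your content, the better.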

So, job one is to optimize every page for easier crawling, right? Not necessarily. Here are a few reasons why you might not want a crawler visiting every single page on your site:

  • Pages which are “under construction.” They have little value to your target audience and aren’t the sort of thing you want to see in your search results.
  • Pages full of links. This could look like a “link farm” to the crawler, and just might be considered spam with attendant penalties in the search results.
  • Pages composed of old content. Again, little value to your audience and potentially damaging to your corporate image.
  • Confidential info. The best approach, of course, is not to publish company secrets. But if you must, real access restriction (a password, for instance) should be first and foremost on your mind: a robots file is publicly readable and only asks crawlers to stay away.

This is where the “robots.txt” file comes in. In essence, it is a simple text file in your root directory with instructions for any crawler that scours your site. The robots file gives you the opportunity to ask crawlers to stay out of any directory you choose (well-behaved crawlers will comply), and no site should be without one. Fortunately, you don’t have to be a programming genius to write one; in fact, it’s actually pretty simple.
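Because the file always sits at the root of the site, its address can be worked out from any page address. Here is a small Python sketch of that idea; the helper name robots_url is hypothetical and example.com is just a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/products/widgets.html?id=3"))
```

Note that a robots.txt tucked away in a subdirectory would not be found this way, which is why the root directory matters.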

For maximum control, it’s probably best to address the individual crawlers specifically. The syntax itself couldn’t be simpler. For example:

User-agent: CrawlerName
Disallow: /DirectoryNameHere/
Disallow: /DirectoryName/FileName
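If you’d like to check rules like these before publishing them, Python’s standard library includes a robots.txt parser (urllib.robotparser). The sketch below feeds it the example above; the allowed helper and the paths being tested are hypothetical, and this parser’s reading of the file is not guaranteed to match every search engine’s.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this post, with its placeholder names left as-is.
ROBOTS_TXT = """\
User-agent: CrawlerName
Disallow: /DirectoryNameHere/
Disallow: /DirectoryName/FileName
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if this parser would let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed("CrawlerName", "/DirectoryNameHere/page.html"))  # blocked directory
print(allowed("CrawlerName", "/DirectoryName/Other"))          # not listed
print(allowed("OtherBot", "/DirectoryNameHere/page.html"))     # no group for this bot
```

Each Disallow value is matched as a prefix of the path, so the second rule also covers any address that merely begins with /DirectoryName/FileName.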

Repeat the pattern for each crawler you want to restrict. Whatever you do, though, always keep in mind that a crawler obeys only one group of instructions: the group addressed to it by name if there is one, and otherwise the wild-card group (User-agent: *). Major crawlers such as Googlebot choose their group this way regardless of where it appears in the file, and they do not combine it with the wild-card group. That is why the wild-card character deserves caution. A blanket User-agent: * rule applies to every crawler you haven’t named, while a crawler you have named ignores the wild-card rules entirely, so any restriction you want it to follow must be repeated in its own group (I once worked with a programmer, a talented developer and good friend, who used the wild-card character without my knowledge and disrupted our search engine strategy for weeks…yes, our friendship survived!).
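How the wild card interacts with crawler-specific groups is worth testing rather than guessing. The sketch below uses Python’s standard urllib.robotparser on a made-up file with one wild-card group and one Googlebot group; the directory names, the OtherBot name and the can_crawl helper are all hypothetical. In this parser, a crawler that has its own group follows only that group, wherever it sits in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the wild-card group comes first on purpose.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/

User-agent: Googlebot
Disallow: /archive/
"""


def can_crawl(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if this parser would let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for agent in ("Googlebot", "OtherBot"):
    for path in ("/drafts/page.html", "/archive/old.html"):
        print(agent, path, can_crawl(agent, path))
```

Here Googlebot is kept out of /archive/ but is free to crawl /drafts/, because the wild-card rule is not inherited by its own group. To keep Googlebot out of both, the /drafts/ line would have to be repeated under User-agent: Googlebot.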

So, take control of the manner in which your site is crawled by placing a robots.txt file in your root directory (that’s all you have to do; the crawlers are programmed to look for it). Just beware of that wild card character…

Where Am I?

You are currently browsing entries tagged with robots at The SEO Perspective.
