Controlling search engine spiders is critical for your search engine optimization. Controlling these spiders helps prevent duplicate content while ensuring that search engines focus on your most important pages.
You might be thinking you need a PhD but spider control is actually easier than most people think. It’s simply a matter of deploying an essential tool called the robots.txt file. Robots.txt gives spiders (aka, robots) the direction they need to find your most important pages. This file ensures a spider’s time on your site will be spent efficiently and not be wasted indexing pages you don’t want them to index.
Why Do I Want To Control The Search Spiders With A Robots.txt?
Think of your robots.txt file as the tour guide to your site for the search engines. It tells search engines where to find the content you want indexed. I say this again, IT TELLS THE SEARCH ENGINES THE CONTENT YOU WANT INDEXED by telling them to skip the content you don’t want indexed. The end result is a faster and more complete indexing of your site.
Do I Need A Robots.txt File?
No, not really. Whenever a search engine spider crawls your site, the first thing it does is check your robots.txt file to see which pages it should not index. If you want the search engine spiders to crawl and index every single page on your entire site, then you really don’t need a robots.txt file at all.
So I Want A Robots.txt File – Now What?
I suggest you think about this carefully and do it on paper before you put it live on your site. It’s important that you be very careful whenever you’re making changes that impact which of your pages get listed in the search results.
Your robots.txt file must be located in the root directory of your domain, such as:
http://www.frankpipolo.com/robots.txt. Putting it in a subdirectory such as:
http://www.frankpipolo.com/news/robots.txt does not work, so please don’t do it.
A single robots.txt file in your root directory is all you need to manage your entire site. If you have subdomains (such as frank.frankpipolo.com), each subdomain should also have its own robots.txt file.
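As a rough illustration of the location rule, here is a small Python sketch that works out where a spider would look for robots.txt given any page address. The function name robots_url and the example page addresses are made up for this post; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the root-level robots.txt address for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page addresses, used only to show the behavior.
main_site = robots_url("http://www.frankpipolo.com/news/some-post/")
subdomain = robots_url("http://frank.frankpipolo.com/page.html")

print(main_site)
print(subdomain)
```

Each host gets its own address, which is why a subdomain needs its own file.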
To create your robots.txt file, use a plain-text editor such as NotePad, Textpad or Apple TextEdit (do not use MS Word). Then just name the file robots.txt. Your robots.txt can be as short as two lines, like the very basic file shown below:
User-agent: *
Disallow: /category/
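If you want to check a file like this before putting it live, one option is Python's built-in urllib.robotparser module. This is only a sketch of how a well-behaved spider might read the two lines above; the page addresses are invented for the example, and real search engines may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The two-line file from above, held in a string instead of a real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /category/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page addresses, used only for illustration.
category_ok = parser.can_fetch("*", "http://www.frankpipolo.com/category/seo/")
about_ok = parser.can_fetch("*", "http://www.frankpipolo.com/about/")

print("category page may be crawled:", category_ok)
print("about page may be crawled:", about_ok)
```

Anything under /category/ is reported as off limits, while the rest of the site stays open.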
Let’s talk about how to create more advanced robots.txt files to handle all of your spider control needs.
Using Robots.txt Disallow To Stop Your Site From Being Indexed
As we said before, your robots.txt file is there to stop search engines from indexing pages that you don’t want showing up in the search results. Whether your content management system automatically creates duplicate content or you’re working on a new section of your site that’s not fully developed, using a robots.txt will ensure that search engines are only indexing the pages on your site that you want them to.
For blogs, robots.txt is especially handy because most blogging platforms create multiple URLs to display each post. For example, when you create a blog post using WordPress, the post is shown on:
- The main post page
- The archive page
- The category page
- The tag page
That’s four duplicate pages of content for one post! Using a robots.txt makes it easy to block spiders from the directories where your duplicate pages will be found.
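If you have several duplicate-prone directories, a short script can write the file for you. The Python sketch below is one way to do it under some assumptions: the helper name build_robots_txt is made up, and the directory names (category, tag, archive) are placeholders, since the real paths depend on how your blog's URLs are set up.

```python
def build_robots_txt(user_agent, disallowed_dirs):
    """Return robots.txt text with one Disallow line per directory."""
    lines = [f"User-agent: {user_agent}"]
    for directory in disallowed_dirs:
        # Normalize to a leading and trailing slash, e.g. "tag" -> "/tag/".
        path = "/" + directory.strip("/") + "/"
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


# Placeholder directory names; check your own site's URL structure first.
robots_text = build_robots_txt("*", ["category", "tag", "archive"])
print(robots_text)
```

Leaving the main post pages out of the list keeps them open to the spiders, so each post is still crawled once.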
To disallow a whole directory or section of your website just add the following to your robots.txt file:
User-agent: *
Disallow: /foldername/
You might be asking what User-agent is and why it is followed by the star (shift-8) symbol. User-agent names the spider the rules apply to, and the star is the wildcard symbol that matches everything, so all spiders would follow the next command and stay out of the folder. Now let’s say you wanted only Google’s spider, named Googlebot, not to go into a specific folder. The code would look like this:
User-agent: Googlebot
Disallow: /foldername/
FYI: the spider names of the major search engines are Googlebot (Google), Slurp (Yahoo), MSNBOT (MSN) and Teoma (Ask).
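To see how a rule aimed at one spider leaves the others alone, here is a small Python sketch using the standard urllib.robotparser module. It parses the Googlebot-only rule and asks the same question on behalf of two different spiders. The page address is hypothetical, and this only approximates how real spiders behave.

```python
from urllib.robotparser import RobotFileParser

# A rule that applies to Googlebot only.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /foldername/
"""

parser = RobotFileParser()
parser.parse(GOOGLEBOT_ONLY.splitlines())

# Hypothetical page inside the blocked folder.
url = "http://www.frankpipolo.com/foldername/page.html"

googlebot_ok = parser.can_fetch("Googlebot", url)
slurp_ok = parser.can_fetch("Slurp", url)

print("Googlebot may crawl:", googlebot_ok)
print("Slurp may crawl:", slurp_ok)
```

Googlebot is turned away, while Slurp has no rule that matches it and is free to crawl the folder.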
Overall that really is about it when it comes to controlling the search engine spiders with the use of the basic Robots.txt file. In a future post I will talk more about advanced things you can do but for the majority of sites out there your basic Robots.txt file will work fine.

