Robots.Txt And Linking

Robots.txt - The What, the Why and the How!

In this installment of the linking series, we need to talk about how to exclude files from the Search Engines.

Here are the questions I will answer in this issue:
  1. Why do I need a Robots.txt file?
  2. Where do I put it?
  3. Is there a tool to create it?
  4. What happens if I don't have it?

This article is part V of the series on Linking & Search Engine Optimization. You can read the other articles here:

Understanding links for Search Engine Optimization 

How a Search Engine Sees Links - Links for Search Engine Optimization Part II

Better Internal Linking for higher Ranking - Links for Search Engine Optimization Part III  

Absolute Linking Vs. Relative Linking - Understanding Links for SEO Part IV  

Why do I need to use Robots.txt?

There are several reasons why it can make sense to keep some of your pages from being indexed by the Search Engines. Robots.txt is part of a convention agreed upon by the major search engines: their spiders will not parse or index files that have been explicitly flagged for exclusion.

Some of the reasons why pages are excluded are:

Private Pages:

Some sites have private pages that should be available to paid members only.

In this case, if a Search Engine indexes a page, that page becomes publicly available. Anyone searching for the keywords used on the page can find a copy of it through the Search Engines.

Duplicate Content:

Some sites maintain several pages with the same content on them, for the purpose of printing or of being viewed on different media.

For example, the printable pages are formatted differently but have the same content. The markup may also be different for pages meant to be viewed on mobile phones.

Duplicate content is a big no-no for Search Engine Optimization. Imagine multiple search results showing the exact same page! That cannot be a pleasant experience for the searcher. Sometimes web-sites are dropped from the rankings because of duplicate content. AdSense account applications can be rejected on the mere suspicion of duplicate content.

Privacy Pages:

Most sites have common pages, such as privacy pages, that need to be linked but not indexed. Sometimes pages that are served from the database cannot be indexed due to the dynamic nature of the data.

In all the above cases, Search Engines may penalize you and the ranking of your pages may suffer.

Admin Pages:

These are pages that are used solely for maintenance and administrative purposes. Good examples are the login page, the password reset page, the add/edit product page, the Email/Contact Us page, etc.

How do I use Robots.txt?

Robots.txt should be placed in the root folder of your web-site, so that it can be reached as /robots.txt directly under your domain name. This is usually the folder where you have your home page and other config files. When the spider reaches your web-site, it looks for the file there first, and then normally starts crawling from the home page down.
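To make the location concrete, here is a minimal Python sketch that works out where a spider would look for the file, given the address of any page on a site. The function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where a spider would look for robots.txt.

    Hypothetical helper: it keeps the scheme and host of the page
    and replaces everything after the host with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder domain, not a real site from this article.
print(robots_txt_url("http://www.example.com/articles/linking.aspx?page=2"))
```

Whichever page you start from, the answer is the same address at the root of the site. A robots.txt file placed in a sub-folder would not be found there.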

What kind of data do I need to put in a Robots.txt file?

Here is a sample of what you can use:

User-Agent: *
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /sitemap-display.aspx
Disallow: /forprint/*
Allow: /

User-Agent: Googlebot
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /sitemap-display.aspx
Allow: /

Note: For most purposes, it is sufficient to use just the User-Agent: * section.

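Of course, you can also target the different search engines individually for non-inclusion of your pages. Keep in mind that a spider that has its own section, like Googlebot above, follows only that section and ignores the rules listed under User-Agent: *.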
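If you want to check your rules before uploading them, one option is the robots.txt parser that ships with Python. The sketch below feeds it rules modeled on the sample above and asks which pages a spider may fetch. Two assumptions to note: the bot name SomeBot and the page names are made up, and the print folder is written as /forprint/ without the trailing * because, as far as I know, this parser treats paths as plain prefixes and does not understand wildcards.

```python
from urllib.robotparser import RobotFileParser

# Rules modeled on the sample in this article (wildcard removed, see above).
RULES = """
User-Agent: *
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /forprint/
Allow: /

User-Agent: Googlebot
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.strip().splitlines())

# "SomeBot" is a hypothetical spider that has no section of its own,
# so it falls under the User-Agent: * rules.
for agent in ("SomeBot", "Googlebot"):
    for path in ("/", "/Link_Back.aspx", "/forprint/page.aspx"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:22} {verdict}")
```

In this sketch SomeBot is blocked from /forprint/page.aspx while Googlebot is allowed in, because the Googlebot section never mentions that folder. If you give a spider its own section, repeat every rule you want it to obey.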

Meta No Index tag

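When you block pages via the robots.txt file, the page content does not get indexed. The URL of the page can show up in the search results nevertheless, for example when other pages link to it. An alternative is to use the meta noindex tag. For the tag to work, the page must not also be blocked in robots.txt, because the spider has to read the page in order to see the tag.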
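As a rough way to confirm that a page really carries the tag, the Python sketch below reads a piece of HTML and collects whatever the robots meta tag says. The class and function names are made up for illustration, and the HTML snippet is a stand-in for one of your own pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # The content is a comma-separated list such as "noindex,nofollow".
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html: str) -> set:
    """Hypothetical helper: return the robots directives found in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(sorted(robots_directives(page)))
```

An empty result means the page has no robots meta tag, so nothing on the page itself asks the spider to keep it out of the index.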

With the meta robots tag, you can do either of the following:

  • Block a page from being indexed.
    <meta name="robots" content="noindex">
  • Block a page from being indexed and tell the spider not to follow any of the links on it.
    <meta name="robots" content="noindex,nofollow">

Back in the Robots.txt file, you can also use wildcards in the names of the pages:

  • For example, I have an old article on linking in my articles folder. I now decide to write a new one to update the old article. So, in order to avoid duplication, I want to exclude the old article from being indexed.
    But what do you know, I am too lazy. Instead of taking the time to figure out the full URL of the page, I decide to go with the following:
    e.g. Disallow: /articles/*linking*
    Please note that this is quite dangerous, as it blocks all pages in that folder that have "linking" anywhere in the name, which happens to be the case for many pages on my web-site.
  • Aaron Wall of SEO Book has an interesting article on the subject of incorrect entries in the Robots.txt file.
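The danger of the wildcard entry described above is easy to see with a small Python sketch. It assumes the matching style the major search engines document, where * stands for any run of characters and the rule is compared from the start of the path. The helper names and the page names are made up for illustration.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Turn a robots.txt path rule with * wildcards into a regular expression.

    Assumption: * matches any run of characters and a trailing $
    anchors the end of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    # re.match only succeeds if the rule matches from the start of the path.
    return rule_to_regex(rule).match(path) is not None


RULE = "/articles/*linking*"

# Hypothetical page names; only the first one was meant to be blocked.
for path in (
    "/articles/old-linking-article.aspx",
    "/articles/absolute-linking-vs-relative-linking.aspx",
    "/articles/better-internal-linking.aspx",
    "/articles/robots-txt.aspx",
):
    print(f"{path:55} {'blocked' if is_blocked(RULE, path) else 'allowed'}")
```

Only the last page escapes the rule. Naming the old article by its full path would have blocked that one page and nothing else.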

Where do I find a sample of the robots.txt file?

If you have a Google Webmasters account, you can go ahead and use the tool available there. This tool is pretty easy to use, as long as you use it minimally and block each page by name.

Using wild cards is not a good idea for beginners.

What happens if I don't have a robots.txt file?

  • All your pages are open to being indexed, unless of course you go and use the noindex meta tag.
  • Duplicate pages get indexed and you may receive a penalty.
  • A 404 Page-Not-Found error will be logged every time the Search Engine Spider visits your web-site, because it looks for the robots.txt file first.
  • A lot of unnecessary bandwidth is consumed when the spider crawls pages that did not need to be crawled.


Copyright 2008, 2009, 2010 - All rights reserved.
Deep Janardhanan & Noble River Resources ™
9517 Craigs Mill Dr., Glen Allen,Va - 23060