Robots.txt - The What, the Why and the How!
In this installment of the linking series, we need to talk about how to exclude
files from Search Engines.
Here are the questions I will answer in this issue:
- Why do I need a Robots.txt file?
- Where do I put it?
- Is there a tool to create it?
- What happens if I don't have it?
This article is part V of the series on Linking & Search Engine Optimization.
You can read the other articles here:
Understanding links for Search Engine Optimization
How a Search Engine Sees Links - Links for Search Engine Optimization Part II
Better Internal Linking for higher Ranking - Links for Search Engine Optimization Part III
Absolute Linking Vs. Relative Linking - Understanding Links for SEO Part IV
Why do I need to use Robots.txt?
There are several reasons why it can make sense to keep your pages from being
indexed by the Search Engines. Robots.txt is part of a convention agreed upon by
the major search engines: they will not parse or index files that have been
explicitly flagged for exclusion.
Some of the reasons why pages are excluded are:
Private Pages:
Some sites have private pages that should be made available to paid members only.
In this case, if a Search Engine indexes a page, that page will become publicly
available. Anyone searching on the same keyword as that of the page will find a
copy of that page through the Search Engines.
Duplicate Content:
Some sites maintain several pages with the same content on them, for the purpose
of printing or being viewed on different media.
For example, the printable pages are formatted differently but have the same content.
The markup may be different for pages meant to be viewed by mobile phones.
Duplicate content is a big no-no for Search Engine Optimization. Imagine multiple
search results showing the exact same page! That cannot be a pleasant experience
for the searcher. Sometimes web-sites are dropped from rankings because of
duplicate content, and AdSense account applications may be rejected on mere
suspicion of duplicate content.
Privacy Pages:
Most sites have common pages, such as privacy pages, that need to be linked but not
indexed. Sometimes pages that are served from the database cannot be indexed due
to the dynamic nature of the data.
In all the above cases, Search Engines may penalize you and the ranking of your
pages may suffer.
Admin Pages:
These include pages that are solely to be used for maintenance and administrative
purposes. A good example might be the login page, the password reset page, add/edit
product page, Email/Contact us Page etc.
How do I use Robots.txt?
Robots.txt should be placed at the root folder of your web-site. This is usually
the place where you have your home page and other config files. When the spider
reaches your web-site, it normally starts crawling from the home page down.
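Since the file always lives at the root, you can work out where a spider will look for it from any page address on your site. Here is a minimal sketch in Python; the function name robots_txt_url and the example.com address are my own placeholders, not part of any tool mentioned in this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only for illustration.
print(robots_txt_url("https://www.example.com/articles/linking.aspx?page=2"))
# prints https://www.example.com/robots.txt
```

No matter how deep the page sits in your folder structure, the answer is the same file at the root.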
What kind of data do I need to put in a Robots.txt file?
Here is a sample of what you can use:
User-Agent: *
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /sitemap-display.aspx
Disallow: /forprint/*
Allow: /

User-Agent: Googlebot
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /sitemap-display.aspx
Allow: /
Note: For most purposes, it is sufficient to use just the User-Agent: * group,
which applies to every spider. Of course, you can also target the different search
engines individually for non-inclusion of your pages.
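If you want to check how rules like the sample above are read before you upload the file, Python's standard urllib.robotparser module can parse them. This is only a rough sketch: the www.example.com host and the /index.aspx page are placeholders of mine, and I have left out the /forprint/* line because, to my knowledge, this parser does plain prefix matching and does not interpret the * wildcard.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from this article, minus the wildcard line (see note above).
SAMPLE = """\
User-Agent: *
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /sitemap-display.aspx
Allow: /

User-Agent: Googlebot
Disallow: /Link_Back.aspx
Disallow: /Privacy-Policy.aspx
Disallow: /sitemap-display.aspx
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything from the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(SAMPLE)
site = "http://www.example.com"  # placeholder host
for path in ("/Privacy-Policy.aspx", "/index.aspx"):
    # can_fetch answers: may this user agent crawl this address?
    print(path, parser.can_fetch("Googlebot", site + path))
```

A real search engine may interpret edge cases differently from this module, so treat the result as a sanity check rather than the final word.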
Meta No Index tag
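When you block pages via the robots.txt file, the page content does not get indexed.
The URL of the page may nevertheless show up in the search results, for example
when other sites link to it. An alternative is to use the meta noindex tag. Note
that the spider has to be able to fetch a page in order to see this tag, so a page
carrying noindex should not also be blocked in robots.txt.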
With this tag, you can do any of the following:
- Block a page from being indexed.
<meta name="robots" content="noindex">
- Block a page from being indexed and disallow following of all outbound links.
<meta name="robots" content="noindex,nofollow">
Back in the robots.txt file, you can also use wildcards in the names of the pages.
For example, I have an old article on linking in my articles folder. I now decide
to write a new one to update the old article. So, in order to avoid duplication,
I want to exclude the old article from being indexed.
But what do you know, I am too lazy. Instead of taking the time to figure out the
full URL of the page, I decide to go with the following:
e.g. Disallow: /articles/*linking*
Please note that this is quite dangerous, as it may block all pages that have "linking"
anywhere in the name, which happens to be the case on my web-site.
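To see how far a wildcard like this reaches, here is a small sketch that tests a pattern against a few page names. It assumes the common wildcard meaning, where * stands for any run of characters and the rule is matched from the start of the path. The file names are made up for illustration, and real search engines may differ in the details.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex ('*' = any characters)."""
    # Escape each literal piece, then rejoin the pieces with '.*' for the wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body)


def is_blocked(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, like a robots.txt rule does.
    return pattern_to_regex(pattern).match(path) is not None


# Hypothetical article names, not real pages.
paths = [
    "/articles/old-linking-article.aspx",
    "/articles/internal-linking-part-3.aspx",
    "/articles/robots-txt.aspx",
]
for path in paths:
    print(path, is_blocked("/articles/*linking*", path))
```

Only the first page was the target, yet the second one gets blocked as well, which is exactly the kind of accident described above.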
Aaron Wall of SEO Book has an interesting article on the subject of incorrect
entries in the Robots.txt file.
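Coming back to the meta noindex tag covered in this section: if you want to confirm that a page really carries the tag, a few lines of Python can read it out of the HTML. This is a minimal sketch using the standard html.parser module; the class name RobotsMetaParser, the helper robots_directives and the sample page are my own inventions for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Tag and attribute names arrive lowercased; values keep their case.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up page used only for illustration.
page = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
print(sorted(robots_directives(page)))
# prints ['nofollow', 'noindex']
```

An empty result means the page has no robots meta tag at all, so the spider is free to index it.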
Where do I find a sample of the robots.txt file?
If you have a Google Webmasters account, you can go ahead and use the tool available
there. This tool is pretty easy to use, as long as you use it minimally and
block each page by name.
Using wildcards is not a good idea for beginners.
What happens if I don't have a robots.txt file?
- All your pages may get indexed, unless of course you go and use the noindex meta
tag.
- Duplicate pages get indexed and you may receive a penalty.
- A 404 Page-Not-Found error will be logged on your server every time the Search
Engine Spider visits your web-site, because it looks for the robots.txt file first.
- A lot of unnecessary bandwidth is consumed when the spider crawls your site.