Canonicalization, Its Causes and Fixes
Canonical means unique. In terms of your website, it means each individual URL should bring up a unique page, and each page should be reachable at only one URL.
Canonicalization refers to removing real and apparent duplications of pages from your website. Search engines index your pages using small software programs called bots that read the content on your pages. The search engine then calculates the relative importance of each page from a multitude of factors, including the relevance of its content and the links pointing to it.
When the bot indexes a page, it compares the data it finds with every page already in its list. If your page has already been indexed under a different URL, the bot treats it as a completely different page, merely because the URL is different.
The biggest issue with duplication is the loss of link authority. When the same content is found at different URLs, the authority that should go to one page is split between them.
A page that could reach PR7 may remain at PR2 or PR3 because its PageRank is split between multiple URLs. That difference can mean a huge amount of traffic, and a lot of moolah, for you.
Canonicalization issues arise due to a number of reasons. Here are the most common ones you need to know about.
- Your website may have both www and non-www versions available to the search engine bot.
- http://www.nobleriver.com is considered separate from http://nobleriver.com. You would expect search engines to be smarter than that, given all the advancements in technology! It should be simple for a bot to look at both pages and, if they have the same content, point to either the one with the www. in front of it or the one without it.
- You may have multiple pages with identical content. This may occur when you offer special pages for printing or for display in different media. Very often, the print version is made available through a link, which a search engine follows to find the printable page.
Example:
http://www.myexample.com/cannon-fodder.html
http://www.myexample.com/cannon-fodder.html/print
Since the URL of the printable version is different, the search engine sees two pages with highly similar content.
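Both causes so far come down to several URLs answering for one page. As a rough way to audit a list of crawled URLs, the Python sketch below collapses the www and non-www hosts and a trailing `/print` segment into one key. The function name `page_key`, the choice of the www form as the preferred host, and the `/print` suffix convention are assumptions taken from the examples above, not a rule that search engines follow.

```python
from urllib.parse import urlsplit, urlunsplit

PRINT_SUFFIX = "/print"  # assumed convention, taken from the example above


def page_key(url: str) -> str:
    """Collapse www/non-www and printer-friendly variants into one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: the www form is the preferred host. This is naive and
    # would also prefix genuine subdomains such as blog.myexample.com.
    if not host.startswith("www."):
        host = "www." + host
    path = parts.path
    if path.endswith(PRINT_SUFFIX):
        path = path[: -len(PRINT_SUFFIX)]
    return urlunsplit((parts.scheme, host, path or "/", parts.query, ""))


urls = [
    "http://www.myexample.com/cannon-fodder.html",
    "http://myexample.com/cannon-fodder.html",
    "http://www.myexample.com/cannon-fodder.html/print",
]
# URLs that share a key are candidates for consolidation.
keys = {page_key(u) for u in urls}
```

Here all three URLs end up with the same key, which is the signal that they are competing for the same page's authority.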
- Sometimes we do have different pages with highly similar content due to inadvertent duplication.
- Very often we create a new copy and leave the old page lying around because someone is linking to it.
- Even if we drop some pages from our site due to a site redesign, the search engines still have the links to such pages in their index.
- Remember, for the search engine the internet is made up of individual pages that are merely linked to each other.
- Many web servers add additional parameters to your URL. This is mostly for dynamic pages that display different information at different times depending on the parameters they receive.
- A common case is a catalog page that carries a color preference in the URL so that it can load a different image. If only the image differs, the search engines may count each variant as a duplicate.
- Some web servers add a session ID to your URL. Intelligent search engines, of course, discount the most common such parameters, such as ID and sessionId. Some parameters, though, may not be accounted for.
- The default page: Most websites have default pages such as index.aspx or default.aspx. When you browse to your website, you just type in the domain name, but the web server serves up domainname/default.aspx. Hey presto: two different URLs for the search engine bot to play with.
- Some web servers treat upper-case and lower-case letters in a URL as the same and serve the same page for both. Official specifications make domain names case-insensitive, but the rest of the URL is not, so spiders may treat differently cased URLs as individual pages.
- Sometimes, depending on the path the user took to arrive at a page, variables are attached to the address of the page in a different order.
Hence you may have www.mysite.com/3rdtierpage.aspx?first=a&second=b
as well as www.mysite.com/3rdtierpage.aspx?second=b&first=a
or you may have http://www.mysite.com/3rdtierpage.aspx?first=b&second=a
All three of the above examples point to the same page, but cause major canonical issues.
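To see which of your URLs are really the same address spelled differently, you can normalize them before comparing. The following is a minimal Python sketch, not a definitive implementation: the function name `normalize`, the `SESSION_PARAMS` list and the `DEFAULT_DOCUMENTS` list are assumptions for illustration, and lower-casing the path is only safe if your server ignores case in paths. The example URLs are given an explicit http:// prefix so that they parse cleanly.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical lists for illustration; adjust to what your own site uses.
SESSION_PARAMS = {"sessionid", "sid"}
DEFAULT_DOCUMENTS = ("index.aspx", "default.aspx")


def normalize(url: str) -> str:
    """Return one consistent spelling for URLs that serve the same page."""
    parts = urlsplit(url)
    host = parts.netloc.lower()  # host names are case-insensitive
    # Assumption: the server ignores case in paths (not true of every server).
    path = parts.path.lower()
    for doc in DEFAULT_DOCUMENTS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]  # drop the file name, keep the slash
            break
    # Drop session parameters, then sort so parameter order no longer matters.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query = urlencode(sorted(pairs))
    return urlunsplit((parts.scheme, host, path or "/", query, ""))


a = normalize("http://www.mysite.com/3rdtierpage.aspx?first=a&second=b")
b = normalize("http://www.mysite.com/3rdtierpage.aspx?second=b&first=a")
c = normalize("http://www.mysite.com/3rdtierpage.aspx?first=b&second=a")
```

With this sketch, `a` and `b` come out identical because only the parameter order differed. `c` stays distinct because its parameter values differ, so whether it duplicates the other two depends on what the page does with those values.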
- Developers are trained to spare users any and all trouble. Hence, if the user types in a wrong parameter for some reason, the code redirects them to a pre-decided page. Normally, this should be a 404 "page not found" error page, but quite often it is not. In that case, the pre-decided page is served both at its own correct URL and at every mistyped URL that lands on it.
- Someone makes a copy of the entire content of your page and puts it up on their own site without a change. Google finds that page and may penalize you for having duplicate content.
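For the wrong-parameter cause in the list above, the safer behaviour is to answer an unknown value with a real 404 rather than sending the visitor to an existing page. A minimal Python sketch of that decision follows; the `CATALOG` contents and the `respond` function are hypothetical stand-ins for whatever your own application uses.

```python
from http import HTTPStatus
from typing import Tuple

# Hypothetical catalog: the only product ids the site really has.
CATALOG = {"101": "Blue widget", "102": "Red widget"}


def respond(product_id: str) -> Tuple[int, str]:
    """Return (status code, body) for a requested product id."""
    if product_id in CATALOG:
        return HTTPStatus.OK.value, CATALOG[product_id]
    # An unknown id gets a 404 instead of a redirect to some default page,
    # so no second URL ends up serving an existing page's content.
    return HTTPStatus.NOT_FOUND.value, "Page not found"
```

A request for a known id returns its page with status 200, while any mistyped id returns status 404, which tells the bot there is nothing to index at that address.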
For an article on how to fix Canonicalization issues, go to:
How to find and fix Canonicalization Issues.
Resources:
For the technically inclined: http://www.sugarrae.com/be-a-normalizer-a-c14n-exterminator/
For the regular Joe: http://www.conversationmarketing.com/2009/09/seo-101-canonicalization-1.htm