Search engines avoid indexing multiple copies of the same content, which is known as “Duplicate Content”.
Not only may a search engine decline to index such pages, but it may also penalise a site for carrying the duplicated content.
Duplicate Content will not improve your website's rankings in any of the major search engines, and should therefore be avoided.
There are two major cases of Duplicate Content:
1. Duplicate content as a result of Site Overall Structure
♦ Print-friendly pages
♦ Exactly the same website content on different domains (domain.com -> domain.ie or domain.net)
♦ Affiliate pages
♦ Navigation links and breadcrumb navigation
♦ Pages with similar content that can be accessed via different URLs
♦ Pages with items (e.g. in an e-commerce catalogue) that are very similar in name and description but differ in colour, size, etc.
♦ Pages with the same Title or Meta Tags values.
♦ Using URL-based session IDs
♦ Canonicalization problems (e.g. domain.ie versus www.domain.ie, or /index.htm versus /)
Example using .htaccess to redirect domain.com to www.domain.com:
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]
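To audit a list of URLs for the canonicalization problems described above, the same normalisation can be sketched in Python. This is only an illustration under stated assumptions: the function name canonical_url, the session parameter names and the index file names are hypothetical and should be adapted to your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; adjust to what your site really uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}
# Index file names treated as equivalent to the bare directory URL (assumed).
INDEX_FILES = ("index.htm", "index.html")


def canonical_url(url, host="www.domain.com"):
    """Return one canonical form of url on the preferred host."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # /index.htm and / are the same page, so keep only the directory form.
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Drop URL-based session IDs, keep every other query parameter in order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Force the preferred host and discard any fragment.
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))
```

Two URLs that produce the same canonical form are candidates for a 301 redirect to that form.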
There are times when a website has to contain duplicate content, as in the case of printer-friendly pages. These can easily be excluded from search engine (SE) indexes using a meta tag:
<meta name="robots" content="noindex, nofollow">
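To confirm that a printer-friendly page really carries the robots meta tag, its HTML can be inspected with a short script. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and the page HTML is assumed to have been fetched already.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for valueless attributes.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives present in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A page is excluded from indexing when "noindex" is in the returned set.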
You can also use the robots.txt file to exclude directories and files from being visited by search engines.
The robots.txt file should be placed in the root folder of your site. Below are a few basic rules that will help you deal with duplicated content:
#forbid all robots from your site
User-agent: *
Disallow: /
Disallow any URLs that start with a certain word:
#disallow Google from crawling URLs that start with /blog (note the leading /)
User-agent: googlebot
Disallow: /blog
#a particular page
Disallow: /page-name.html
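Before uploading rules like these, it can help to check which URLs they actually block. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name is_allowed and the sample URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The googlebot rules from the example above.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /blog
Disallow: /page-name.html
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.domain.com/blog/post-1",
                "http://www.domain.com/page-name.html",
                "http://www.domain.com/about.html"):
        print(url, is_allowed(ROBOTS_TXT, "googlebot", url))
```

Because the rules are written for googlebot only, other robots are not affected by them.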
You can also use wildcards to disallow any URLs containing the sub-string of your choice (in this case “print=”):
User-agent: googlebot
Disallow: /*print=
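To see which paths a wildcard rule would catch, the pattern can be translated into a regular expression. This is a rough sketch that assumes "*" matches any run of characters and that a rule matches from the start of the path; the function name rule_matches is hypothetical, and real crawlers may differ in the details.

```python
import re


def rule_matches(pattern, path):
    """Return True if a Disallow pattern (with optional '*') matches path."""
    # Escape the literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match anchors at the start of the path, like a prefix rule.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    print(rule_matches("/*print=", "/article.php?id=3&print=1"))
    print(rule_matches("/*print=", "/article.php?id=3"))
```

Running a list of your site's paths through such a check shows whether the rule blocks only the printer-friendly versions.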
2. Duplicate content as a result of content theft
CopyScape is a service that helps you find content thieves by scanning other pages for content similar to that of a given page.
If you are a victim of content theft and want to take action, first notify the person using the content illicitly by sending them a “Cease and Desist” letter, using the contact information you can gather from their website or from the WHOIS record of the domain name.
Failing that, the search engines have procedures for reporting stolen content:
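How CopyScape compares pages is not described here, but the general idea of measuring how similar two texts are can be illustrated with word shingles and Jaccard similarity. The sketch below is a swapped-in, generic technique and not CopyScape's method; the function names and the shingle size of 3 are assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length size in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

A score close to 1.0 suggests that one text is largely a copy of the other, while a score close to 0.0 suggests unrelated content.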
Google: http://www.google.com/dmca.html
Yahoo!: http://docs.yahoo.com/info/copyright/copyright.html
MSN: http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_CONC_AboutDMCA.htm