Beware that here the URL is not redirected to a dedicated 404 'page not found' URL. The URL remains the same but displays information to the user that this specific URL does not exist on the website.
As long as all deleted or invalid URLs get a response header code like e.g. 404 'not found' or 410 'gone', search engines will know over time that these URLs should be removed from their index and no longer appear in SERPs.
But imagine that a webserver is configured incorrectly and gives all deleted URLs a response code 200 'ok'. Then deleted URLs would never be removed from search engines' indexes, as the search engines perceive all deleted URLs to be alive and well because they all return a response header code 200 'ok'. In addition, all the deleted URLs contain the exact same content, i.e. information that there is no content to display.
That can generate thousands upon thousands of duplicate pages as time goes by and more and more old pages get deleted, e.g. on a webshop where products get sold out, eventually go out of production and thus will never be available again. If each of these deleted product URLs got a response header code 200 'ok' - due to a bad webserver configuration - they would all end up being identical, generating tons of duplicate content.
Always test deleted and invalid URLs on your website to see if they get a correct 404 'not found' or, even better, 410 'gone' (permanently deleted) response header code - so search engines know these URLs must be removed from their index, and so you prevent the webserver from generating a huge amount of duplicate content.
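You can check the response header code in your browser's developer tools or with a command line tool, but a minimal sketch in PHP could also do it (the URL below is just an example of a deleted page):

<?php
// Request a deleted URL and print the status line the webserver returns,
// e.g. "HTTP/1.1 404 Not Found" or - with a bad configuration - "HTTP/1.1 200 OK"
$url = 'https://domain.com/some-deleted-product.html'; // example URL
$headers = get_headers($url);
echo $headers[0];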
www, non-www, http:// and https:// pages
Using www as a 'subdomain' for your website is fine, however you should not set up your webserver to return content for both www and non-www. You need to choose one of the two and stick to it. If a browser or a search engine visits a page on the domain variant you chose not to use, it must be redirected (using a redirect of type 301 'permanently moved') to the domain variant you do use, e.g. from www to non-www:
https://www.domain.com => https://domain.com
The same goes for http and https - your webserver should not be configured to return content on both protocols. Instead you should only use the protocol with SSL encryption, i.e. https, and if a browser or a search engine visits a page on the insecure http protocol, it must also be redirected (type 301) from http to https, e.g.
http://domain.com => https://domain.com
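A minimal sketch of such 301 redirects in PHP, assuming the non-www https variant is the one you want to keep (on most setups you would configure this in the webserver itself instead):

<?php
// Redirect http:// and www. requests to the https:// non-www variant with a 301
$host    = $_SERVER['HTTP_HOST'];
$isHttps = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';

if (!$isHttps || strpos($host, 'www.') === 0) {
    $host = preg_replace('/^www\./', '', $host);
    header('Location: https://' . $host . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}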
Mobile website and desktop website
Today it is recommended to implement a website with a mobile friendly, responsive design so the content on the page is displayed in a nice, user friendly and readable format no matter what type of device a user is using (desktop, tablet and/or phone). However, there are still websites out there that have a version of their website for desktop and tablet devices on the main domain and a version of their website for mobile devices on a subdomain, e.g.
- Desktop and tablet devices: https://domain.com
- Mobile devices: https://m.domain.com
It is important to only allow search engines to index the subdomain used for mobile devices (due to Google's mobile first indexing). To prevent duplicate content between the two websites you must implement canonical on each URL on the desktop/tablet version of the website that points to the related page on the mobile version of the website, e.g.
https://domain.com/some-article.html => https://m.domain.com/some-article.html
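In HTML, the canonical tag in the <head> section of the desktop/tablet URL could then look like this (the URL is just an example):

<link rel="canonical" href="https://m.domain.com/some-article.html" />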
That will tell search engines that only the mobile version of the website should be indexed. If a user is using a desktop or tablet device and clicks on a page in a SERP that points to the mobile version of the website, it is perfectly fine to redirect such users automatically to the desktop/tablet version of the website.
Print-friendly URLs
Today most websites will print a read-friendly version of a URL on a printer without the need to load the content - to be printed - on a new print-friendly URL.
However, you still find websites that have a print button/icon/link, and if you click it, the content of the page is loaded on a different URL where much of the content is removed (e.g. header, main menu and footer content), e.g.
- Normal URL: https://domain.com/a-how-to-guide.html
- Print friendly URL: https://domain.com/a-how-to-guide.html?print=yes
Here you must prevent duplicate content by preventing the print-friendly URLs from being indexed by search engines.
You have three options here:
- Implement a robots.txt filter to prevent search engines from crawling print-friendly URLs (more about robots.txt later)
- Implement canonical on the print-friendly URL that points back to the normal URL
- Implement meta-robots=noindex on the print-friendly URL
Each method solves the issue with duplicate content.
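For instance, option two and option three correspond to these tags in the <head> section of the print-friendly URL (pick one of them; the URL is just an example):

<link rel="canonical" href="https://domain.com/a-how-to-guide.html" />
<meta name="robots" content="noindex" />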
Tip-a-friend URLs
This feature is not used much anymore, and I think that is good. The reason is that nobody should be able to send anonymous spam emails to others by abusing a tip-a-friend feature on a website. In addition, tip-a-friend features can produce a huge amount of duplicate content.
Let us assume you run an online newspaper and have a tip-a-friend feature implemented for every single article on the newspaper website. The tip-a-friend feature is not embedded inside the article URLs but loads on a new, unique URL. On this new URL you have fields where you can type your email and your friend's email, write a comment to your friend and tip your friend about a specific article from the online newspaper, e.g.
- Normal URL: https://domain.com/article-about-a-local-fire-4586.html
- Tip-a-friend URL: https://domain.com/tip-a-friend.html?articleId=4586
If you have e.g. 17,300 articles on the online newspaper, then you also have 17,300 tip-a-friend URLs - one for each article.
Usually these tip-a-friend URLs are identical; the only thing that differs is e.g. the name and ID of the article. In addition, tip-a-friend pages often have very limited (thin) content. So all tip-a-friend URLs are 98-99% identical.
To prevent duplicate content, tip-a-friend pages should be blocked from being crawled, and in addition they should be blocked from being indexed. You can do that by blocking search engines from crawling the tip-a-friend URLs via a robots.txt filter, e.g.
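User-agent: *
Disallow: /tip-a-friend.html*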
The asterisk at the end of the URL tells search engines not to crawl any URLs that begin with /tip-a-friend.html.
Please beware: robots.txt filters do not mean do-not-index, they mean do-not-crawl. So in order to make sure no tip-a-friend URLs ever get indexed in a search engine - should your robots.txt filters get modified by mistake - it is recommended to also add meta-robots=noindex in the <head> section on all tip-a-friend URLs.
Prohibiting search engines from crawling both print-friendly and tip-a-friend URLs also prevents wasting valuable Google crawl budget, which can be important for very large websites.
Tag-pages (sub categories) on blogs
Tag-pages on e.g. a WordPress blog are a kind of sub-category page. On blogs you tend to use categories for the overall topics of your blog, e.g. 10-20 different overall categories. Tags, on the other hand, usually differ from blog post to blog post because they are very specific to the topic covered in each single post. Say you run a blog about outdoor living where you write reviews of places to hike and guides on how to choose hiking boots. If you e.g. write a test of a specific waterproof hiking boot, model TF178 from brand-X, you might add that blog post to your blog categories 'Gear' and 'Hiking'. However, the tags you use for this blog post might be 'Water proof hiking boots' and 'Model TF178 - Brand-X'.
It is very unlikely that you will ever again write another blog post where you reuse the two specific tags:
- Water proof hiking boots
- Model TF178 - Brand-X
However, it is very likely that you will write new blog posts that are relevant for the two overall categories:
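- Gear
- Hiking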
On blog platforms such as WordPress, blog-category and blog-tag pages usually contain an excerpt of, and a link to, each related blog post.
For blog-categories you will hardly ever find two or more blog-category pages that are identical, because they usually contain excerpts from and links to many different posts on the blog. There would be many blog posts about e.g. 'Gear' and 'Hiking', so these blog-category pages would never be identical.
However, for blog-tags you will often see that each single blog-tag page only contains an excerpt of and a link to one single blog post. The blog-tags 'Water proof hiking boots' and 'Model TF178 - Brand-X' will most likely, for their entire life span, only contain one single blog post excerpt and link. Thus the two blog-tag pages will be 99% identical.
Blog-category pages are relevant for both users and search engines, as they can help users find relevant blog posts and help search engines better understand the topics each blog post covers, because you both provide content in the blog post and indicate relevant categories for it.
Blog-tag pages are equally relevant for users and search engines; however, you risk generating duplicate content across all the blog-tag pages. It is therefore recommended to allow search engines to crawl blog-tag pages, but prevent them from being indexed. So you should not set up any filters in robots.txt for blog-tag pages, but you should set up meta-robots=noindex (which equals noindex,follow). That allows search engines to crawl the pages and analyse the blog-tags used for each blog post, while you prevent the blog-tag pages themselves from being indexed and thus prevent the creation of duplicate content. For a WordPress blog it is very easy to set up such crawl and indexing rules for both blog-category and blog-tag pages.
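As a minimal sketch, on a WordPress blog this could be done in the theme's header.php using the is_tag() conditional (in practice most SEO plugins offer the same setting):

<?php if ( is_tag() ) : ?>
    <meta name="robots" content="noindex,follow" />
<?php endif; ?>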
Root and subfolder index-pages
When you visit a website's home page, it is loaded from the root (/) folder. If the website is built using PHP, the home page could also be loaded via the index.php file. Webservers can be set up to look for index files in the root or in other folders and load them when a user or a search engine requests content from a folder. In other words, a webserver can be set up to display content both with and without e.g. index.php or default.aspx when requesting the home page or a page in a subfolder.
Beware, you might also have a website built on e.g. an MVC framework, where all logic is put into controllers that then pass the content on to 'dumb' view files. On e.g. MVC based websites you will not have an index.php in every subfolder; instead rewritten URLs trigger the loading and execution of controllers, which later display content via view files.
So I have to be careful not to allow e.g. my home page to display content on both of these URLs:
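- https://domain.com/
- https://domain.com/index.php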
The reason is that Google and other search engines will perceive them as two completely different URLs, but with the exact same content.
To prevent such duplicate content on this blog, I've implemented a check that looks at the requested URL on e.g. the home page: if it contains index.php, the user or search engine is redirected (type 301 'permanently moved') to the home page URL (/), i.e. without /index.php. In fact it does not matter if you type /, /index.php, /?urlparameter=hi or similar URLs for the home page - it will redirect users and search engines to the root (/) URL.
The best solution to prevent duplicate content on index-pages is to use redirects, not canonical or meta-robots=noindex.
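A minimal sketch of such a check in PHP, assuming index.php only serves the home page and no query parameters need to be kept:

<?php
// index.php - send every home page variant (e.g. /index.php or /?urlparameter=hi)
// back to the clean root URL (/) with a 301 redirect
if ($_SERVER['REQUEST_URI'] !== '/') {
    header('Location: /', true, 301);
    exit;
}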
URL case insensitive webservers
If you host your website on a Microsoft IIS webserver, it will most likely be URL case insensitive.
This means that the webserver can load content on URLs no matter how you combine lowercase and capital letters in the URL, e.g.
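https://domain.com/about-us.html
https://domain.com/About-Us.html
https://domain.com/ABOUT-US.html
https://domain.com/About-us.HTML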
No matter how you combine the letters in the URL, the webserver might return content. This risks generating duplicate content, because Google and the other major search engines consider the 4 URL examples above as 4 different URLs with the exact same content.
It is recommended to either redirect users and search engines to the URL you determine to be valid (e.g. URLs using only lowercase letters), or to implement canonical on each page pointing to the valid URL (also typically a pure lowercase URL).
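A minimal sketch of such a redirect in PHP, assuming lowercase URLs are the variant you want to keep (on IIS this would typically be done with a URL Rewrite rule in the server configuration instead):

<?php
// Redirect any URL containing capital letters to its lowercase version (301)
// Note: this also lowercases the query string - adjust if your parameters are case sensitive
$requestUri = $_SERVER['REQUEST_URI'];
$lowercase  = strtolower($requestUri);

if ($requestUri !== $lowercase) {
    header('Location: ' . $lowercase, true, 301);
    exit;
}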
Same content in same language used across domains
If you e.g. have a webshop that is hosted on different TLDs (such as .com and .net) and ccTLDs (such as .co.uk or .de), e.g.
- UK: https://domain.co.uk
- US: https://domain.com
- Ireland: https://domain.ie
- Canada: https://domain.ca
Previously you could end up producing massive amounts of duplicate content across these domains with the same English content. The only content that might differ on each single product URL across the domains could be the price, shown in the country-specific currency.
However, after Google introduced hreflang, which can be embedded both in the <head> section and in XML sitemap files, you no longer risk duplicate content. The reason is that hreflang informs search engines that a URL might have one or more sibling URLs in the same language or in different languages, e.g.
English in UK:
<link rel="alternate" hreflang="en-GB" href="https://domain.co.uk" />
English in US:
<link rel="alternate" hreflang="en-US" href="https://domain.com" />
English in Ireland:
<link rel="alternate" hreflang="en-IE" href="https://domain.ie" />
English in Canada:
<link rel="alternate" hreflang="en-CA" href="https://domain.ca" />
<link rel="alternate" hreflang="en" href="https://domain.com" />
<link rel="alternate" hreflang="x-default" href="https://domain.com" />
Technical options to prevent duplicate content
The methods you can use to prevent duplicate content on your website have already been mentioned above, but here they are summarized.