grosen.com - blog about Technical SEO

It makes sense that search engines want owners of websites to eliminate or reduce the amount of identical or nearly identical content on their websites. The reason is that:

It is much easier for search engines to properly read through and index content on the Internet

if..

specific/unique content is only found on one single URL.

Thus most major search engines reward a website that prevents identical content from appearing on two or more URLs on the same website or across different websites.

Preventing duplicate content is an ongoing challenge, and it is often solved using technical SEO techniques.

How to prevent duplicate content?

Making sure that a piece of content only appears on one single URL can be accomplished in 3 ways:

  1. Content is found only on one specific URL on the Internet
  2. Content is found on multiple URLs, but only one of the URLs is ACTUALLY allowed to be indexed
  3. Content is found on multiple URLs, but only one URL is set up to be the original/master, while all other URLs with the same content are set up to be copies of the master/original URL

How you solve that using technical SEO, I'll get back to later.

Duplicate Content - a technical SEO challenge
 

Why should you prevent duplicate content on your website?

There are several advantages for websites that prevent duplicate content

  • You comply with Google's and many other major search engines' guidelines
  • Should your website be listed on a search engine results page (SERP) when users search for a specific, relevant keyword or phrase, you want as much control as possible over which specific URL from your website gets displayed in the SERP for that keyword/phrase. You gain that control by eliminating duplicate content and making sure that only your preferred URLs get indexed in the search engines.
  • All internal PageRank - plus the value of anchor text, the value of the text in the alt attribute of an image link, and other signals - is sent to one specific and strong URL, instead of being spread across multiple URLs that contain the same content.

Where can duplicate content appear?

Very often you read or hear that it is recommended to prevent duplicate content, BUT duplicate content is not only about the main text on a URL that users can read in their browser. It is important to know that it can appear in up to 3 different fields on a URL:

  • In 3 different fields on two or more URLs on the same domain
  • In 1 field across two or more URLs on different domains

The 3 different fields mentioned above that can contain duplicate content are the following:

  1. Title tag
  2. Meta-Description tag
  3. Main text

You need to know exactly which fields to focus on, because it depends on whether you are working with content on the same domain or content on different domains.

Question: Can there be duplicate content in the following fields?

                  | Title tag | Meta-description tag | Main text
Same domain       | Yes       | Yes                  | Yes
Different domains | No        | No                   | Yes
 

THUS: You do not have to worry about duplicate content in titles and meta-descriptions on two or more URLs across different domains, but you have to be very careful if it appears on two or more URLs on the same domain.

Why does Google focus on title and meta-description as well as main text in relation to duplicate content?

To me it is obvious that the main text on a URL is something search engines would like to find on only one single URL.

However, Google and the other major search engines do not focus only on the main text on a URL; they also focus on the two fields title and meta-description.

The reason is that the content that appears in these two fields - is often used by search engines - to describe a URL in SERPs.

  • Title tags are often used as a URL's headline in SERPs
  • Similarly, the meta-description is often used as a URL's description/excerpt in SERPs

Here you can see the title tag and meta-description appear in the <head> section in HTML code:

Example: Title and Meta Description appear in HTML code
 

And here you can see that the content of the two fields is used to describe the URL in a Google SERP.

Example: Title and Meta Description are used to describe URL in a SERP
 

However, there is absolutely no guarantee that the content in a URL's title and meta-description will be used to describe a URL in a SERP

But if you have good and unique content in these two fields, you increase the chances that Google and other major search engines will use that content in their SERPs - and you should certainly aim for that, because it minimizes the risk that search engines will write their own headline and description/excerpt for a URL from your website in their SERPs.

Meta description:

The content of the meta-description has no effect as a ranking factor. However, it can have a significant effect on CTR if you write a relevant text in the meta-description that includes a clear CTA (call to action).

Title:

The opposite is true for the content in the title tag: what you write here has an impact on which keywords a URL will get good rankings for in SERPs.

Thus, these two fields are very important for a URL: they DESCRIBE the URL in SERPs, and combined they can have a huge impact on both rankings and CTR in SERPs.

So:

  • if you have two or more URLs on the same domain that have the same content in title and/or meta-description, and
  • all the URLs compete:
    • to be read by
    • indexed by
    • and get good organic rankings in search engines

My question to you: "Which one" of these URLs - with identical title and meta-description - should search engines choose to rank and display in their SERPs?

Which one?

My recommendation:

  • You should never let any search engine make that decision for you
  • YOU should be in control of that as much as possible.
  • You do that by removing any doubt about which URLs from your website search engines should display in specific SERPs

That requires that you have unique content in both the title and meta-description for every single URL on your website.

Now - where exactly are the title, meta-description and main text located in a URL's content?

A URL has two major sections, the <head> and the <body> section.

  • <head> section: not visible to users unless they specifically choose to look it up in their browser; however, this section is always read and used by both browsers and search engines. This is where e.g. links to JavaScript code and CSS style sheets reside - all needed to organize and display content properly on a URL
  • <body> section: Only the content in the <body> section is displayed to a user in his/her browser

Both the title and meta-description appear in the <head> section, and the main text appears in the <body> section.


<html>

<head>
	<title>URL title - used as headline in SERPs</title>
	<meta name="description" content="URL excerpt or short description about the content - used as description in SERPs"/>
	
</head>

<body>

	Main-text

</body>
</html>
 

What type of pages are known to generate duplicate content?


Paginated category pages

Category pages that only display a small fraction of all the products available within a category are known to generate duplicate content. E.g. if a category contains 210 products and you display 50 products per category page, then the entire category will span 5 paginated pages, e.g.:

  • 001-050: domain.com/category.html
  • 051-100: domain.com/category.html?page=2
  • 101-150: domain.com/category.html?page=3
  • 151-200: domain.com/category.html?page=4
  • 201-210: domain.com/category.html?page=5

If such paginated pages are not handled correctly, they risk displaying:

The same content in the title, meta-description, H1 headline and some of the main text.

The only content that differs are the 50 different products on each paginated category page.

Unfortunately, on Google, you can no longer prevent duplicate content using pagination

Previously, Google supported rel="prev"/"next" in the <head> section on paginated category pages, e.g. for paginated page 4:


<link rel="prev" href="/category.html?page=3" />
<link rel="next" href="/category.html?page=5" />

Page 4 points to page 3 as its previous page and to page 5 as its next page.

Beware, there were special pagination rules for the very first and the very last paginated page:

  • The very first page only had a "next" pointing to page 2, because page 0 does not exist.
  • The very last page (here it would be page 5) only had a "prev" pointing back to page 4, because page 6 does not exist

Now, the absolute beauty of pagination was (if fully supported by a search engine):

  • You could perceive all the paginated category pages as one-single-but-very-loooong-category-page, and thus it was OK that the paginated category pages had the same content in:
    • Title
    • Meta-description
    • Main text (including H1 headline)
  • Click depth could be perceived as starting from the very first of the paginated category pages, so even if a product appeared on paginated category page 5, its click depth did not have to be 6+ clicks away from the home page, but only 2 clicks away from the home page.
    • Before (when pagination was supported):
      1. home page => 1st click =>
      2. paginated category pages => (no matter how deep) 2nd click =>
      3. product page
    • After (pagination no longer supported):
      1. home page => 1st click =>
      2. paginated category page 1 => 2nd click =>
      3. paginated category page 2 => 3rd click =>
      4. paginated category page 3 => 4th click =>
      5. paginated category page 4 => 5th click =>
      6. paginated category page 5 => 6th click =>
      7. product page
    Beware, all this about pagination and click depth is something I never saw/read/heard Google confirm; this is my personal perception of how major search engines like Google embraced support for pagination - besides preventing duplicate content.

Unfortunately, Mr. John Mueller confirmed in a Google Webmaster Hangout in 2019 that Google no longer supports pagination; Google now perceives each paginated category page as an individual page.

Warning: This does NOT mean that you should remove rel="next"/"prev" from your paginated pages, because other major search engines like Bing and Yandex might still support pagination.
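
If you do keep rel="prev"/"next" tags, they can be generated from the current page number and the total number of pages. Here is a minimal PHP sketch (assuming the URL structure from the example above) - an illustration, not a drop-in implementation:

<?php
// Minimal sketch: output rel="prev"/"next" tags for a paginated category page.
// The first page gets no "prev" and the last page gets no "next".
function paginationLinks(string $baseUrl, int $currentPage, int $lastPage): string
{
    $tags = '';
    if ($currentPage > 1) {
        // Page 1 lives on the base URL itself (no ?page= parameter)
        $prev  = ($currentPage === 2) ? $baseUrl : $baseUrl . '?page=' . ($currentPage - 1);
        $tags .= '<link rel="prev" href="' . $prev . '" />' . "\n";
    }
    if ($currentPage < $lastPage) {
        $tags .= '<link rel="next" href="' . $baseUrl . '?page=' . ($currentPage + 1) . '" />' . "\n";
    }
    return $tags;
}

// Paginated page 4 of 5 on /category.html:
echo paginationLinks('/category.html', 4, 5);
// <link rel="prev" href="/category.html?page=3" />
// <link rel="next" href="/category.html?page=5" />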

However, this - in my opinion - gives you an extra challenge when it comes to preventing duplicate content on paginated category pages. To prevent it, you can consider implementing the following:

  • Append the page number to both the title and meta-description - not on the very first page, but on paginated page 2 and forward.
  • Consider taking advantage of spinning content for headlines and main text - that would make them unique across all paginated pages (see the sketch after this list).

    Full disclosure: I know that spinning content is not recommended by Google at all.

    But how do you manually write unique content for e.g. thousands or millions of paginated category pages?

    How?

    However, if you spin content of a very high quality, it can be used successfully to prevent duplicate content. Beware, you must spend a lot of time generating the spintax base used to spin unique content, and much of that time will go into proofreading the final content, removing bad sentences and correcting spelling errors. Back when I was an external technical SEO consultant, I helped clients spin content, and we had to spend a total of approx. 13 man-hours to produce a spintax base that could spin/produce readable content of approx. 100 words - just to give you an idea of the effort you need to put into such a project.
  • Make sure you output each product image with text in the image's alt attribute, and that you output the product name and price for every single product listed on a category page.

    Do not forget that these types of product data are actually the unique part of a paginated page's main text.
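
Regarding the spinning of content: the idea is to maintain one spintax base such as {Browse|Explore} our {selection|range} of ... and expand it to a different variant per paginated page. Below is a minimal PHP sketch of such an expander; the spintax base shown is a made-up example, and a real base needs far more variation (and proofreading) to produce readable text:

<?php
// Minimal sketch of a spintax expander: picks one option from each {a|b|c}
// group. Seeding the random generator with the page URL keeps the chosen
// variant stable between page loads / crawls.
function spin(string $spintax, string $pageUrl): string
{
    mt_srand(crc32($pageUrl)); // deterministic per URL

    while (preg_match('/\{([^{}]+)\}/', $spintax, $match, PREG_OFFSET_CAPTURE)) {
        $options     = explode('|', $match[1][0]);
        $replacement = $options[mt_rand(0, count($options) - 1)];
        $spintax     = substr_replace($spintax, $replacement, $match[0][1], strlen($match[0][0]));
    }
    return $spintax;
}

// Made-up spintax base for a paginated category page intro text:
$base = '{Browse|Explore|Discover} our {selection|range} of hiking boots, '
      . '{sorted|ordered} by {popularity|price|brand}.';

echo spin($base, '/category.html?page=4');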

Beware, I do NOT recommend implementing a canonical on paginated category pages 2, 3, etc. that points back to the very first of the paginated category pages:

  • There are e.g. 50 different products on each paginated category page, and they are in no way identical enough to justify implementing canonical
  • You risk sending too much internal PageRank and other signals from paginated category pages 2, 3, etc. back to the very first of the paginated pages, and not letting any internal PageRank flow through and down to the product pages/URLs that are linked to from the paginated category pages

Beware: Implementing a canonical on paginated pages is only recommended if each paginated category page points to itself, e.g.:

  • page 2 points to itself
  • page 3 points to itself
  • etc.
That can e.g. be very useful to prevent duplicate content on webservers with case-insensitive URLs. Case-insensitive URLs, and how to implement a canonical on a page to prevent duplicate content, I'll describe in detail later.


Category pages with filtering

Webshops help their users find the product(s) they are searching for by e.g. implementing a very strong internal search engine as well as providing filtering of the products on category pages, e.g.

  • Type
  • Size
  • Color
  • Material
  • Brand
  • Gender
  • Price range
  • Etc.

Very often, filtering on category pages is implemented using URL parameters, e.g.:

  • domain.com/category-a?gender=f&size=42&colorCode=b56
  • domain.com/category-b?type=computer&brandId=5732&maxPrice=550

Such filtering can, however, generate an endless amount of duplicate content due to all the different ways you can combine the filters. And you do not want such URLs to be either crawled or indexed by search engines.

Category filtering is something I often label "the never-ending indexing", as you can combine filtering options in perhaps not just thousands but millions of ways.

If you do not prevent search engines from both crawling and indexing filtered category pages, you risk duplicate content, and - equally important - you risk wasting valuable Google crawl budget. This is very important to prevent for very large webshops with extensive filtering options on their category pages.

Here I recommend that you combine robots.txt filters - to prevent crawling of filtered category pages - with meta-robots=noindex,nofollow on all category pages that contain 1 or more URL parameters used to filter the products on category URLs.

However, you should allow search engines to crawl and index category URLs with no filtering URL parameters, or with only a pagination URL parameter (e.g. page=7):

  • Allow crawling and indexing: domain.com/category-a
  • Allow crawling and indexing: domain.com/category-a?page=7
  • Block crawling and indexing: domain.com/category-b?type=computer&brandId=5732&maxPrice=550
  • Block crawling and indexing: domain.com/category-a?page=4&brandId=75668

Here you can add the following Disallow filters to robots.txt, e.g.

  • Disallow: /*?type=*
  • Disallow: /*&type=*
  • Disallow: /*?brandId=*
  • Disallow: /*&brandId=*
  • Disallow: /*?maxPrice=*
  • Disallow: /*&maxPrice=*
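
To cover the indexing part of the recommendation above, the category page template can check for the filtering parameters and output meta-robots=noindex,nofollow when any of them are present. A minimal PHP sketch, using the parameter names from the examples above:

<?php
// Minimal sketch: if the request contains any of the filtering URL parameters,
// output a noindex,nofollow meta tag so the filtered category page is kept
// out of search engine indexes (robots.txt handles the crawling side).
$filterParams = ['type', 'brandId', 'maxPrice', 'colorCode', 'size', 'gender'];

$isFiltered = false;
foreach ($filterParams as $param) {
    if (isset($_GET[$param])) {
        $isFiltered = true;
        break;
    }
}

if ($isFiltered) {
    // Placed inside the <head> section of the category page template
    echo '<meta name="robots" content="noindex, nofollow" />';
}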

If you do not have access to alter your robots.txt file, you can also set up rules for URL parameters in a classic tool that is now part of the new Google Search Console. Here you can modify the configuration for URL parameters that Google has already found on your website, and you can add URL parameters yourself. After that, you can specify what each URL parameter does on your website - e.g. URL parameters only used for tracking (such as the gclid= or fbclid= parameters), or URL parameters only used to narrow down the content on a category page.

Showing you how to set up and configure URL parameters via this classic tool in the new Google Search Console is beyond the scope of this blog post, but I might consider covering that in another blog post in the future.


Product variant pages

On some webshops, each product variant appears on a unique URL. E.g. product pages for fishing hooks in different sizes could have a unique URL for each fishing hook size, e.g.:

  • Fishing hook size 1: domain.com/fishing-hook-4561.html
  • Fishing hook size 2: domain.com/fishing-hook-5681.html
  • Fishing hook size 4: domain.com/fishing-hook-5699.html
  • Fishing hook size 6: domain.com/fishing-hook-6558.html
  • Fishing hook size 8: domain.com/fishing-hook-6859.html

Compared to the paginated category pages described above, these fishing hook product pages are so identical that the only content that typically differs is the product variant (here the fishing hook size) and the price. Thus they are identical enough that using a canonical is justified. Here you would perceive the product page for fishing hook size 1 as the original/master page, and all other pages (i.e. fishing hook size 2 and up) would have a canonical that points back to the chosen master product page (fishing hook size 1).
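
A minimal PHP sketch of how that could look in the <head> section of the variant pages (the master URL below is the hypothetical URL for fishing hook size 1):

<?php
// Minimal sketch: every fishing hook variant page (size 2, 4, 6, 8, ...)
// declares the size 1 page as its canonical (master) URL.
$masterUrl = 'https://domain.com/fishing-hook-4561.html';

// Placed inside the <head> section of each variant page
echo '<link rel="canonical" href="' . htmlspecialchars($masterUrl) . '" />';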


"Not found" pages

Each page on a website has a response header code that is sent to and read by both browsers and search engines; each response code informs about the state of the URL. Here is a list of the most commonly seen response header codes:

  • 200 - Page content is ok
  • 301 - Page permanently moved to a new URL
  • 302 - Page temporarily moved to another URL
  • 404 - Page not found
  • 410 - Page permanently deleted
  • 500 - Server error

If you want to see a more complete list of all the different response header codes, you can find it here.

To read a URL's response header, you can use an online tool, or you can inspect a URL in your browser, select the 'Network' tab, reload the URL and see the response code, e.g.:

https://politiken.dk/this-page-is-not-real

Use browser inspection to find 404 response header code
 

Beware that here the URL is not redirected to a dedicated '404 not found' URL, e.g.:

domain.com/404.html

The URL remains the same but displays information to the user that this specific URL does not exist on the website

As long as all deleted or invalid URLs get a response header code like 404 'not found' or 410 'gone', search engines will know over time that these URLs should be removed from their index and no longer appear in SERPs.

But imagine that a webserver is configured incorrectly and gives all deleted URLs a response code 200 'ok'. Then the deleted URLs would never be removed from the search engines' indexes, as the search engines would perceive all deleted URLs to be alive and well because they all return a response header code 200 'ok'. In addition, all the deleted URLs contain the exact same content, i.e. information that there is no content to display.

That can generate thousands upon thousands of duplicate pages as time goes by and more and more old pages get deleted - e.g. on a webshop where products get sold out and eventually are no longer produced and thus never will be available again. If each of these deleted product URLs got a response header code 200 'ok' - due to a bad webserver configuration - they would all end up being identical, generating tons of duplicate content.

Always test deleted and invalid URLs on your website to see if they get a correct 404 'not found' or, even better, 410 'gone' (permanently deleted) response header code - so search engines know these URLs must be removed from their index, and so the webserver does not generate a huge amount of duplicate content.
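
In PHP, the response header code must be set before any content is sent. Here is a minimal sketch of how a webshop could answer with 404 or 410 instead of 200 for missing products - findProductByUrl is a hypothetical helper, not a real function:

<?php
// Minimal sketch: send the correct response header code for missing URLs
// instead of rendering a "not found" template with a 200 'ok' code.
$product = findProductByUrl($_SERVER['REQUEST_URI']); // hypothetical helper

if ($product === null) {
    http_response_code(404);      // URL never existed: 'not found'
} elseif ($product->isDeleted) {
    http_response_code(410);      // URL is permanently deleted: 'gone'
}
// ...then render the normal page or the "not found" content as usual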


www, non-www, http:// and https:// pages

Using www as a 'subdomain' for your website is OK; however, you should not set up your webserver to return content for both www and non-www. You need to choose one of the two and stick to it. If a browser or a search engine visits a page on the domain variant you chose not to use, it must be redirected (using a redirect of type 301 'permanently moved') to the domain variant you use, e.g. from www to non-www:

https://www.domain.com => https://domain.com

The same goes for http and https - your webserver should not be configured to return content on both protocols. Instead you should only use the protocol with SSL encryption, i.e. https, and if a browser or a search engine visits a page on the insecure http protocol, it must also be redirected (type 301) from http to https, e.g.:

http://domain.com => https://domain.com
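
Such redirects are usually configured directly on the webserver (e.g. in the Apache, nginx or IIS configuration), but they can also be handled in application code. A minimal PHP sketch that forces https and non-www with a single 301 redirect, assuming domain.com is the preferred variant:

<?php
// Minimal sketch: redirect http:// and www. requests to https://domain.com
// with a 301 'permanently moved' response.
$host    = $_SERVER['HTTP_HOST'];
$isHttps = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';

if (!$isHttps || strpos($host, 'www.') === 0) {
    $target = 'https://' . preg_replace('/^www\./', '', $host) . $_SERVER['REQUEST_URI'];
    header('Location: ' . $target, true, 301);
    exit;
}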


Mobile website and desktop website

Today it is recommended to implement a website with a mobile-friendly responsive design, so the content is displayed in a nice, user-friendly and readable format no matter what type of device a user is on (desktop, tablet and/or phone). However, there are still websites out there that have one version of the website for desktop and tablet devices on the main domain and another version for mobile devices on a subdomain, e.g.:

  • Desktop and tablet devices: https://domain.com
  • Mobile devices: https://m.domain.com

It is important to only allow search engines to index the subdomain used for mobile devices (due to Google's mobile-first indexing). To prevent duplicate content between the two websites, you must implement a canonical on each URL on the desktop/tablet version of the website that points to the related page on the mobile version of the website, e.g.:

http://domain.com/some-article.html => http://m.domain.com/some-article.html

That will tell search engines that only the mobile version of the website should be indexed. If a user is using a desktop or tablet device and clicks on a page in a SERP that points to the mobile version of the website, it is perfectly fine to redirect such users automatically to the desktop/tablet version of the website.


Print-friendly pages

Today most websites will print a read-friendly version of a URL without needing to load the content to be printed on a separate print-friendly URL.

However, you still find websites that have a print button/icon/link, and if you click it, the content of the page is loaded on a different URL where much of the content is removed (e.g. the header, main menu and footer), e.g.:

  • Normal URL: https://domain.com/a-how-to-guide.html
  • Print friendly URL: https://domain.com/a-how-to-guide.html?print=yes

Here you must prevent duplicate content by preventing the print-friendly URLs from being allowed to be indexed in search engines.

You have three options here:

  1. Implement a robots.txt filter to prevent search engines from crawling print-friendly URLs (more about robots.txt later)
  2. Implement canonical on the print-friendly URL that points back to the normal URL
  3. Implement meta-robots=noindex on the print-friendly URL

Each method solves the issue with duplicate content.
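
For option 2, here is a minimal PHP sketch (assuming the ?print=yes parameter from the example above): when the print-friendly version is requested, the page declares the normal URL as its canonical:

<?php
// Minimal sketch: when the print-friendly version of a page is requested,
// output a canonical pointing to the normal URL so only that URL is indexed.
if (isset($_GET['print']) && $_GET['print'] === 'yes') {
    $normalUrl = 'https://domain.com' . strtok($_SERVER['REQUEST_URI'], '?');

    // Placed inside the <head> section of the print-friendly page
    echo '<link rel="canonical" href="' . htmlspecialchars($normalUrl) . '" />';
}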


Tip-a-friend pages

This feature is not used much anymore, and I think that is a good thing. The reason is that nobody should be able to send anonymous spam emails to others by abusing a tip-a-friend feature on a website. In addition, tip-a-friend features can produce a huge amount of duplicate content.

Let us assume you run an online newspaper and you have a tip-a-friend feature implemented for every single article on the newspaper website. The tip-a-friend feature is not embedded inside the article URLs but loads on a new, unique URL. On this new URL you have fields where you can type your email and your friend's email, write a comment to your friend, and tip your friend about a specific article from the online newspaper, e.g.:

  • Normal URL: https://domain.com/article-about-a-local-fire-4586.html
  • Tip-a-friend URL: https://domain.com/tip-a-friend.html?articleId=4586

If you have e.g. 17,300 articles on the online newspaper, then you also have 17,300 tip-a-friend URLs - one for each article.

Usually these tip-a-friend URLs are identical; the only thing that differs is e.g. the name and ID of the article. In addition, tip-a-friend pages often have very limited (thin) content. So all tip-a-friend URLs are 98-99% identical.

To prevent duplicate content, tip-a-friend pages should be blocked from being crawled, and in addition they should be blocked from being indexed. You can block search engines from crawling the tip-a-friend URLs via a robots.txt filter, e.g.:

Disallow: /tip-a-friend.html*

The asterisk at the end is a wildcard, so this filter tells search engines not to crawl any URL that begins with /tip-a-friend.html (e.g. /tip-a-friend.html?articleId=4586).

Please beware: robots.txt filters do not mean do-not-index; they mean do-not-crawl. So in order to make sure no tip-a-friend URLs ever get indexed in a search engine - should your robots.txt filters get modified by mistake - it is recommended to also add meta-robots=noindex in the <head> section on all tip-a-friend URLs.

Prohibiting search engines from crawling both print-friendly and tip-a-friend URLs also prevents wasting valuable Google crawl budget, which can be important for very large websites.


Tag-pages (sub categories) on blogs

Tag pages on e.g. a WordPress blog are a kind of sub-category page. On blogs you tend to use categories for the overall topics on your blog, e.g. 10-20 different overall categories. Tags, on the other hand, usually differ from blog post to blog post, because they are very specific to the topic covered in each single post. Say you run a blog about outdoor living where you write reviews of places to hike or guides on how to choose hiking boots. If you e.g. write a test of some specific waterproof hiking boots, model TF178 from Brand-X, then you might add that blog post to your blog categories 'Gear' and 'Hiking'. However, the tags you use for this blog post might be 'Waterproof hiking boots' and 'Model TF178 - Brand-X'.

It is very unlikely that you will ever write another blog post where you reuse the two specific tags:

  • Waterproof hiking boots
  • Model TF178 - Brand-X

However, it is very likely that you would write new blog posts that are relevant for the two overall categories:

  • Gear
  • Hiking

On blog platforms such as WordPress, blog-category and blog-tag pages usually contain an excerpt of, and a link to, each related blog post.

For blog categories, you will hardly ever find two or more blog-category pages that are identical, because they usually contain excerpts from, and links to, many different posts on the blog. There would be many blog posts about e.g. 'Gear' and 'Hiking', so these blog-category pages would never be identical.

However, for blog tags you will often see that a single blog-tag page only contains an excerpt of and a link to one single blog post. The blog tags 'Waterproof hiking boots' and 'Model TF178 - Brand-X' will most likely, for their entire life span, only contain one single blog-post excerpt and link. Thus the two blog-tag pages will be 99% identical.

Blog-category pages are relevant for both users and search engines, as they can help users find relevant blog posts and help search engines better understand the topics each blog post covers, because you both provide content in the blog post and indicate relevant categories for it.

Blog-tag pages are equally relevant for users and search engines; however, you risk generating duplicate content across all the blog-tag pages. Thus it is recommended to allow search engines to crawl blog-tag pages but prevent them from being indexed. You should therefore not set up any filters in robots.txt for blog-tag pages, but you should set meta-robots=noindex (which equals noindex,follow). That allows search engines to crawl the pages and analyse the blog tags used for each blog post, while preventing the blog-tag pages themselves from being indexed and thus preventing duplicate content. For a WordPress blog it is very easy to set up such crawl and indexing rules for both blog-category and blog-tag pages.
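
Most WordPress SEO plugins can do this with a single setting, but it can also be done without a plugin. A minimal sketch for a theme's functions.php, assuming WordPress 5.7 or newer (which ships the wp_robots API):

<?php
// Minimal sketch: mark tag archive pages as noindex,follow while leaving
// category archives indexable. Requires WordPress 5.7+ (wp_robots API).
add_filter('wp_robots', function (array $robots): array {
    if (is_tag()) {
        $robots['noindex'] = true;
        $robots['follow']  = true;
    }
    return $robots;
});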


Root and subfolder index-pages

If you e.g. visit a website's home page, it is loaded from the root (/) folder. If the website is built using PHP, the home page could also be loaded via the index.php file. Webservers can be set up to look for index files in the root or other folders and load them when a user or a search engine requests content from a folder, e.g.:

  • https://domain.com/
  • https://domain.com/index.php
  • https://domain.com/category-a/
  • https://domain.com/category-a/default.aspx

A webserver can be set up to display content both with and without e.g. index.php or default.aspx when requesting the home page or a page in a subfolder.

Beware, you might also have a website built on e.g. an MVC framework, where you put all logic into controllers that then pass the content on to 'dumb' view files. On e.g. MVC-based websites you will not have an index.php in every subfolder; instead you will have rewritten URLs that trigger loading and execution of controllers, which later display content via view files.

However, this blog is hand-coded using PHP, HTML, CSS and JavaScript (no database involved), and every single URL on this blog has its own .php file that contains the content; no MVC framework is used here.

So I have to be careful not to allow e.g. my home page to display content on both

  • https://grosen.com/
  • https://grosen.com/index.php

The reason is that Google and other search engines will perceive that as two completely different URLs, but with the exact same content.

To prevent such duplicate content on this blog, I've implemented a check that looks at the requested URL: if e.g. the home page is requested and the URL contains index.php, then the user or search engine is redirected (type 301 'permanently moved') to the home page URL (/), i.e. without /index.php. In fact, it does not matter if you type /, /index.php, /?urlparameter=hi or similar URLs for the home page - users and search engines will be redirected to the root (/) URL.

The best solution to prevent duplicate content on index pages is to use redirects - not canonical and not meta-robots=noindex.
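
A minimal PHP sketch of such a check (not the exact code used on this blog):

<?php
// Minimal sketch: send /index.php (and / with any query string) to the
// clean root URL (/) with a 301 'permanently moved' redirect.
$path  = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$query = $_SERVER['QUERY_STRING'] ?? '';

if ($path === '/index.php' || ($path === '/' && $query !== '')) {
    header('Location: https://grosen.com/', true, 301);
    exit;
}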


URL case insensitive webservers

If you host your website on a Microsoft IIS webserver, it will most likely be URL case insensitive.

This means that you can load content on URLs no matter how you combine the usage of small and capital letters in the URL e.g.

  • https://domain.com/category-a
  • https://domain.com/cAtEgOrY-a
  • https://domain.com/CaTeGoRy-A
  • https://domain.com/CATEGORY-A

No matter how you combine the letters in the URL, the webserver might return content. This risks generating duplicate content, because Google and the other major search engines consider the 4 URL examples above to be 4 different URLs with the exact same content.

It is recommended to either redirect the user or search engine to the URL you determine to be valid (e.g. URLs only using lowercase letters), or to implement a canonical on each page pointing to the valid URL (also typically the pure lowercase URL).
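
On IIS this is typically handled with the URL Rewrite module, but the redirect variant can also be sketched in PHP. A minimal sketch that 301-redirects any mixed-case path to its all-lowercase version:

<?php
// Minimal sketch: redirect e.g. /CaTeGoRy-A to /category-a with a 301,
// so only the lowercase URL can be crawled and indexed.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if ($path !== strtolower($path)) {
    header('Location: https://domain.com' . strtolower($path), true, 301);
    exit;
}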


Same content in same language used across domains

If you e.g. have a webshop that is hosted on different TLDs (e.g. .com and .net) and ccTLDs (e.g. .co.uk or .de):

  • UK: https://domain.co.uk
  • US: https://domain.com
  • Ireland: https://domain.ie
  • Canada: https://domain.ca
  • Etc.

Previously you could end up producing massive amounts of duplicate content across these domains with the same English content. The only content that might differ on each single product URL across the domains could be the price in the local currency.

However, after Google introduced hreflang - which can be embedded both in the <head> section and in XML sitemap files - you no longer risk duplicate content. The reason is that hreflang informs search engines that a URL might have one or more sibling URLs in the same language or in different languages, e.g.:

English in UK: <link rel="alternate" hreflang="en-GB" href="https://domain.co.uk" />
English in US: <link rel="alternate" hreflang="en-US" href="https://domain.com" />
English in Ireland: <link rel="alternate" hreflang="en-IE" href="https://domain.ie" />
English in Canada: <link rel="alternate" hreflang="en-CA" href="https://domain.ca" />
English only: <link rel="alternate" hreflang="en" href="https://domain.com" />
X-default (?): <link rel="alternate" hreflang="x-default" href="https://domain.com" />


Technical options to prevent duplicate content

The methods you can use to prevent duplicate content on your website have already been mentioned above, but here they are summarized:

Option              | Prevent crawling | Prevent indexing | Example
1) Robots.txt       | Yes              | No               | Disallow: /*&brandId=*
2) GSC              | Yes              | No               | N/A
3) Noindex          | No               | Yes              | <meta name="robots" content="noindex" />
4) Noindex,Nofollow | Yes              | Yes              | <meta name="robots" content="noindex, nofollow" />
5) Redirects        | No               | Yes (*)          | N/A
6) Canonical        | No               | Yes (*)          | <link rel="canonical" href="https://domain.com" />
7) Pagination       | No (*)           | No (*)           | <link rel="prev" href="/category.html?page=3" /> <link rel="next" href="/category.html?page=5" />
8) Hreflang         | No               | No               | <link rel="alternate" hreflang="en-GB" href="https://domain.co.uk" /> <link rel="alternate" hreflang="en-US" href="https://domain.com" />
 
  • Ad 1) E.g. filtering on category pages. Will both prevent crawling and will also prevent risk of wasting crawl budget.
  • Ad 2) The URL parameters tool in Google Search Console, e.g. for filtering parameters on category pages. Like robots.txt, it prevents crawling and prevents the risk of wasting crawl budget.
  • Ad 3) E.g. on content you do not want to be indexed e.g. blog-tag pages on a WordPress blog
  • Ad 4) E.g. on content you want neither crawled nor indexed, e.g. tip-a-friend URLs
  • Ad 5) E.g. for www versus non-www or http versus https. (*) URLs that are redirected might be remembered by Google for a long time and might still appear in SERPs, but with proper redirects you still prevent duplicate content on your website.
  • Ad 6) E.g. for case-insensitive URLs or product variants that appear on unique URLs. (*) Canonical is not an instruction to search engines in the same way that e.g. meta-robots=noindex is; canonical is perceived by Google as a signal about URLs that they might consider following.
  • Ad 7) (*) Pagination is no longer supported by Google, but it is still included on this list as it might still be supported by other major search engines like Bing and Yandex.
  • Ad 8) Hreflang is not meant to block crawling or indexing. It is meant both to prevent duplicate content and to instruct search engines about which:
    • Language/Country versions or
    • Language versions
    of different websites are relevant to display in SERPs. E.g. domain.ca is most relevant to rank in SERPs for users located in Canada compared to e.g. domain.com.
 

Tools to help you find duplicate content

  • Siteliner.com can help you find duplicate content in titles, meta-description and main text
  • Screaming Frog SEO Spider can help you find duplicate content in titles and meta-descriptions (not in main text)

Credits

Thank you to Pixabay for usage of free image: https://pixabay.com/photos/stamp-duplicate-office-867733/

