bonbec

Google will ignore the noindex in the robots.txt as of September 1, 2019


An interesting read:

https://webmasters.googleblog.com/2019/07/a-note-on-unsupported-rules-in-robotstxt.html


Get the latest Responsive osCommerce CE (community edition) here.

(Live   : OsC 2.2, php 5.4 & UTF-8  |  Local : Phoenix for future shop)


Thank you for this information.



Regards
-----------------------------------------
Loïc

Contact me by skype for business
Contact me @gyakutsuki for an answer on the forum

 


Should not be a problem; most pages can simply replace noindex with Disallow: to stop Google indexing. I personally haven't used noindex for a long time.


 


You can't use a robots.txt Disallow directive to stop Google indexing. You have to use a noindex meta tag for that, which has nothing to do with robots.txt.
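To illustrate the distinction (a hypothetical example with made-up paths): Disallow lives in robots.txt and only controls crawling, while the noindex directive has to be delivered by the page itself, either as a meta tag or as an X-Robots-Tag response header.

```
# robots.txt -- stops compliant bots from *crawling* these URLs,
# which is not the same thing as keeping them out of the index
User-agent: *
Disallow: /checkout/
```

```html
<!-- In the page's <head> (or sent as an "X-Robots-Tag: noindex" HTTP header) --
     stops the page from being *indexed*, but only works if crawling is allowed -->
<meta name="robots" content="noindex">
```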


Google says not to try to block indexing using the robots.txt file (see the "You should not use robots.txt..." section). Their reason for the change is that they are trying to establish a standard, which they will probably achieve. So we all need to start adjusting our thinking to what they want. :(

3 minutes ago, Hotclutch said:

you can't use a robots.txt disallow directive to stop google indexing

I know it has nothing to do with indexing; it is, however, one of the recommended alternatives listed by Google, and I have been using it for years. As Google says, if you have content you don't wish to be seen, then you can password protect it or use Disallow. If, however, you don't wish it to be indexed but still wish it to be seen, then you have to use one of the other alternatives. As always, if you're not sure, get professional help.

" For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:

  • Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
  • 404 and 410 HTTP status codes: Both status codes mean that the page does not exist, which will drop such URLs from Google's index once they're crawled and processed.
  • Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google's index.
  • Disallow in robots.txt: Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won’t be indexed.  While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.
  • Search Console Remove URL tool: The tool is a quick and easy method to remove a URL temporarily from Google's search results."
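The crawling side of the Disallow behaviour quoted above can be checked with Python's standard `urllib.robotparser`, which applies the same robots.txt matching rules a crawler would (the rules and URLs here are invented for illustration):

```python
from urllib import robotparser

# Parse a small, hypothetical robots.txt held in memory
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /checkout/",
])

# Crawling anything under /checkout/ is forbidden; everything else is allowed.
# Note this only answers "may I fetch this URL?" -- it says nothing about
# whether the URL ends up in a search engine's index.
print(rp.can_fetch("Googlebot", "https://www.example.com/checkout/basket"))   # False
print(rp.can_fetch("Googlebot", "https://www.example.com/products/widget"))   # True
```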

 

16 minutes ago, JcMagpie said:
  • Disallow in robots.txt: Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won’t be indexed.  While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.

This is not true, and most often misunderstood.

If you have something in the index, then putting disallow in the robots.txt won't cause it to drop out of the index. In fact it will now stay there forever, because google cannot crawl the URL to see a noindex directive.

Alternatively, if you don't have something in the index, and you put a disallow in the robots.txt because you think it will prevent search engines from listing the content, then you're mistaken, because an external link to that URL will cause the search engine to still list the URL.

There are only two ways to prevent indexing:

1) a meta noindex in the header, or

2) a 301 redirect of the URL.

A URL that 404s eventually drops out of the index, but search engines continue to crawl it indefinitely, with reduced frequency between crawls. And there's doubt as to how Google handles a 410 response code.
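For what it's worth, here is a rough sketch of what the header and status-code options look like server-side, using only Python's standard library (the paths are made up, and a real shop would of course do this in its own stack rather than a toy handler):

```python
import http.server
import threading
import urllib.request

# Hypothetical set of URLs that have been permanently removed
GONE_PATHS = {"/old-product.html"}

class RobotsAwareHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in GONE_PATHS:
            # 410 Gone: tells crawlers the page has been removed for good
            self.send_response(410)
            self.end_headers()
        else:
            # X-Robots-Tag is the HTTP-header equivalent of <meta name="robots">
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("X-Robots-Tag", "noindex")
            self.end_headers()
            self.wfile.write(b"<html><body>ok</body></html>")

    def log_message(self, *args):
        pass  # keep the demo quiet

# Quick demo: bind to port 0 so the OS picks a free port
srv = http.server.HTTPServer(("127.0.0.1", 0), RobotsAwareHandler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
resp = urllib.request.urlopen(
    f"http://127.0.0.1:{srv.server_address[1]}/page.html")
print(resp.status, resp.headers["X-Robots-Tag"])  # prints: 200 noindex
srv.shutdown()
```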


Thank you for your feedback. I'm happy with my understanding of Google's recommendations. Others will have to decide what's best for their website for themselves. As I said above...

1 hour ago, JcMagpie said:

As always, if you're not sure, get professional help.

It's not a big issue, as all you need to do is turn on the Robot NoIndex header_tags module in CE, so most people should be fine.


 


If this is implemented in September, then for what purpose will we use the robots.txt file?

2 hours ago, Allen Solly said:

If this is implemented in September, then for what purpose will we use the robots.txt file?

The only thing I put in my robots.txt file is a link to a sitemap. But putting Disallow entries in robots.txt can be useful if you're trying to optimise your crawl budget.
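For reference, a minimal robots.txt along those lines might look like this (the domain and paths are invented for the example):

```
Sitemap: https://www.example.com/sitemap.xml

User-agent: *
Disallow: /checkout/
```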


Very useful information. Thank you.

 

//PB

