Why Google Indexes Blocked Web Pages

Google's John Mueller explains why pages disallowed in robots.txt are sometimes indexed and why the related Search Console reports can safely be dismissed

Google's John Mueller answered a question about why Google indexes pages that are disallowed from crawling by robots.txt, and why it's safe to ignore the related Search Console reports about those crawls.

Bot Traffic To Query Parameter URLs

The person asking the question documented that bots were creating links to non-existent query parameter URLs (?q=xyz) pointing to pages that carry noindex meta tags and are also blocked in robots.txt. What prompted the question is that Google is crawling the links to those pages, getting blocked by robots.txt (without seeing the noindex robots meta tag), and then reporting them in Google Search Console as "Indexed, though blocked by robots.txt."
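For context, here is a minimal sketch of the kind of robots.txt block being described. The ?q= parameter matches the question, but the rules below are hypothetical stand-ins, not the asker's actual file:

    # robots.txt (hypothetical): block crawling of any URL containing ?q=
    # Google supports the * wildcard in robots.txt paths
    User-agent: *
    Disallow: /*?q=

With a rule like this in place, Googlebot never fetches the ?q=xyz URLs, so it never sees any noindex meta tag on those pages; it only learns the URLs exist from the links pointing at them.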

Takeaways:

1. Confirmation Of Limitations Of Site: Search

Mueller's answer confirms the limitations of using the site: advanced search operator for diagnostic purposes. One of those limitations is that it's not connected to the regular search index; it's a separate thing altogether.

Google’s John Mueller commented on the site search operator in 2021:

The site: operator doesn't reflect Google's search index, making it unreliable for understanding which pages Google has or hasn't indexed. Like Google's other advanced search operators, it is unreliable as a tool for understanding anything related to how Google ranks or indexes content.

2. A noindex tag without a robots.txt block is fine for these kinds of situations, where a bot is linking to non-existent pages that are getting discovered by Googlebot. A noindex tag on a page that is not blocked by a disallow in robots.txt allows Google to crawl the page and read the noindex directive, ensuring the page won't appear in the search index. That is preferable if the goal is to keep a page out of Google's search index.
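The directive itself is standard and documented by Google, either as an HTML meta tag in the page's head or as the equivalent HTTP response header:

    <!-- in the page's <head> -->
    <meta name="robots" content="noindex">

    # or sent as an HTTP response header
    X-Robots-Tag: noindex

Either form works only if robots.txt does not block the page, which is exactly the point of Mueller's answer.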

3. URLs with the noindex tag will generate a "Crawled - currently not indexed" entry in Search Console and won't have a negative effect on the rest of the website.
These Search Console entries, in the context of pages that are purposely blocked, only indicate that Google crawled the page but did not index it. They are essentially saying that this happened, not (in this specific context) that there's something wrong that needs fixing.

This entry is useful for alerting publishers to pages that are inadvertently blocked by a noindex tag, or by some other cause that's preventing the page from being indexed. In that case, it's something to investigate.

4. How Googlebot handles URLs with noindex tags that are blocked from crawling by a robots.txt disallow but are also discoverable through links:
If Googlebot can't crawl a page, it is unable to read and apply the noindex tag, so the page may still be indexed based on URL discovery from an internal or external link.

Google's documentation of the noindex meta tag warns against using robots.txt to disallow pages that carry a noindex tag: for the rule to take effect, the page must not be blocked by robots.txt, because Google has to be able to crawl the page in order to see the directive.
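To make that warning concrete, here is a hypothetical sketch of the self-defeating combination (the /private/ path and file name are placeholders):

    # robots.txt: prevents Googlebot from fetching anything under /private/
    User-agent: *
    Disallow: /private/

    <!-- /private/page.html: never fetched, so this directive is never read -->
    <meta name="robots" content="noindex">

Because the robots.txt rule stops the crawl, the noindex tag goes unseen, and the URL can still end up indexed purely from links pointing at it.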

5. How site: searches differ from regular searches in Google's indexing process:
site: searches are limited to a specific domain and are disconnected from the primary search index, so they don't reflect Google's actual search index and are less useful for diagnosing indexing issues.
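As a usage note, the operator can still be handy for quick spot-checks, so long as the results aren't read as an indexing report; for example, something like the following (example.com is a placeholder):

    site:example.com inurl:q=

Per Mueller's point above, whatever this query returns or omits says little about what is actually in Google's index.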
