
Why 100% indexing isn't possible, and why that's okay


When it comes to topics like crawl budget, the conventional wisdom has always been that it's a problem reserved for large websites (which Google classifies as having more than one million pages) and medium-sized websites with a high rate of content change.

However, in recent months, crawling and indexing have become more and more popular topics in SEO forums and in questions asked to Googlers on Twitter.

From my own anecdotal experience, websites of varying sizes and change frequencies have, since November, seen greater fluctuation and reporting changes in Google Search Console (in both the crawl stats and coverage reports) than they once did.

A number of the major coverage changes I’ve seen have also coincided with unconfirmed Google updates and high volatility reported by SERP sensors/monitors. Since none of the websites have much in common in terms of stack, niche, or even technical issues, is this an indication that 100% indexing (for most websites) isn’t possible now, and that’s fine?

This seems logical.

Google explains, in its own documentation, that the web is expanding at a pace that far exceeds its own ability and means to crawl (and index) every URL.


In the same documentation, Google identifies a number of factors that affect crawl capacity and crawl demand, including:

  • The popularity of your URLs (and content).
  • Its staleness (how often the content changes).
  • How quickly the site responds.
  • Google’s knowledge (perceived inventory) of the URLs on your site.

From conversations with Google’s John Mueller on Twitter, the popularity of your URL is not necessarily influenced by the popularity of your brand and/or industry.

I have first-hand experience of a major publisher whose content isn’t indexed because it isn’t unique enough compared to similar content already published online – it falls below the quality line and doesn’t have a high enough SERP inclusion value.

For this reason, when working with websites of a certain size or type (e.g., e-commerce), I set the expectation from day one that 100% indexing is not always a measure of success.

Indexing tiers and shards

Google has been quite open to explaining how indexing works.

They use tiered indexing (with some content stored on faster, better servers for quicker access) and a serving index stored across a number of data centers, which primarily holds the data that is served in a SERP.

Oversimplifying this:

The contents of the web page (the HTML document) are tokenized and stored across shards, and the shards themselves are indexed (much like a glossary) so they can be queried faster and more easily for specific keywords when a user searches.
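To make that oversimplification a little more concrete, here is a toy sketch in Python of a sharded inverted index. It is purely illustrative – the shard count, tokenization, and hashing scheme are invented for the example and bear no relation to how Google actually implements its serving index.

```python
from collections import defaultdict

NUM_SHARDS = 4  # invented for the example; real systems use far more shards


def tokenize(html_text: str) -> list[str]:
    # Naive tokenization: lowercase the text and split on whitespace.
    return html_text.lower().split()


# Each shard maps a token -> set of URLs whose content contains that token.
shards = [defaultdict(set) for _ in range(NUM_SHARDS)]


def index_document(url: str, html_text: str) -> None:
    # A document is assigned to a shard by hashing its URL.
    shard = shards[hash(url) % NUM_SHARDS]
    for token in tokenize(html_text):
        shard[token].add(url)


def query(keyword: str) -> set[str]:
    # Look the keyword up in every shard and merge the results.
    results = set()
    for shard in shards:
        results |= shard.get(keyword.lower(), set())
    return results


index_document("https://example.com/page-1", "Blue widgets for sale")
index_document("https://example.com/page-2", "Guide to blue paint")
print(query("blue"))  # both placeholder URLs match
```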

Oftentimes, indexing issues are blamed on technical SEO. If you have noindex directives or other issues and inconsistencies that prevent Google from indexing content, then it is technical – but more often than not, it’s a value issue.
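Before assuming a value problem, it’s worth ruling out the technical causes. Below is a minimal sketch that checks a URL for the two most common noindex signals – the robots meta tag and the X-Robots-Tag HTTP header. The URL is a placeholder, and the sketch assumes the requests and beautifulsoup4 packages are installed.

```python
import requests
from bs4 import BeautifulSoup


def check_noindex(url: str) -> dict:
    """Report which (if any) noindex signals are present on a URL."""
    response = requests.get(url, timeout=10)

    # 1. X-Robots-Tag response header, e.g. "noindex, nofollow"
    header_value = response.headers.get("X-Robots-Tag", "")
    header_noindex = "noindex" in header_value.lower()

    # 2. <meta name="robots" content="noindex"> in the HTML
    soup = BeautifulSoup(response.text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    meta_noindex = bool(meta) and "noindex" in meta.get("content", "").lower()

    return {
        "url": url,
        "status_code": response.status_code,
        "x_robots_tag_noindex": header_noindex,
        "meta_robots_noindex": meta_noindex,
    }


# Placeholder URL, for illustration only.
print(check_noindex("https://example.com/some-page/"))
```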

Beneficial purpose and SERP inclusion value

When I talk about value proposition, I’m referring to two concepts from Google’s Quality Rater Guidelines (QRG), which are:

  • Beneficial purpose.
  • Page quality.

Taken together, these create something I refer to as SERP inclusion value.

This is the most common reason why web pages fall into the “Discovered – currently not indexed” category within the Google Search Console coverage report.

In the QRG, Google makes this statement:

Remember that if a page lacks a beneficial purpose, it should always be rated Lowest quality, regardless of the page’s Needs Met rating or how well-designed the page is.

What does this mean? A page can target the right keywords and tick the right boxes, but if it largely duplicates other content and lacks additional value, Google may choose not to index it.

This is where we encounter Google’s quality threshold – the concept of whether a page meets the level of “quality” required to be indexed.

A key part of how this quality threshold works is that it’s close to real-time and seamless.

Google’s Gary Illyes confirmed this on Twitter: a URL may be indexed when it is first found and then dropped when new (better) URLs are found, or it may even be given a temporary “freshness” boost from manual submission in GSC.

Working out if you have a problem

The first thing to determine is whether you’re seeing pages in the Google Search Console coverage report move from included to excluded.

This graph alone and out of context is enough to cause concern among most marketing stakeholders.

But how many of these pages do you actually care about? How many of them drive value?

You’ll be able to determine this by triangulating your data. You’ll see in your analytics platform whether traffic and revenue/leads are decreasing, and you’ll notice in third-party tools if you’re losing overall market visibility and rankings.
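One way to do that triangulation is to join an exported list of excluded URLs from the coverage report against your analytics data. The sketch below assumes two hypothetical CSV exports – excluded_urls.csv with a url column, and analytics.csv with url, sessions, and conversions columns. The filenames and column names are placeholders, not a standard export format.

```python
import csv

# Hypothetical exports - filenames and column names are placeholders.
with open("excluded_urls.csv", newline="") as f:
    excluded = {row["url"] for row in csv.DictReader(f)}

valuable_excluded = []
with open("analytics.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["url"] in excluded:
            sessions = int(row["sessions"])
            conversions = int(row["conversions"])
            # Flag excluded pages that were actually driving value.
            if sessions > 0 or conversions > 0:
                valuable_excluded.append((row["url"], sessions, conversions))

# The pages worth investigating first, busiest at the top.
for url, sessions, conversions in sorted(valuable_excluded, key=lambda r: -r[1]):
    print(f"{url}\t{sessions} sessions\t{conversions} conversions")
```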

Once you’ve determined whether valuable pages are being excluded from Google’s index, the next step is to understand why. Search Console breaks the excluded pages down into a number of categories; the most important ones to be aware of and understand are:

Crawled – currently not indexed

This is something I’ve encountered with e-commerce and real estate more than any other sector.

In 2021, the number of new business applications in the US broke previous records, and as more companies compete for users, a lot of new content is being published – but there likely isn’t a lot of new and unique information or perspective within it.

Discovered – currently not indexed

When debugging indexing, I find this a lot on e-commerce sites or websites that have deployed a large programmatic approach to content creation and have published a large number of pages at once.

The main reason pages fall into this category is crawl budget: you’ve just published a large amount of content and new URLs, significantly increasing the number of crawlable and indexable pages on the site, but the crawl budget Google has allocated to your site isn’t geared toward crawling that many new pages.

There is not much you can do to directly influence this. However, you can help Google with XML sitemaps, HTML sitemaps, and good internal linking to pass PageRank from important (indexed) pages to these new pages.
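As a practical example of the first point, an XML sitemap is straightforward to generate yourself. The sketch below builds one from a plain list of URLs using only the Python standard library; the URLs and output filename are placeholders.

```python
import xml.etree.ElementTree as ET

# Placeholder URLs - replace with the new pages you want Google to discover.
urls = [
    "https://example.com/new-category/",
    "https://example.com/new-category/product-1/",
    "https://example.com/new-category/product-2/",
]

NAMESPACE = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NAMESPACE)

for url in urls:
    url_element = ET.SubElement(urlset, "url")
    ET.SubElement(url_element, "loc").text = url

# Writes a minimal, valid sitemap.xml that can be referenced in robots.txt
# and submitted in Google Search Console.
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```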

The second reason content might fall into this category is quality. This is common with programmatic content, or on e-commerce sites with a large number of products and similar or variant PDPs (product detail pages).

Google can identify patterns in URLs, and if it visits a percentage of those pages and doesn’t find any value, it can (and sometimes does) assume that HTML documents with similar URL patterns are of equally (low) quality, and it will choose not to crawl them.
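You can approximate this pattern-level view yourself by grouping your URLs by template and checking what share of each group is indexed. The sketch below groups URLs by their first path segment; the input pairs are placeholders (in practice they might come from a crawl export combined with index-status data).

```python
from collections import defaultdict
from urllib.parse import urlparse

# Placeholder data: (url, is_indexed) pairs from a crawl/inspection export.
pages = [
    ("https://example.com/blog/post-1/", True),
    ("https://example.com/blog/post-2/", True),
    ("https://example.com/compare/widget-a-vs-widget-b/", False),
    ("https://example.com/compare/widget-a-vs-widget-c/", False),
]

groups = defaultdict(lambda: {"total": 0, "indexed": 0})
for url, is_indexed in pages:
    # Use the first path segment as a rough proxy for the page template.
    first_segment = urlparse(url).path.strip("/").split("/")[0]
    groups[first_segment]["total"] += 1
    groups[first_segment]["indexed"] += int(is_indexed)

for pattern, counts in groups.items():
    rate = counts["indexed"] / counts["total"]
    print(f"/{pattern}/  {counts['indexed']}/{counts['total']} indexed ({rate:.0%})")
```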

Many of these pages will have been created intentionally for customer acquisition, such as programmatically generated pages or comparison pages targeting niche users. But if those queries are searched with low frequency, the pages likely won’t get many eyes, and if the content isn’t unique enough compared to other programmatic pages, Google won’t index low-value content when other alternatives are available.

If this is the case, you’ll need to evaluate whether the goals can be achieved within the project’s resources and parameters without redundant pages clogging up the crawl that Google doesn’t see as valuable.

Duplicate content

Duplicate content is one of the more straightforward issues to diagnose and is prevalent in e-commerce, publishing, and programmatic content.

If the main content of the page – the part carrying the value proposition – is repeated across other websites or internal pages, Google won’t invest resources in indexing it.

This also relates to the value proposition and the concept of beneficial purpose. I’ve come across many examples where large, trusted websites have content that isn’t indexed because it’s similar to other available content – it doesn’t offer a unique perspective or a unique value proposition.
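A rough way to spot this before Google does is to measure how similar the main content of your pages is to one another. The sketch below uses a simple word-shingle Jaccard similarity; the 0.8 threshold is an arbitrary illustration, not a known Google cut-off.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Break text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Placeholder main-content extracts from two pages.
page_a = "Our blue widget is durable, lightweight and ships worldwide for free."
page_b = "Our red widget is durable, lightweight and ships worldwide for free."

score = jaccard_similarity(page_a, page_b)
if score > 0.8:  # arbitrary illustrative threshold
    print(f"Likely near-duplicates (similarity {score:.2f})")
else:
    print(f"Sufficiently distinct (similarity {score:.2f})")
```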

Actions to take

For most large websites and reasonably sized medium websites, achieving 100% indexing will only become more difficult as Google has to process ever more existing and new content on the web.

If you find valuable content considered to be below the quality threshold, what actions should you take?

  • Optimize internal links from ‘high-value’ pages: This doesn’t necessarily mean the pages with the most backlinks, but the pages that rank for a large number of keywords and have good visibility can pass positive signals through internal links to other pages.
  • Prune low-quality, low-value content: If the pages excluded from the index are of low quality and yield no value (e.g., pageviews, conversions), they should be pruned. Keeping them live wastes Google’s crawl resources when it chooses to crawl them, and it can affect Google’s assumptions about quality based on perceived URL pattern matching and inventory. A simple way to surface pruning candidates is sketched after this list.
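As referenced in the second point, here is a minimal sketch of how you might surface pruning candidates. It assumes a hypothetical analytics export named page_performance_last_12_months.csv with url, pageviews, and conversions columns; the filename, columns, and thresholds are illustrative only, and any candidate list like this should still be reviewed manually before pages are removed or consolidated.

```python
import csv

PAGEVIEW_FLOOR = 10   # illustrative thresholds - tune to your own site
CONVERSION_FLOOR = 1

prune_candidates = []
with open("page_performance_last_12_months.csv", newline="") as f:
    for row in csv.DictReader(f):
        pageviews = int(row["pageviews"])
        conversions = int(row["conversions"])
        # Pages attracting almost no traffic and no conversions are
        # candidates for pruning (removal, consolidation, or noindexing).
        if pageviews < PAGEVIEW_FLOOR and conversions < CONVERSION_FLOOR:
            prune_candidates.append(row["url"])

print(f"{len(prune_candidates)} candidate pages to review for pruning")
for url in prune_candidates[:20]:
    print(url)
```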

The opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.



About the author

Dan Taylor is Head of Search Engine Optimization (SEO) at SALT Agency, a UK-based SEO firm and 2022 Queen’s Award winner. Dan works with and oversees a team that works with companies ranging from technology and SaaS businesses to enterprise e-commerce.
