Matt

SEO: Improving crawling efficiency

No matter how good your content is, how accurate your keywords are or how precise your microdata is, inefficient crawling reduces the number of pages Google will read and store from your site.

Search engines need to look at and store as many of the pages on the internet as possible. There are an estimated 4.5 billion active web pages today. That's a lot of work for Google.

It cannot look at and store every page, so it needs to decide what to keep and how long to spend on your site indexing pages.

Right now, Invision Community is not very good at helping Google understand what is important and how to get there quickly. This blog article runs through the changes we've made to improve crawling efficiency dramatically, starting with Invision Community 4.6.8, our November release.


The short version
This entry will get a little technical. The short version is that we remove a lot of pages from Google's view, including user profiles and the filters that create faceted pages, and we remove many redirect links, all to reduce crawl depth and cut down on thin content of little value. Instead, we want Google to focus wholly on topics, posts and other key user-generated content.

Let's now take a deep dive into what crawl budget is, the current problem, the solution, and finally a before and after analysis. Note that I use the terms "Google" and "search engines" interchangeably; there are many other wonderful search engines available, but most people understand what Google is and does.

Crawl depth and budget
In terms of crawl efficiency, there are two metrics to think about: crawl depth and crawl budget. The crawl budget is the number of links Google (and other search engines) will spider per day. The time spent on your site and the number of links examined depend on multiple factors, including site age, site freshness and more. For example, Google may choose to look at fewer than 100 links per day from your site, whereas Twitter may see hundreds of thousands of links indexed per day.

Crawl depth is essentially how many links Google has to follow to reach and index a page. The fewer links it takes to get to a page, the better. Generally speaking, Google will reduce indexing of links that are more than 5 to 6 clicks deep.

The current problem #1: Crawl depth
A community generates a lot of linked content. Many of these links, such as permalinks to specific posts and redirects that scroll to new posts in a topic, are very useful for logged-in members but less so for spiders. These links are easy to spot; just look for "&do=getNewComment" or "&do=getLastComment" in the URL. Indeed, even guests would struggle to use these convenience links, given that unread tracking is only available when logged in. Although they offer no clear advantage to guests and search engines, they are prolific, and following them results in a redirect, which increases the crawl depth for content such as topics.
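
To illustrate (the URLs below are made up for the example), a convenience link such as:

https://example.com/topic/123-example-topic/?do=getLastComment

does not point at the content itself. It returns a 301 redirect to wherever the last comment currently lives, for example:

https://example.com/topic/123-example-topic/page/4/#comment-98765

Every one of these links costs the spider an extra request and an extra level of depth before it reaches a page it has almost certainly already seen.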

The current problem #2: Crawl budget and faceted content
A single user profile page can have around 150 redirect links to existing content. User profiles are linked from many pages. A single page of a topic will have around 25 links to user profiles. That's potentially 3,750 links Google has to crawl before deciding if any of it should be stored. Even sites with a healthy crawl budget will see a lot of their budget eaten up by links that add nothing new to the search index. These links are also very deep into the site, adding to the overall average crawl depth, which can signal search engines to reduce your crawl budget.

Filters are a valuable tool to sort lists of data in particular ways. For example, when viewing a list of topics, you can filter by the number of replies or when the topic was created. Unfortunately, these filters are a problem for search engines as they create faceted navigation, which creates duplicate pages.
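
As a sketch of the problem (the parameter names here are illustrative rather than an exact list of what Invision Community uses), the same topic list becomes reachable at several URLs:

https://example.com/forum/12-example-forum/
https://example.com/forum/12-example-forum/?sortby=replies
https://example.com/forum/12-example-forum/?sortby=start_date&sortdirection=desc

Each variation shows largely the same topics in a different order, so the crawler spends budget fetching and comparing pages that add nothing new to the index.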


The solution
There is a straightforward solution to the problems outlined above. We can ask Google to avoid indexing certain pages. We can help by using a mix of hints and directives to ensure pages without valuable content are ignored, and by reducing the number of links needed to reach the content. We have used "noindex" in the past, but this still eats up crawl budget, as Google has to crawl the page to learn we do not want it stored in the index.
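
For reference, the "noindex" mentioned above is typically delivered as a meta tag in the page's <head> (or as an equivalent X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">

The page stays out of the index, but the crawler still has to fetch the page to see the tag, which is exactly why this approach saves no crawl budget.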

Fortunately, Google supports a hint called "nofollow", which you can apply as a rel attribute on the <a> tag that wraps a link. This sends a strong hint that the link should not be followed at all. However, Google may choose to follow it anyway, which means we also need a special file that contains firm instructions for Google on what to follow and index.
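
As a minimal sketch (the link itself is illustrative), the hint is applied as a rel attribute on the anchor:

<a href="https://example.com/topic/123-example-topic/?do=getLastComment" rel="nofollow">Go to last comment</a>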

This file is called robots.txt. We can use it to write rules that ensure search engines don't waste their valuable time on links that have no valuable content, links that create faceted navigation issues and links that lead to a redirect.

Invision Community will now create a dynamic robots.txt file with rules optimised for your community, or you can create custom rules if you prefer.

136976533-8b04322a-1b50-4ed6-85f7-b86663367982.png

The new robots.txt generator in Invision Community
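
As a hedged sketch of the kind of rules involved (these lines are illustrative, based on the URL patterns discussed above; the file generated for your community may differ):

User-agent: *
Disallow: /profile/
Disallow: /*do=getNewComment
Disallow: /*do=getLastComment
Disallow: /*sortby=

Each Disallow line tells crawlers that honour these patterns not to request matching URLs, so the budget is spent on topics and posts instead.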

Analysis: Before and after
Using a popular SEO site audit tool, I took a benchmark crawl of my test community, which has 50 members and around 20,000 posts, most of them populated from RSS feeds so they contain actual content, including links. There are approximately 5,000 topics visible to guests.

Once I had implemented the "nofollow" changes, removed a lot of the redirect links for guests and added an optimised robots.txt file, I completed another crawl.

Let's compare the data from the before and after.

First up, the raw numbers show a stark difference.

crawl_before_and_after.png

Before our changes, the audit tool crawled 176,175 links, of which nearly 23% were redirect links. After, just 6,389 links were crawled, with only 0.4% being redirect links. This is a dramatic reduction in both wasted crawl budget and crawl depth. Simply by guiding Google away from thin content such as profiles, leaderboards, online lists and redirect links, we can ask it to focus on content such as topics and posts.

Note: You may notice a large drop in "Blocked by Robots.txt" in the 'after' crawl despite using a robots.txt file for the first time. The calculation here also includes sharer images and other external links, which are blocked by those sites' robots.txt files. I added nofollow to the external links for the 'after' crawl so they were not fetched and then blocked externally.

crawl_depth_before.png

As we can see in the 'before' crawl, the crawl depth has a low peak between 5 and 7 levels deep, with a strong peak at 10+.

crawl_depth_after.png

After, the peak crawl depth is just 3. This will send a strong signal to Google that your site is optimised and worth crawling more often.

Let's look at a crawl visualisation before we made these changes. It's easy to see how most content was found via table filters, which led to a redirect (the red dots), dramatically increasing crawl depth and reducing crawl efficiency.

vis_before.png

Compare that with the after, which shows a much more ordered crawl, with all content discoverable as expected without any red dots indicating redirects.

vis_after1.png

Conclusion
SEO is a multi-faceted discipline. In the past, we have focused on ensuring we send the correct headers, use the correct microdata such as JSON-LD and optimise meta tags. These are all vital parts of ensuring your site is optimised for crawling. However, as we can see in this blog, without focusing on crawl budget and crawl efficiency, even the most accurately presented content is wasted if it is not discovered and added to the search index.

These simple changes will offer considerable advantages to how Google and other search engines spider your site.

The features and changes outlined in this blog will be available in our November release, which will be Invision Community 4.6.8.

Comments




Can you add a link to view the invision optimised/managed robots.txt and the ability to add additional rules along with automation?

We have additional areas on the site above and beyond what would be on the invision platform we would want to robots.txt manage and adding the lines to the automatic ones would be brilliant.


  • Management
52 minutes ago, sudo said:

Can you add a link to view the invision optimised/managed robots.txt and the ability to add additional rules along with automation? ...

Yes 🙂 If there is an existing robots.txt file, this is detected and you are prompted to download the file so you can manually add the rules to your existing file.


@Matt Really great, we really appreciate the incredible effort put into improving this. Just one suggestion: leave a way for us to add custom lines to the robots.txt in addition to the optimized version (from the screenshot above, I got the impression that we can only select between optimized vs custom). So this option would create the optimized robots.txt + add the extra lines we configure manually. Because we have some custom directives to slow down or ban some bad crawlers that we wish to keep.


  • Management
12 hours ago, Gabriel Torres said:

@Matt Really great, we really appreciate the incredible effort put into improving this. ...

Absolutely. If you already have a robots.txt file then you will be asked to download the generated version and then you can merge them in manually.

Otherwise, you can copy the rules from this guide and apply them yourself.

16 hours ago, 403 - Forbiddeen said:

OMG! I love this new function. I hope I can receive more visits now after adding this new feature. Thanks a lot.

Well, 403 Forbidden isn't a great start... 😝


This is a great update and one that I look forward to having on our community. Thanks for your work on this. This makes me very happy to be working with Invision.


I believe that your approach to the issue of links that contain ?do=find is not a good SEO solution. Providing a way to add entries to your robots.txt files is not a bad idea, but in reality just blocking Google from crawling these links, won't get rid of the biggest SEO problem that they create, and I don't believe that it's crawl budget...it's all of the 301 redirects that happen due to the fact that you use such links throughout IPS software.

I believe that the best solution is simply to rewrite your code (I had a custom plugin made to do this) so that the proper, final link is used, rather than a link containing something like ?do=findComment. If you write code that doesn't use a 301 redirect, it will eliminate both issues. I also have a robots.txt entry for these as well, but that alone did not help the 301 redirect issue, which my SEM Rush account had already flagged as a critical fix for my site's SEO.


  • Management
23 hours ago, sadams101 said:

I believe that your approach to the issue of links that contain ?do=find is not a good SEO solution. ...

I've removed a lot of the 301 links for guests. As you can see from the visualisations, the before is full of red (301s) and now they are green. This is not just from the robots.txt file, this is from removing those 301 links.


I'm very glad to hear this, and hope that you've removed all of them. If not, can you please tell me which ones were not removed? I ask because I do have a plugin that does this, which required some template changes, and it would be good to know which templates I should revert before I upgrade.


On 10/28/2021 at 12:06 PM, Stuart Silvester said:

In addition to adding relevant robots.txt rules for those redirects, we did also remove a lot of them for guest users. In Matt's last screenshot, it shows a large reduction of redirects when crawling a community.

excellent... Keep up the good work.... 🙂 


@Matt I find it laudable that you are improving how crawlers access one's site. Could you also address how one might best set up their site if they do not want any crawlers to access it at all, or not until the site is properly set up and has sufficient content to really take advantage of being crawled? That could take weeks or months depending on the expertise of a site admin and how quickly a site grows and people post content that is useful to crawl.


42 minutes ago, Chris Anderson said:

@Matt I find it laudable that you are improving how crawlers access one's site. ...

You can deny all search bots with two lines of text inside of robots.txt.

https://www.hostinger.com/tutorials/website/how-to-block-search-engines-using-robotstxt

Specifically:

User-agent: *
Disallow: /

Once you want to allow crawling, remove this and use the optimized settings provided by IPS. 




