
Submitted URL has crawl issue / Failed: Crawl anomaly


Recommended Posts

Posted

Since the end of June my site has had a quickly growing number of forum pages that Google cannot properly crawl; see the screenshots below (Screen 1).

When I open any of these links in Google Search Console (see Screen 2) and run the "Test Live URL" check, there are zero issues. All of the reported issues are for "Crawled as Googlebot smartphone."

I did a check of my change log for any changes I made just before this issue began, and I found the following:

6/5/2019 - Upgraded to IPB 4.4.4.
6/5/2019 - Upgraded to PHP 7.3 w/ memcache from PHP 7.1 w/ memcache
6/11/2019 - Applied patch for Invision Community v4.4.4

I have switched to using PHP 7.2 to see if this changes anything. Since the issue could be related to the IPB upgrades I ran during that time, I am checking to see if anyone else is having this issue.

Screen 1

[screenshot]

Screen 2

[screenshot]

Posted

I may go this route, but in all honesty, tickets on any topic related to Google indexing issues do not seem to be well received by your team.

 

 

Also, I am testing this as a possible fix, and will report back if it works. We reported this error:

 

Posted

Preloading (or not) a CSS file is not going to cause a "crawl anomaly". You're welcome to try his feedback, but I would be shocked if it had any impact on the issue you're reporting here.

Posted

Hi, we have had the same issue since the end of June this year:

[screenshot]

All of them are supposed to have a crawl issue, but none of the pages I have checked manually have any issues. They can be opened and viewed normally. A live test on the pages in Google Search Console performs well, and I can also resubmit a URL for indexing afterwards. The "Validate fix" button fails after a few days; I have started it more than 10 times since then. It runs and then fails. I have no explanation for this.

The "funny" thing: you can see when the page has been last crawled in the list of examples below the graph. I have checked my logs. On the date the page has been last crawled there were not any requests from Google bot for these pages. 


Posted

Interesting. This is why I posted: I suspected that others are having this issue. Your time period matches mine, but it looks like your issues may be getting resolved, as I see a nice drop in errors around 9/15. So far I've not seen a drop.

Posted

@sadams101, this is my assumption about what is going on:

Google changed its algorithm for calculating crawl budget in June, and it now reports these errors when it runs out of crawl budget. This would also explain why there are no requests for the pages in question on the last-crawled date. To reduce these errors:

  1. I have reviewed my robots.txt and excluded everything that includes parameters like sort, sortby, sortdirection, etc. I have also excluded profiles, calendar weeks, blogs and everything that has "noindex" due to lack of content. I have also excluded any technical URL starting with /application/ and so on.
  2. I have also reviewed the URL parameters section in Google Search Console and excluded those parameters there as well.
  3. I have excluded some "unimportant" apps from sitemap generation.
  4. I have limited the number of URLs in the sitemaps that are generated. There are now 500 to 1,000 URLs instead of ALL of them.

It seems that this helps to reduce the number of errors. You can indeed see that the issues tend to get resolved. I cannot confirm right now that it is solved, but I can see the reduction. If you have a large project, try to reduce the number of crawled pages and see if it helps.

Posted
36 minutes ago, sadams101 said:

Any chance you can share your robots.txt file with us?

It heavily depends on the apps you have installed, how extensively you use them, and how important they are for your SEO. This file cannot be used as-is by everyone; here is just an example.

User-agent: *
Disallow: /profile/
Disallow: /notifications/
Disallow: /applications/
Disallow: /calendar/*/week/
Disallow: /*?*sortby=
Disallow: /*?*sort=
Disallow: /*?*sortdirection=
Disallow: /*?*desc=
Disallow: /*?*d=
Disallow: /*?*tab=reviews

Sitemap: https://www.example.com/sitemap.php

You have to check whether it suits YOUR project and triple-check the file before uploading. This can ruin your SEO if you make a mistake.
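One more hint for that triple check: Python's built-in robots.txt parser does not understand the * wildcards Googlebot supports, so below is my own rough helper (not an official Google tool) that translates Disallow patterns with * and an optional trailing $ into regular expressions and reports which sample URLs would be blocked. It ignores Allow rules and longest-match precedence, so treat it only as a sanity check; the robots.txt tester in Search Console remains the authoritative test.

import re

def pattern_to_regex(pattern):
    """Translate a Googlebot-style Disallow pattern ('*' wildcard, optional
    trailing '$' anchor) into an anchored regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

DISALLOW = [
    "/profile/",
    "/notifications/",
    "/applications/",
    "/calendar/*/week/",
    "/*?*sortby=",
    "/*?*sort=",
    "/*?*sortdirection=",
    "/*?*desc=",
    "/*?*d=",
    "/*?*tab=reviews",
]

SAMPLE_URLS = [
    "/forums/topic/44830-helpful-tips/",              # should stay crawlable
    "/forums/topic/44830-helpful-tips/?sortby=date",  # should be blocked
    "/profile/123-example/",                          # should be blocked
    "/calendar/1-events/week/2019-09-23/",            # should be blocked
]

if __name__ == "__main__":
    rules = [(raw, pattern_to_regex(raw)) for raw in DISALLOW]
    for url in SAMPLE_URLS:
        hit = next((raw for raw, rx in rules if rx.match(url)), None)
        print(url, "->", "blocked by " + hit if hit else "allowed")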

Posted

Thank you for this. I've seen similar robots.txt files with similar entries, like the one below. Any idea why some do these differently?

Disallow: /*sortby=*

Would it be possible for you to share a screenshot of your google URL parameters?

Posted

One ongoing issue that may cause this is something I've noticed with caching. Occasionally I get errors that my site is not mobile friendly. When this happens, I run a live test and see that my mobile skin was cached incorrectly, so the right column appears as it does in the desktop version. I then need to clear my cache using the support tool and re-test, and then all is OK.

I believe that this might be the real culprit with these errors. I just discovered that this was happening, and fixed it. The question is, why does it keep happening? Certainly it could be a bug in my custom skin, but does anyone else have this issue? Does it happen in the default skin?

Posted
15 hours ago, sadams101 said:

I've seen similar robots.txt files with similar entries, like the one below. Any idea why some do these differently?

Disallow: /*sortby=*

It is the same. My line makes sure that "sortby" appears in a parameter and is not part of the URL path. On the other hand, a path cannot contain "=" in it, so the other pattern would also only match a parameter 😉

15 hours ago, sadams101 said:

Would it be possible for you to share a screenshot of your google URL parameters?

[screenshot of the URL parameters settings]

  • 2 weeks later...
Posted

I am still having issues with the crawl anomaly, and I'm not sure this is a resource issue. I've noticed some problems with the pages after running the Live Test on a page with this error and then choosing "View Tested Page." For some reason, Google cannot load most of the images on my page, and there are JavaScript errors. Here is a screenshot of what Google sees on one of these pages; look to the right and you will see a broken link where my logo should be:

[screenshot]

 

When I select the "More Info" tab at the right I see 

Page resources
26/40 couldn't be loaded...examples below (each of these load fine, yet they do not load for Googlebot smartphone):
 
 
Other error
Script
 
 
 
JavaScript console messages - I am not sure what these errors mean:
3 messages
 
Error
00:12.000
Uncaught ReferenceError: ips is not defined at https://www.celiac.com/forums/topic/44830-helpful-tips/:11787:4
 
Error
00:15.000
Uncaught ReferenceError: $ is not defined at init (https://www.celiac.com/forums/topic/44830-helpful-tips/:12283:1)
 
 
But, the errors are increasing:
image.thumb.png.1b3130805b0d49a74bb8d636e96c7146.png
 
Any help would be appreciated.
Posted

PS - I just tried uploading a new logo in my custom skin, then refreshed the view and fetched the page again as Googlebot smartphone, and look at the screenshot to the right: the page is the desktop version! I suspect what is happening is a caching issue, where the page is not mobile friendly, thus the error. Clearing the cache in the support tool fixes this, but it keeps coming back:

[screenshot]

Posted
2 hours ago, sadams101 said:

For some reason, Google cannot load most of the images on my page, and there are JavaScript errors.

There is a quota for the testing tools, similar to the crawl budget:

Quote

 Google has a quota per site for the number of requests it's WILLING to make to the particular server. Partly to avoid an accidental DOS (ie if Google made requests without restraint it could quickly overwhelm most servers, Google has more servers than most sites!), but also as a 'resource sharing' system, to avoid devoting too much bandwidth to any particular site, and then not being able to index other sites.
....
So some requests 'fail' because Google proactively aborted the request - without contacting the server - because it WOULD potentially push it over quota

Source https://support.google.com/webmasters/thread/2293148?hl=en

We have checked the "issue" by starting testing tool AND viewing our live server logs at the same time. The resources that are supposed to have crawl issues have not even been requested by Google bot. Google fails but not because your site fails. It fails because it "decided" not to crawl. 

We are consistently reducing the number of pages in the sitemap right now. We also block a lot of "unimportant" URLs in robots.txt and we work with parameters. While the number of indexed pages goes down, the number of errors goes down as well. There is no change in organic traffic: we "lose" pages but not traffic. The pages we have delisted were just wasting our crawl budget, so we are concentrating on the important and newest pages at the moment.
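In case it helps, here is a rough sketch of how one can track what the sitemap actually exposes while trimming it (assuming a sitemap index like the suite's /sitemap.php that links to child sitemaps; the URL below is only an example). It simply counts the <url> entries, so you can watch the total go down as you exclude apps and limit URLs.

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url):
    # Download and parse one sitemap or sitemap index.
    with urllib.request.urlopen(url, timeout=30) as response:
        return ET.fromstring(response.read())

def count_sitemap_urls(index_url):
    """Count <url> entries across a sitemap index and all of its child sitemaps."""
    root = fetch_xml(index_url)
    children = [loc.text for loc in root.findall(".//sm:sitemap/sm:loc", NS)]
    if not children:  # a plain urlset, not an index
        return len(root.findall(".//sm:url", NS))
    return sum(len(fetch_xml(child).findall(".//sm:url", NS)) for child in children)

if __name__ == "__main__":
    # Replace with your own sitemap index URL.
    print(count_sitemap_urls("https://www.example.com/sitemap.php"))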

Posted

I have a really hard time believing that my requesting one single web page on my site would push any quota that Google might have for my site. Honestly, I'm using their tool, which is supposed to help me debug any issues; why would it throw those errors if there were NO issues? It makes no sense whatsoever, and I've seen as many different explanations for this crawl anomaly as a Google search for the issue turns up--dozens, and all are different. I can pull up 20 different threads on Google's forum that all say very different things from different "Gold Product Experts" like the one you quote here.

I know you believe a crawl budget issue is causing two very different symptoms, but why wouldn't Google just list it as "Crawl Budget" directly?

Posted

@sadams101, the robots.txt above was just an example. We have even more included there, but it differs from project to project. Both of those URLs are not reachable by guests/bots in our projects, as we do not use the "Post before register" feature, so there is no need for us to exclude them.

Posted
7 hours ago, sadams101 said:

It makes no sense whatsoever

Yep, that's why I do not really pay much attention to "issues" in Google Search Console if we cannot reproduce them. As I have said: if there is NO request from Googlebot for the URL that is supposed to have a crawling issue, then there is not much I can do about it, is there?

7 hours ago, sadams101 said:

but why wouldn't google just list it as "Crawl Budget" directly? 

I also wish Google would not return a generic "crawl issue" error when there is nothing wrong with the URL itself.

Posted
19 hours ago, sadams101 said:

 I suspect what is happening is a caching issue, where the page is not mobile friendly, thus the error. 

Just a note - this is not actually possible. We do not serve a different version of the page based on whether the request is mobile or not - we serve the same exact HTML to every single client. It's the CSS code that determines when to move things around, hide things, etc. based on the device's display. This is a technique known as "responsiveness".

Just wanted to be clear...it's not possible that a "non mobile friendly" page could ever be served.

Posted
2 hours ago, bfarber said:

Just a note - this is not actually possible. We do not serve a different version of the page based on whether the request is mobile or not - we serve the same exact HTML to every single client. It's the CSS code that determines when to move things around, hide things, etc. based on the device's display. This is a technique known as "responsiveness".

Just wanted to be clear...it's not possible that a "non mobile friendly" page could ever be served.

I took a screenshot yesterday when it happened. It was a desktop page served to Googlebot smartphone, hence my theory about a caching issue; notice the right column showing up in the screenshot.

 

Posted

I opened a thread on this topic in Google's forums

https://support.google.com/webmasters/thread/16988763?hl=en

where, at least in my case, they noticed that I had a large number of 301 redirects in my site's links. I have fixed many of these, with the exception of the one below, which shows up for guest posters. Does anyone know how to set up a FURL to make this become https://www.celiac.com/login/ ?

Posted

I do see a FURL already set up for this type of link, but for some reason it does not seem to be working in the "post as guest" field; the original URL is showing up there. Should I put the template bit you shared somewhere to fix that?

[screenshot]

Posted

It sounds like you're saying the non-FURL version is being used, and if a guest clicks it they're redirected to the FURL, is that right?

Where is the non-FURL version showing up exactly?

Archived

This topic is now archived and is closed to further replies.
