Posted May 15

Running the latest version, Invision Community v4.7.20. My server CPU is getting pinned by Google hammering my site. The key piece of what I'm seeing is repeated crawls of the same URL with only the csrfKey differing. I've had close to 200K crawl requests in the last 24 hours because of this:

Path: /discover/    Query string: ?view=condensed&csrfKey=7a557ea6a17ebdcd56d8ada8d96ad88e
Path: /discover/    Query string: ?view=condensed&csrfKey=135fb1d7ca2cfd705a46b4ca9f887ca9
Path: /discover/    Query string: ?view=condensed&csrfKey=8d12116a444dad5ed99f718ea91da466

Any idea what could be causing this?

I have a temporary rule in Cloudflare (CF) outright blocking Googlebot until this gets figured out. I don't want it there long, though, or it could impact my search rankings.

I've also filed a report with Google about the excessive crawling and the SEO crawl budget being wasted on this crawl trap.

I can also put this in my robots.txt, which is my next step, so I can lift the CF block:

User-agent: Googlebot
Disallow: /*?*csrfKey=
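For anyone wanting to quantify this kind of crawl trap from their own server logs, here is a minimal Python sketch. It assumes a combined-format access log at a hypothetical path (/var/log/nginx/access.log); adjust the path and the user-agent filter for your setup.

```python
import re
from collections import Counter

# Hypothetical path -- point this at your server's access log.
LOG_PATH = "/var/log/nginx/access.log"

# Pull the request target out of a combined-format log line.
request_re = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*"')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:  # crude user-agent filter
            continue
        match = request_re.search(line)
        if match and "csrfKey=" in match.group(1):
            # Group by path only, so every csrfKey variant collapses together.
            counts[match.group(1).split("?")[0]] += 1

for path, hits in counts.most_common(10):
    print(f"{hits:>8}  {path}")
```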
May 15  Author

FWIW, I lifted the CF block and relied on robots.txt exclusively, but CPU spiked immediately, so Google needs time to see the robots.txt change and adjust. For now, rather than outright blocking Googlebot, I've put a Managed Challenge in place on CF. That should get it to back down more quickly without outright blocking it, though maybe that's effectively blocking them anyway. I'll see how things go over a couple of days and try some experimental lifts of the CF rule to see if Google is respecting robots.txt at that point.

But if anyone has any advice on this, please let me know!
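One wrinkle worth knowing while waiting for Google to pick up the change: Google's robots.txt parser supports * and $ wildcards, but Python's stdlib urllib.robotparser only does prefix matching, so it would wrongly report the csrfKey URLs as allowed. Below is a minimal sketch of Google-style pattern matching for sanity-checking the new Disallow line; example.com is a stand-in domain.

```python
import re
from urllib.parse import urlsplit

def disallow_matches(pattern: str, url: str) -> bool:
    """Google-style robots.txt matching: '*' is a wildcard, '$' anchors the end."""
    target = urlsplit(url)
    path = target.path + (f"?{target.query}" if target.query else "")
    regex = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in pattern
    )
    return re.match(regex, path) is not None

trap = "https://example.com/discover/?view=condensed&csrfKey=7a557ea6a17ebdcd56d8ada8d96ad88e"
print(disallow_matches("/*?*csrfKey=", trap))                             # True: blocked
print(disallow_matches("/*?*csrfKey=", "https://example.com/discover/"))  # False: still crawlable
```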
May 16  Community Expert

Are you using our default robots.txt file there, or have you added your own?
May 16  Author

6 hours ago, Marc said:
Are you using our default robots.txt file there, or have you added your own?

I have the IC defaults in there from whenever I last came across them, quite some time ago, plus some other things. But it did not have that csrfKey Disallow. Do you have a link to the latest recommendation? I thought it would be in the downloaded software package, but it isn't. I'll keep searching as well.

I found it here: https://invisioncommunity.com/4guides/advanced-options/configuration-options/seo-robotstxt-r364/

Nothing there that would block those csrfKey requests, no?
May 16  Community Expert

The recommendation would be not to have a manually added robots.txt file at all, as we have one generated by the software. That guide is actually very old. In search engine management, you should then have it set to "Invision Community optimized" under crawl management.
May 16  Author

16 minutes ago, Marc said:
The recommendation would be not to have a manually added robots.txt file at all, as we have one generated by the software. That guide is actually very old. In search engine management, you should then have it set to "Invision Community optimized" under crawl management.

I actually do use this on one of my startup sites; however, I'm not sure how it works, as there is no robots.txt generated. And that site still gets csrfKey requests via Googlebot that get blocked (I added the CF rule there as well, to track it), but not as many as the primary site that is getting hammered.

There are cases where I need to add things to a robots.txt (such as for my ad partner). How do you recommend doing that when using the "Invision Community optimized" option? Is there anywhere we can see what "Invision Community optimized" does, or the effective robots.txt?

Interestingly, the site that is getting hammered also has the "Invision Community optimized" setting enabled BUT has a robots.txt as well.
May 16  Community Expert

4 minutes ago, Clover13 said:
Is there anywhere we can see what "Invision Community optimized" does, or the effective robots.txt?

You can enable it in the ACP and then go to {your-base-url}/robots.txt. Make sure you remove any hard existing files on your server, as that URL is virtual in our software.
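To capture the effective (virtual) rules for review, a quick fetch works; a small Python sketch with a placeholder base URL:

```python
from urllib.request import urlopen

# Placeholder base URL -- substitute your community's domain.
with urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))
```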
May 16  Author

19 minutes ago, Jim M said:
You can enable it in the ACP and then go to {your-base-url}/robots.txt. Make sure you remove any hard existing files on your server, as that URL is virtual in our software.

I see, ok. Below is an excerpt of an "Invision Community optimized" generated one.

EDIT: I see the issue now. Your implementation has /discover blocked, and I see I have that lifted. Thinking it through, those are only Activity Stream links to topics Googlebot would already be crawling, so there's no value in crawling /discover.

# Rules for Invision Community (https://invisioncommunity.com)
User-Agent: *
# Block pages with no unique content
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /cookies/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /tags/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
Disallow: /*currency=
# Block faceted pages and 301 redirect pages
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
Disallow: /*?&controller=embed
# Sitemap URL
Sitemap: https://www.{mywebsite}.com/sitemap.php

Edited May 16 by Clover13
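On the ad-partner question above: since the optimized robots.txt is virtual, and a physical file on disk takes its place (per the note about removing hard existing files), one workaround is to copy the generated rules into a physical robots.txt and append your own sections. This is an assumption about a reasonable approach, not an official IPS recommendation; the example below also assumes the ad partner is Google AdSense, whose crawler is Mediapartners-Google.

# ...generated "Invision Community optimized" rules copied from above...

# Let the ad partner's crawler in everywhere (assumes AdSense / Mediapartners-Google)
User-agent: Mediapartners-Google
Disallow: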
May 16  Community Expert

Have you verified that the Googlebot is an actual Googlebot? Google has its own cloud offering where people can purchase server space, and I see that a lot myself. Simply coming from a Google IP address is not enough to verify it as an official Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot

The next question, if you have verified it as an official Googlebot, is whether you have checked Google Webmasters to see what it is reporting on that link.

Links with a CSRF key are usually action-based from a user, which we are verifying. However, the view parameter may be one in itself which we could omit, as it isn't adding much value there. I can bring this up internally.
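For reference, the forward-confirmed reverse DNS check described in that Google doc, as a minimal Python sketch (the accepted domain suffixes follow Google's documentation for Googlebot):

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS, per Google's Googlebot verification docs."""
    try:
        # Step 1: reverse lookup -- the PTR hostname must end in
        # googlebot.com or google.com for Googlebot.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.rstrip(".").endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward lookup -- the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):  # no PTR record / lookup failure
        return False

print(is_verified_googlebot("66.249.75.96"))  # True for the IP in this thread
```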
May 16  Author

Seems like legit Googlebot per that article. One of a dozen-plus IPs within the same range:

host 66.249.75.96
96.75.249.66.in-addr.arpa domain name pointer crawl-66-249-75-96.googlebot.com.

host crawl-66-249-75-96.googlebot.com
crawl-66-249-75-96.googlebot.com has address 66.249.75.96

I can see a major non-indexed spike in this timeframe in Google Search Console; the indexed count is flat. Pre-block, I see the reason given for the non-indexing as "Page is not indexed: Alternate page with proper canonical tag". That canonical resolves to /discover.

With that said, there is value in /discover being indexed for my site (i.e., keeping it off the robots.txt Disallow list, which requires a custom robots.txt), since it is the main landing page (Activity Stream), so I have to manage the bot activity to it. I'm just unclear as to why Googlebot would repeatedly send crawls with csrfKeys.

At this point, Googlebot has slowed down on the csrfKey requests due to the CF Managed Challenge and the robots.txt update I made last night. I see /discover is indexed in GSC, so I should be good in that respect.

Edited May 16 by Clover13