tforums Posted April 21 (edited)
In the age of AI and social media, there are more bots than ever. I recently noticed that HEAD requests account for 70% of our server load. This appears to be fueling a constant load on the database even when there is very little activity from members. If you are seeing load rise and fall in a kind of pumping pattern, this is probably the cause. I have blocked HEAD requests to mitigate this, because they come from dynamic IP sources that send no user agent and/or ignore robots.txt. However, this is not ideal, as HEAD should be saving load compared with GET requests. It seems to me the IPB system could somehow mitigate this, or is everything, even the headers, coming from the system DB?
IMO, IPB really needs to focus some tools on dealing with bots and scraping. The AI boom is literally inciting others to machine-learn our community activity for who knows what uses. This is not only hurting our server performance, it is abusing the community and members in ways we cannot foresee.
Edited April 21 by tforums
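One way to check a figure like this on your own server is to tally request methods in the web server's access log. The following is a minimal sketch, assuming the common/combined log format; the function name `method_share` and the sample log lines are made up for illustration and are not taken from the poster's logs.

```python
import re
from collections import Counter

# Matches the quoted request part of a log line, e.g. "HEAD /discovery/ HTTP/1.1".
REQUEST_RE = re.compile(r'"([A-Z]+) \S+ HTTP/[\d.]+"')


def method_share(lines):
    """Return {method: fraction of all parsed requests} for access-log lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {method: n / total for method, n in counts.items()}


# Made-up sample lines (documentation IP ranges), for illustration only.
SAMPLE = [
    '203.0.113.5 - - [21/Apr/2024:10:00:00 +0000] "HEAD /discovery/ HTTP/1.1" 200 0 "-" "-"',
    '203.0.113.9 - - [21/Apr/2024:10:00:00 +0000] "HEAD /discovery/ HTTP/1.1" 200 0 "-" "-"',
    '198.51.100.7 - - [21/Apr/2024:10:00:01 +0000] "HEAD /discovery/ HTTP/1.1" 200 0 "-" "-"',
    '192.0.2.44 - - [21/Apr/2024:10:00:02 +0000] "GET /topic/1/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for method, share in sorted(method_share(SAMPLE).items()):
        print(f"{method}: {share:.0%}")
```

Note that this measures the share of requests, not the share of load; a HEAD request that still runs the full page logic can cost as much as a GET, which is the concern raised in the post.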
tforums (Author) Posted April 21
... I cannot add to the post for some reason. But I am seeing some very abusive activity from Facebook and other scrapers, where they slam a site with 10 requests from 10 different IPs. This appears to be a shortcut for doing replication on their side. It's much faster and easier for them to hit us 10 times than to read once and then sync their backend.
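A pattern like this (the same page requested from many different IPs almost at once) can be looked for in parsed log data. Here is a rough sketch, assuming the log has already been parsed into `(timestamp_seconds, ip, path)` tuples sorted by time; the function name `find_bursts`, the one-second window, and the threshold of 10 IPs are illustrative assumptions.

```python
from collections import defaultdict, deque


def find_bursts(events, window=1.0, min_ips=10):
    """Find moments when one path is hit by many distinct IPs in a short window.

    events: iterable of (timestamp_seconds, ip, path), sorted by timestamp.
    Returns a list of (path, timestamp) for each request at which at least
    `min_ips` distinct IPs had requested `path` within the last `window` seconds.
    """
    recent = defaultdict(deque)  # path -> deque of (timestamp, ip)
    bursts = []
    for ts, ip, path in events:
        queue = recent[path]
        queue.append((ts, ip))
        # Drop requests that have fallen out of the time window.
        while queue and ts - queue[0][0] > window:
            queue.popleft()
        distinct_ips = {seen_ip for _, seen_ip in queue}
        if len(distinct_ips) >= min_ips:
            bursts.append((path, ts))
    return bursts


if __name__ == "__main__":
    # Made-up example: 10 different IPs request the same topic within half a second.
    events = [(i * 0.05, f"198.51.100.{i}", "/topic/1/") for i in range(10)]
    print(find_bursts(events))
```

Grouping by path rather than by IP matters here, since the post describes requests spread across different addresses.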
Jim M Posted April 21
Do you have an example of the request? This could be a form of attack if there are many requests at the same time, which would need to be mitigated at the server or network level for optimal results, as handling it in the software would just consume more resources.
tforums (Author) Posted April 21
Just now, Jim M said: Do you have an example of the request? This could be a form of attack if there are many requests at the same time, which would need to be mitigated at the server or network level for optimal results, as handling it in the software would just consume more resources.
FWIW, I don't think this is an attack so much as that we are a very old, large, and popular community. Yes, see above. I could send you logs all day, but that is probably not something I should post here. LMK where to send them if you want to take a look.
KT Walrus Posted April 21
Isn’t this the way the web works? HEAD requests are an optimization, so there is no need to refetch the page/file if it hasn’t changed. You wouldn’t want Google or some other search engine re-downloading something that hasn’t changed. This would waste tremendous amounts of bandwidth.
tforums (Author) Posted April 21
7 minutes ago, KT Walrus said: Isn’t this the way the web works? HEAD requests are an optimization so no need to refetch the page/file if it hasn’t changed. You wouldn’t want Google or some other search engine re-downloading something that hasn’t changed. This would waste tremendous amounts of bandwidth.
Yes, I said that. Which is why I am trying to see if IPB can cache HEAD requests instead of hitting the DB every time.
KT Walrus Posted April 22
16 hours ago, tforums said: Yes, I said that. Which is why I am trying to see if IPB can cache HEAD requests instead of hitting the DB every time.
Sorry. I didn’t understand your issue. Maybe putting Cloudflare between the bot traffic and your server would help? That is, if you don’t mind blocking most bots from scraping your site. Or maybe you could rate limit these bots with Cloudflare.
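For readers unfamiliar with rate limiting, the general idea can be sketched with a token bucket. This is an illustration of the technique only, not how Cloudflare implements or configures its rate limiting; the class name `TokenBucket` and its parameters are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    # One bucket per client key; here a single bucket with a fixed clock for clarity.
    bucket = TokenBucket(rate=1, capacity=2, now=0.0)
    print([bucket.allow(now=0.0) for _ in range(3)])
```

In practice one bucket is kept per client key. Because the traffic described in this thread comes from changing IPs, keying on the IP alone may not catch it; keying on the requested path or a network range are other options.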
tforums (Author) Posted April 23
On 4/22/2024 at 2:32 AM, KT Walrus said: Sorry. I didn’t understand your issue. Maybe putting Cloudflare between the bot traffic and your server would help? That is, if you don’t mind blocking most bots from scraping your site. Or maybe you could rate limit these bots with Cloudflare.
Thanks. Like you said, the ideal solution is to maintain the HEAD option. IPB already caches a lot of things, so it should be possible to at least cache the header for the /discovery/ pages that are being hit constantly to check for updates. This is only going to get worse as AI crawlers are deployed. Forum communities are being targeted for their embedded knowledge. Even if your site isn't getting hit a lot, I am sure it is adding some load to the server. The IPB cloud service is likely burning a lot of CPU because of this. Do us all a favor please! 😄
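The kind of header caching asked for here can be sketched as a small time-to-live (TTL) cache placed in front of whatever builds the response headers. This is a generic illustration, not IPB code; the class name `HeaderCache`, the `build_headers` callback, and the 60-second TTL are all assumptions.

```python
import time


class HeaderCache:
    """Cache response headers per path so repeated HEAD requests skip the expensive work."""

    def __init__(self, ttl_seconds, build_headers, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.build_headers = build_headers  # hypothetical expensive call (e.g. hits the DB)
        self.clock = clock
        self._entries = {}  # path -> (expires_at, headers)

    def head(self, path):
        """Return headers for `path`, rebuilding them only when the cached copy expired."""
        now = self.clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]
        headers = self.build_headers(path)
        self._entries[path] = (now + self.ttl, headers)
        return headers


if __name__ == "__main__":
    calls = []

    def slow_build(path):
        calls.append(path)  # stands in for a database-backed page build
        return {"Content-Type": "text/html"}

    cache = HeaderCache(ttl_seconds=60, build_headers=slow_build)
    for _ in range(5):
        cache.head("/discovery/")
    print(len(calls))  # the expensive build ran once for five HEAD requests
```

A reverse proxy or CDN in front of the site does essentially this, without the application having to handle the request at all.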
Jim M Posted April 23
7 minutes ago, tforums said: IPB already caches a lot of things
We no longer cache full pages/responses. This was removed some time ago in an effort to move customers towards better solutions that wouldn't use so many server resources, e.g. CDN caching. We pass a cache header, but nothing will be cached unless you have a CDN (or similar) servicing your community. Example of a HEAD request to our all activity stream. As you can see, there are caching headers which tell our CDN to cache this page for 900 seconds (15 min).
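The screenshot from this post is not reproduced here. As a rough illustration of how a shared cache such as a CDN derives a lifetime from a `Cache-Control` header, here is a simplified sketch; the header values used are assumed examples (only the 900-second figure comes from the post), and the function name `shared_cache_ttl` is hypothetical.

```python
def shared_cache_ttl(cache_control):
    """Return how many seconds a shared cache may reuse a response, or 0 if it should not.

    Simplified: handles no-store, private, no-cache, s-maxage and max-age only.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    # These directives stop a shared cache from reusing the response without
    # contacting the origin (no-cache permits storing but requires revalidation,
    # which this sketch treats as "not reusable").
    if any(d in directives for d in ("no-store", "private", "no-cache")):
        return 0

    # For shared caches, s-maxage takes precedence over max-age.
    for name in ("s-maxage", "max-age"):
        if name in directives:
            try:
                return max(0, int(directives[name]))
            except ValueError:
                return 0
    return 0


if __name__ == "__main__":
    # Assumed example header values, not copied from the original screenshot.
    print(shared_cache_ttl("public, max-age=900"))
    print(shared_cache_ttl("private, max-age=900"))
```

With a header like the first example, a CDN can answer repeated HEAD and GET requests for 15 minutes without the request reaching the community's server.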
tforums (Author) Posted April 24
On 4/23/2024 at 9:04 AM, Jim M said: We no longer cache full pages/responses. This was removed some time ago in an effort to move customers towards better solutions that wouldn't use so many server resources, e.g. CDN caching. We pass a cache header, but nothing will be cached unless you have a CDN (or similar) servicing your community.
Sorry. I have not kept up with these changes because everything has been running so well until recently. You are right, a CDN is a better option for all the long-term unknowns I am worried about.