Thank you all for your patience. I have approved every post that was made, both good and bad, and will hopefully address some of the points raised and the issues from the past 2-3 days.
It's been a very busy few days, but here's a brief rundown of events.
We were alerted to short (5-8 minute) bursts of a sharp increase in response times starting on the 21st.
These bursts didn't last long, which made them hard to track down. Our Cloud platform is made up of several components, each of which could cause latency issues. We have a WAF to filter traffic, a CDN to create short-term caches for guest traffic, ElastiCache (in a read/write cluster), MySQL database clusters (multiple read/write), and then the processing layer where PHP lives.
The bursts between 9/21 and 9/23 only affected about 15% of our customers due to how the database clusters are segregated, but they coincided with an increase in updates to 4.7.18. One of the main changes in 4.7.18 was to how often the write MySQL servers are used. The write servers are really good at writing (insert/delete/update) but less useful for complex select queries.

One downside of using a read/write separation is the replication delay. You can insert a record on the write server, and it then has to be copied across to the read servers. So, when we recount the last post in a topic and forum, recount the number of comments, etc., we run that select query on the write server so we know it has the latest data. This is fine, but it puts a heavy load on the write servers. So, in .18 we removed the select queries from the write server and added a task to recount again every five minutes or so, just in case there are any odd issues from race conditions on busy sites (and we have some super busy sites - one currently has 36,000 active sessions).
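To make that a bit more concrete, here's a rough sketch of the idea. This is purely illustrative - it's Python rather than our actual PHP, and the connection objects, table and column names are made up - but it shows the shape of the change:

```python
# Illustrative sketch only (Python, hypothetical DB-API style connections and
# made-up table/column names) - not the actual platform code.

def recount_topic(read_conn, write_conn, topic_id):
    # Counting on a read replica keeps the heavy SELECT off the write server,
    # but the replica may lag by a moment, so the count can briefly be stale.
    with read_conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM posts WHERE topic_id = %s", (topic_id,))
        (post_count,) = cur.fetchone()

    # Only a cheap UPDATE touches the write (primary) server.
    with write_conn.cursor() as cur:
        cur.execute("UPDATE topics SET post_count = %s WHERE id = %s",
                    (post_count, topic_id))
    write_conn.commit()


def periodic_recount(read_conn, write_conn, topic_ids):
    # Safety-net task run every ~5 minutes to correct any counts that
    # replication lag or a race condition left slightly out of date.
    for topic_id in topic_ids:
        recount_topic(read_conn, write_conn, topic_id)
```

The trade-off is that a count taken from a replica can be briefly stale, which is exactly what the five-minute safety-net task is there to tidy up.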
After a lot of debugging, we tracked the issue down to the use of ElastiCache to manage the locking flags when recounting. Busy sites couldn't acquire the lock fast enough with ElastiCache, as there is a very tiny window of replication lag. So instead of the expensive recount query running just once, it would run 3, 5, or 10 times before the lock was created. Multiply this across all sites and it increased load at the database level due to InnoDB locking and unlocking rows.
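To show why that bites, here's an equally rough sketch of a cache-based lock (Python with redis-py, made-up hosts and key names - not our actual code). The lock is written to the ElastiCache primary, but a worker that checks a read replica a few milliseconds later may not see it yet:

```python
# Illustrative sketch only (Python/redis-py, hypothetical hosts and key names).
import redis

cache_primary = redis.Redis(host="cache-primary.internal")   # writes go here
cache_replica = redis.Redis(host="cache-replica.internal")   # reads come from here

def maybe_recount(site_id, run_expensive_recount):
    lock_key = f"recount_lock:{site_id}"

    # Reads hit a replica; during the tiny replication-lag window this can
    # return None even though another worker has just written the lock.
    if cache_replica.get(lock_key):
        return  # someone else is already recounting

    # The lock is written to the primary, but it only becomes visible on the
    # replicas a few milliseconds later...
    cache_primary.set(lock_key, "1", ex=60)

    # ...so on a busy site several workers can reach this point and run the
    # expensive recount 3, 5, or 10 times before the lock is seen.
    run_expensive_recount(site_id)
    cache_primary.delete(lock_key)
```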
We tried several interventions which seemed to work, only to randomly stop working a day later. This is very frustrating for us, and very frustrating for you.
Yesterday, we found a solution: a hotfix, deployed to all 4.7.18 sites on our platform, that uses database locking instead. It drives up database I/O a little, but not enough to cause concern, and we have already rewritten this recounting feature for .19 to use a task with more robust locking that is proven to avoid race conditions.
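For anyone curious what database locking buys us here, the sketch below shows the general technique using MySQL's GET_LOCK() (again Python and made-up names, and not necessarily the exact mechanism in the hotfix): the lock is taken atomically on the write server itself, so there is no replication window for a second worker to slip through.

```python
# Illustrative sketch only (Python, DB-API style cursor on a connection to the
# write server, hypothetical lock name) of locking via MySQL's GET_LOCK().

def recount_with_db_lock(write_conn, site_id, run_expensive_recount):
    lock_name = f"recount:{site_id}"
    with write_conn.cursor() as cur:
        # GET_LOCK() returns 1 if the lock was obtained, 0 if another
        # connection already holds it (timeout of 0 = don't wait).
        cur.execute("SELECT GET_LOCK(%s, 0)", (lock_name,))
        (got_lock,) = cur.fetchone()
        if not got_lock:
            return  # another worker is already recounting
        try:
            run_expensive_recount(site_id)
        finally:
            cur.execute("SELECT RELEASE_LOCK(%s)", (lock_name,))
            cur.fetchone()
```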
Yesterday we did see random latency issues that affected most sites on and off between 10am and 12pm EST, with peaks around 10am and 11:30am. These were in the ElastiCache layer, which we're still working on, although we have made changes to the configuration to make it stable.
This has taken a few days to get to the bottom of and has involved multiple members of our engineering team and some long days, so I thank you for your patience. These bursts happen so quickly (in relative terms, I know it feels like forever when you experience it on your site) that our external status monitoring doesn't pick them up, but rest assured, our internal monitoring does. It's very loud and impossible to ignore. 😄
I'll address some of the comments:
As mentioned above, these bursts are over before our external status monitoring picks them up, and they often don't affect all sites due to the way the MySQL clusters are set up.
Thanks, we had resolved most of the issues last night.
We were running some very short-term ElastiCache configuration tests. We were monitoring the response times but needed a few minutes to gather some data. This lasted about 8 minutes in total.
Again, thank you for your patience. I know it can seem like nothing is happening, but we have strong internal monitoring and have been focused on resolving these latency issues. A large, complex platform like ours can be quite organic and tough to diagnose, as GitLab found out when experiencing similar random latency issues.
I can only apologise. The past few days are not indicative of our normal service. I'll reply to your ticket in more detail.