
Recommended Posts

Posted
8 minutes ago, My Sharona said:

When is the elephant in the Cloud going to be addressed?

 

 

We understand your frustration, and certainly don't want you to be experiencing issues with our cloud environment. We are working to have this resolved as soon as possible.

Posted

Lovely. I just wasted half an hour trying to upload and save images to the server. After watching to make sure each image was uploaded, then watching the progress bar inch along at a snail's pace, I get the following... I don't have half an hour to waste.

[Attached screenshot]

Posted
1 minute ago, My Sharona said:

Lovely. I just wasted half an hour trying to upload and save images to the server. After watching to make sure each image was uploaded, then watching the progress bar inch along at a snail's pace, I get the following... I don't have half an hour to waste.

[Attached screenshot]

As said above, we're looking into this and will have a resolution as soon as we can. We certainly understand that nobody wants to be experiencing issues like that.

Posted
Just now, Marc said:

We understand your frustration, and certainly don't want you to be experiencing issues with our cloud environment. We are working to have this resolved as soon as possible.

Marc, myself and others are not referring to specific instances - well, we are, but the point you responded to is the ongoing, endless disruptions. You guys try to squash discussion of it by closing threads and urging people to use tickets so there is no public disgust on display. When is it going to be rectified? I experienced downtime at my most trafficked times over the last two weeks. It is not a good situation for me or, as I can read, for others.

Posted
2 minutes ago, My Sharona said:

Marc, myself and others are not referring to specific instances - well, we are, but the point you responded to is the ongoing, endless disruptions. You guys try to squash discussion of it by closing threads and urging people to use tickets so there is no public disgust on display. When is it going to be rectified? I experienced downtime at my most trafficked times over the last two weeks. It is not a good situation for me or, as I can read, for others.

Thanks for reaching out. From what you've written, you're wondering when this will be fully resolved. As I said above, I totally understand your concern. We're currently looking into it and doing our best to find a solution as quickly as possible. Until then, there isn't much more information I can provide. I'll keep you posted as soon as I have more. Thanks for your patience!

1 minute ago, My Sharona said:

Lol, and now the content must be approved?

 

Too damn funny.

Yes, this is to avoid repeating ourselves. I appreciate your understanding as we work to address the situation. While there's no censoring happening, it's not effective to keep answering the same question multiple times.

Posted

I may have migrated to the wrong platform. I say that for two reasons. First, because we've been here for six days and have experienced outages and very poor response times. Not something we are willing to pay for.

Second, if my comments have to be approved by a moderator then I KNOW I'm in the wrong place.

I will be taking this up with the people who moved me here.

Posted

Not that it's particularly helpful, but I just wanted to add a +1 to the above.

Our users are reporting very slow response times and multiple instances of "this community is temporarily unavailable".

I know this is your bread and butter now so I know it will be fixed, but equally it is frustrating that the system status says everything is fine, when it's clearly not!

  • Management
Posted

Thank you all for your patience, I have approved every single post that was made both good and bad, and will hopefully address some of the points raised and the issues from the past 2-3 days.

It's been a very busy few days, but here's a brief rundown of events.

We were alerted to short (5-8 minute) bursts of a sharp increase in response times starting on the 21st.

[Attached image: bar chart of response times]

These bursts didn't last long, which made them hard to track down. Our Cloud platform is made up of several components, each of which could cause latency issues. We have a WAF to filter traffic, a CDN to create short-term caches for guest traffic, ElastiCache (in a read/write cluster), MySQL database clusters (multiple read/write), and then the processing layer where PHP lives.

The bursts between 9/21 and 9/23 only affected about 15% of our customers due to how the database clusters are segregated, but coincided with increased uptake of 4.7.18. One of the main changes in 4.7.18 was to how often the write MySQL servers are used. The write servers are really good at writing (insert/delete/update) but less useful for complex select queries. One downside of using a read/write separation is the replication delay: you can insert a record on the write server, and this then has to be copied across to the read servers. So, when we recount the last post in a topic and forum, recount the number of comments, and so on, we run that select query on the write server so we know it has the latest data. This is fine, but it puts a heavy load on the write servers. So, in .18 we removed the select queries from the write server and added a task to recount again every five minutes or so, just in case there are any odd issues from race conditions on busy sites (and we have some super busy sites - one currently has 36,000 active sessions).
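The trade-off described above can be reduced to a toy model: a write server that always has the latest data versus a replica that trails behind it. This is a minimal Python sketch with logical "ticks" standing in for replication time; the class and numbers are illustrative, not Invision's actual implementation.

```python
import itertools

class ReplicatedStore:
    """Toy primary/replica pair: writes land on the primary at a logical
    tick, and become visible on the replica `lag` ticks later."""

    def __init__(self, lag=2):
        self.lag = lag
        self.clock = itertools.count()
        self.writes = []  # list of (tick, row)

    def insert(self, row):
        self.writes.append((next(self.clock), row))

    def count_on_primary(self):
        # Always sees every write - but adds load to the write server.
        return len(self.writes)

    def count_on_replica(self):
        # Only sees writes that have had `lag` ticks to replicate.
        now = next(self.clock)
        return sum(1 for t, _ in self.writes if now - t >= self.lag)

store = ReplicatedStore(lag=2)
for post in ("a", "b", "c"):
    store.insert(post)

primary_count = store.count_on_primary()
replica_count = store.count_on_replica()
print(primary_count, replica_count)  # prints: 3 2
```

Counting on the primary is always accurate but loads the write server; counting on the replica is cheap but can trail by the replication lag, which is why the recount query originally ran on the write server.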

After a lot of debugging, we tracked the issue down to the use of ElastiCache to manage the locking flags when recounting. Busy sites couldn't lock fast enough with ElastiCache, as there is a very tiny window of replication lag. So instead of the expensive recount query running just once, it would run 3, 5, or 10 times before the lock became visible. Multiply this across all sites and it increased load at the database level due to InnoDB locking and unlocking rows.
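That race can be sketched the same way: a lock flag is written to one cache node but read back from a copy that trails by a write or two. This is a deliberately simplified toy model - the class and the lock name are hypothetical, not ElastiCache's API.

```python
class LaggedCacheLock:
    """Toy cache lock: set() lands on the write node immediately, but
    get_replicated() reads a copy that trails by `lag` writes - the tiny
    replication window described above."""

    def __init__(self, lag=1):
        self.lag = lag
        self.write_log = []  # ordered keys that have been set

    def set(self, key):
        self.write_log.append(key)

    def get_replicated(self, key):
        # The read node has applied all but the last `lag` writes.
        return key in self.write_log[:len(self.write_log) - self.lag]

def recount_with_cache_lock(cache, attempts):
    """Each pass checks the (lagged) read node before locking; the race
    window lets more than one pass run the expensive recount."""
    executed = 0
    for _ in range(attempts):
        if not cache.get_replicated("recount-lock"):
            cache.set("recount-lock")  # set, but not yet replicated
            executed += 1              # expensive recount runs here
    return executed

runs_with_lag = recount_with_cache_lock(LaggedCacheLock(lag=1), 5)
runs_without_lag = recount_with_cache_lock(LaggedCacheLock(lag=0), 5)
print(runs_with_lag, runs_without_lag)  # prints: 2 1
```

With any lag at all, more than one pass "wins" the lock and the expensive recount runs repeatedly; with no lag it runs exactly once.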

We tried several interventions which seemed to work, but then randomly failed a day later. This is very frustrating for us, and very frustrating for you.

Yesterday, we found a solution: a hotfix, deployed to all 4.7.18 sites on our platform, that uses database locking instead. It drives up database I/O a little, but not enough to cause concern, and we have already rewritten this recounting feature for .19 to use a task with more robust locking that is proven to avoid race conditions.
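The shape of that fix can be sketched as a lock that lives on a single authoritative node, so check-and-set is atomic and there is no replication window to race through. This is a Python stand-in; the post doesn't give details, but the real hotfix presumably uses the database's own locking primitives (MySQL offers GET_LOCK, for example), and the lock name here is hypothetical.

```python
import threading

class SingleNodeLock:
    """Toy database-style lock: one authoritative node, so acquisition is
    an atomic check-and-set with no replica to lag behind."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._held = set()

    def try_acquire(self, name):
        with self._mutex:  # atomic: no window between check and set
            if name in self._held:
                return False
            self._held.add(name)
            return True

    def release(self, name):
        with self._mutex:
            self._held.discard(name)

db = SingleNodeLock()
# Five workers race to recount; only the first acquisition succeeds.
recounts_run = sum(1 for _ in range(5) if db.try_acquire("recount-lock"))
print(recounts_run)  # prints: 1
```

Because every acquisition goes through the same node, the recount runs exactly once per cycle, at the cost of a little extra I/O on that node - the trade-off the post describes.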

Yesterday we did see random latency issues that affected most sites on and off between 10am and 12pm EST, with peaks around 10am and 11:30am. This was in the ElastiCache layer, which we're still working on, although we have made configuration changes to stabilise it.

This has taken a few days to get to the bottom of and has involved multiple members of our engineering team and some long days, so thank you for your patience. These bursts happened so quickly (in relative terms - I know it feels like forever when you experience it on your site) that our external status monitoring doesn't pick them up, but rest assured, our internal monitoring does. It's very loud and impossible to ignore. 😄

I'll address some of the comments:

20 hours ago, Dave MacDonald said:

Not that it's particularly helpful, but I just wanted to add a +1 to the above.

Our users are reporting very slow response times and multiple instances of "this community is temporarily unavailable".

I know this is your bread and butter now so I know it will be fixed, but equally it is frustrating that the system status says everything is fine, when it's clearly not!

As mentioned above, these bursts are over before our external status monitoring picks them up, and they often don't affect all sites due to the way the MySQL clusters are set up.

3 hours ago, Alex Duffy said:

All good today!

Thanks, we had resolved most of the issues last night.

2 hours ago, David N. said:

My site (and this site) are again very slow on and off right now. Some pages load fast, some take 4-6 seconds to load. 

We were running some very short-term ElastiCache configuration tests. We were monitoring the response times but needed a few minutes to gather some data. This lasted about 8 minutes total.

Again, thank you for your patience. I know it can seem like nothing is happening, but we have strong internal monitoring and have been focused on resolving these latency issues. A large, complex platform like ours can be quite organic and tough to diagnose, as GitLab found out when experiencing similar random latency issues.

On 9/24/2024 at 12:53 PM, Gary Lewis said:

I may have migrated to the wrong platform. I say that for two reasons. First, because we've been here for six days and have experienced outages and very poor response times. Not something we are willing to pay for.

Second, if my comments have to be approved by a moderator then I KNOW I'm in the wrong place.

I will be taking this up with the people who moved me here.

I can only apologise. The past few days are not indicative of our normal service. I'll reply to your ticket in more detail.

Posted (edited)
24 minutes ago, Matt said:


I’m currently self-hosted, but plan on switching to the cloud offering in the future. I appreciate everyone sharing their experiences so far, as well as the detailed breakdowns of what’s being done to identify issues and how they’re being addressed.

It’s not ideal for anyone when this kind of stuff happens, but this open communication will result in continuous improvements.

 Thanks everyone. 

Edited by Mike G.
  • Management
Posted
1 minute ago, Mike G. said:

I’m currently self-hosted, but plan on switching to the cloud offering in the future. I appreciate everyone sharing their experiences so far, as well as the detailed breakdowns of what’s being done to identify issues and how they’re being addressed.

It’s not ideal for anyone when this kind of stuff happens, but this open communication will result in continuous improvements.

 Thanks everyone. 

Thanks Mike. It's never nice when these issues arise, but we are responsive to them and work hard to resolve them. 99.9% of the time it's smooth sailing, but these 0.1% events sure are memorable. 😄

Posted
3 hours ago, Matt said:
On 9/24/2024 at 6:53 AM, Gary Lewis said:

I may have migrated to the wrong platform. I say that for two reasons. First, because we've been here for six days and have experienced outages and very poor response times. Not something we are willing to pay for.

Second, if my comments have to be approved by a moderator then I KNOW I'm in the wrong place.

I will be taking this up with the people who moved me here.

I can only apologise. The past few days are not indicative of our normal service. I'll reply to your ticket in more detail.

I did get a very good set of personal responses from Matt, who is part of Invision's management. I'm satisfied that they are working through a very complex situation and are learning from this issue. I really like the platform and am hopeful that this was a one-time glitch.
