
IPS Cloud - downtime on 21st March



Hi, I would like to point out in advance that I am starting this topic to hear the official position of IPS, not to make fun of anyone. You constantly dismiss external infrastructure in favor of your cloud. So how do you explain your TOTAL DOWNTIME of 12h 42m in the last 7 days? I would like to add that my dedicated server has had 100% UPTIME for a year. If I were paying that much money for a cloud and seeing a 500 error for 8 hours, I would have given up on the service pretty quickly. Now imagine that during the downtime some big business forum has important Live Topics.


Edited by Grafidea

  • Management

Unfortunately we experienced an issue with AWS this morning, and even AWS is not sure what is wrong at this point. It was not really down for as long as the report there shows, because we put up a message and did not remove it until we were sure everything was good.

Any system can have downtime events, even something like ours with redundancy built in. It is regrettable, especially when we do everything right to avoid it but something outside our control breaks. Especially when no one, not even your provider, can tell you what went wrong.

This is really the first major downtime event in years but of course that does not matter in the moment. Frustrating all around 🙂


4 minutes ago, Charles said:

Unfortunately we experienced an issue with AWS this morning, and even AWS is not sure what is wrong at this point. It was not really down for as long as the report there shows, because we put up a message and did not remove it until we were sure everything was good.

Any system can have downtime events, even something like ours with redundancy built in. It is regrettable, especially when we do everything right to avoid it but something outside our control breaks. Especially when no one, not even your provider, can tell you what went wrong.

This is really the first major downtime event in years but of course that does not matter in the moment. Frustrating all around 🙂

That's how I see it too, but it was a grossly long downtime.


And by the way...  my site was down as well from 11:01PM onwards.  I'm not exactly happy about the situation either.  However I'm going to let IPS work with AWS to determine the root cause of the problem.  My plan is to follow up in a few days to learn more because this type of outage warrants the investigation.  

Like anything online, things can and do break. The question is what you learn from each instance and how you improve going forward. A true root cause analysis looks not just at what went wrong and why, but also at what can be done to prevent it from occurring again. And I think these guys know that and will work through that process.


  • Management

We did experience an outage, however it was 6 hours, not 12 hours.

Our status page monitors this forum (which was down for 6 hours) and the US cloud platform (also down for 6 hours). Your screenshot adds them together.

This doesn't make it right, any extended outage is not ideal, but I just wanted to fact check the 12 hours claim which is double the downtime that occurred.

As Charles said, it is very unusual for us to experience this, so we are incredibly frustrated. Our uptime for 2022 was 100%, and so far this year including this event, it's 99.68% so it really is a very rare occurrence.

It doesn't mean we're being complacent, but it illustrates that this isn't the usual experience.
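As a rough check of the figures above, the arithmetic can be sketched as follows. The assumptions, which are not stated in the thread, are that the year-to-date window runs from 1 January 2023 to the end of 21 March 2023 and that the single 6-hour outage was the only downtime in that window. The monitor names are illustrative.

```python
from datetime import date


def uptime_percent(downtime_hours: float, window_hours: float) -> float:
    """Share of the window during which the service was up, in percent."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return 100.0 * (1.0 - downtime_hours / window_hours)


# Assumed window: 1 January 2023 up to the end of 21 March 2023.
window_days = (date(2023, 3, 22) - date(2023, 1, 1)).days  # 80 days
window_hours = window_days * 24                             # 1920 hours

# One 6-hour outage, as stated above.
ytd_uptime = uptime_percent(6, window_hours)

# Summing per-monitor downtime counts the same outage twice:
# two monitors, each down for the same 6 hours, add up to 12 hours.
per_monitor_downtime_hours = {"forum": 6, "us_cloud_platform": 6}
summed_hours = sum(per_monitor_downtime_hours.values())
wall_clock_hours = max(per_monitor_downtime_hours.values())

print(f"Year-to-date uptime: {ytd_uptime:.4f}%")
print(f"Summed across monitors: {summed_hours} h, wall clock: {wall_clock_hours} h")
```

Under these assumptions the year-to-date figure comes out at 99.6875%, which is consistent with the 99.68% quoted.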


  • Management
37 minutes ago, Randy Calvert said:

And by the way...  my site was down as well from 11:01PM onwards.  I'm not exactly happy about the situation either.  However I'm going to let IPS work with AWS to determine the root cause of the problem.  My plan is to follow up in a few days to learn more because this type of outage warrants the investigation.  

Like anything online, things can and do break. The question is what you learn from each instance and how you improve going forward. A true root cause analysis looks not just at what went wrong and why, but also at what can be done to prevent it from occurring again. And I think these guys know that and will work through that process.

Thanks Randy,

We're taking this very seriously. More information will follow.


2 hours ago, Charles said:

This is really the first major downtime event in years but of course that does not matter in the moment. Frustrating all around 🙂

Hopefully it will be the last. It is a crucial time for me to grow my forums, competing in the space around a particular video game that is sunsetting its official forums in favor of Discord. I am actively in a bidding war trying to sell my site to new users who were unable to access the site for 6 hours. Luckily it's back at all and this wasn't a worse disaster.

 

1 hour ago, Matt said:

Our status page monitors this forum (which was down for 6 hours) and the US cloud platform (also down for 6 hours). Your screenshot adds them together.

I have to ask, how was this page up and the rest of Invision was not accessible? Is there any information public from Amazon that I could look into? 

Edited by Shawn RR

8 minutes ago, Shawn RR said:

I have to ask, how was this page up and the rest of Invision was not accessible?

IPS uses a 3rd party monitoring company (uptime.com) to monitor its network.  It's not run on the IPS infrastructure.

If there is ever a situation in which there is a core critical issue, check the status page (https://status.invisioncommunity.com) to get an idea if it's just you or if it's widespread.  

11 minutes ago, Shawn RR said:

Is there any information public from Amazon that I could look into? 

AWS does not publicly post this type of info, especially as it relates to their customers.   


Yeah, that sucked, I was using my forum when it went down yesterday evening, and I quickly noticed that Invision's entire site was also down, so I figure all cloud hosting customers were affected. 

What concerns me is that there was no communication about this for a long time.  Hours went by before I started seeing some changes in what was displayed, as if someone was (finally) investigating what was going on.  At one point I saw an Invision 4 Installation page, as well as a page that just said "It works!"   That concerned me even more, and again, this was before there was any communication at all.

It seems that an event of this magnitude would result in alarms going off that would get the attention of those who could work on rectifying the problem.


4 minutes ago, AtariAge said:

Yeah, that sucked, I was using my forum when it went down yesterday evening, and I quickly noticed that Invision's entire site was also down, so I figure all cloud hosting customers were affected. 

What concerns me is that there was no communication about this for a long time.  Hours went by before I started seeing some changes in what was displayed, as if someone was (finally) investigating what was going on.  At one point I saw an Invision 4 Installation page, as well as a page that just said "It works!"   That concerned me even more, and again, this was before there was any communication at all.

It seems that an event of this magnitude would result in alarms going off that would get the attention of those who could work on rectifying the problem.

It should, but we are talking about AWS here and customers are not exactly at the top of their list.  Invision staff keeping us updated on what Amazon is doing is a noble goal, but just like most admins don't know what is happening at their server's data center... IPS is kind of stuck with the rest of us.


Just now, UncrownedGuard said:

It should, but we are talking about AWS here and customers are not exactly at the top of their list.  Invision staff keeping us updated on what Amazon is doing is a noble goal, but just like most admins don't know what is happening at their server's data center... IPS is kind of stuck with the rest of us.

That wasn't my point, my point is that if someone was aware of this as soon as Invision's site was down (especially given the nature of it where it took down their entire cloud service), they could have started investigating immediately.  Since Invision is using a third-party monitoring service, I assume it's possible to send notifications via several different means (SMS, emails, etc.) when an event like this occurs.  And if the service they are using now doesn't support that type of functionality, there are plenty of alternatives that do. 
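A minimal sketch of the kind of alerting described here. It is not a description of how uptime.com or Invision's monitoring actually works. The `Monitor` class, its `record` method, and the `notify` callback are hypothetical names; a real setup would feed the result of an HTTP probe into `record` and plug SMS or email delivery into `notify`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Monitor:
    """Alert once when a check has failed `threshold` times in a row."""

    notify: Callable[[str], None]
    threshold: int = 3
    failures: int = 0
    alerted: bool = False

    def record(self, ok: bool) -> None:
        """Record one check result and notify on state changes."""
        if ok:
            if self.alerted:
                self.notify("RECOVERED: site is responding again")
            self.failures = 0
            self.alerted = False
            return
        self.failures += 1
        # Alert only once per outage, not on every failed check.
        if self.failures >= self.threshold and not self.alerted:
            self.notify(f"DOWN: {self.failures} consecutive failed checks")
            self.alerted = True


# Usage: collect messages in a list instead of sending SMS or email.
sent: List[str] = []
monitor = Monitor(notify=sent.append, threshold=3)
for result in [True, False, False, False, False, True]:
    monitor.record(result)
print(sent)
```

Requiring several consecutive failures before alerting is one common way to avoid paging someone over a single dropped request; the threshold of 3 is arbitrary.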


  • Management
Just now, AtariAge said:

That wasn't my point, my point is that if someone was aware of this as soon as Invision's site was down (especially given the nature of it where it took down their entire cloud service), they could have started investigating immediately.  Since Invision is using a third-party monitoring service, I assume it's possible to send notifications via several different means (SMS, emails, etc.) when an event like this occurs.  And if the service they are using now doesn't support that type of functionality, there are plenty of alternatives that do. 

Yes, we do just that. You can subscribe to the status page and get email updates.


Just now, Charles said:

Yes, we do just that. You can subscribe to the status page and get email updates.

Given it was hours before anything was posted at all (I was looking at that page quite frequently), it wasn't particularly helpful in this case.  And I don't need that status page to see that my forum was down, but it was useful to see that it wasn't just my site that was affected.  When did Invision actually start working on this issue?   Again, better communication would have been useful, rather than being left in the dark for hours.


6 hours of downtime is simply unacceptable. I'm a huge fan of Invision, love the amazing software they develop and maintain for us. But when I pay 150/month for a hosted service, I expect more. In 20 years of being self hosted on inexpensive shared hosting services, I've never had more than a few minutes of downtime at a time. 

The fact that Invision relies on AWS and that the issue may lie with AWS is irrelevant to us, the customers. As a professional, if I can't deliver a service to my customer because Invision is down, I can't charge my customers. If my forum is down, I can't earn revenue from ads. I can't just tell the customers (or the ad network) "it's not my fault, the software cloud service I'm using was down" and continue to charge them the same amount. I have no choice but to take responsibility and suffer the loss in revenue and reputation. 

Invision should take responsibility here, offer $ compensation to paying customers, and deal with whatever service providers they use on their own terms. As Invision customers, that's not our problem.

Edited by David N.

I would like to reply to this thread just to encourage Invision to continue to offer self-hosting options, and as new features become available I would like to see them made available to self-hosting customers. I’m a big fan of this software, but I by far prefer the self-hosting option. I appreciate the hosted option, but it is not what I prefer.


6 hours ago, AtariAge said:

That wasn't my point, my point is that if someone was aware of this as soon as Invision's site was down (especially given the nature of it where it took down their entire cloud service), they could have started investigating immediately.  Since Invision is using a third-party monitoring service, I assume it's possible to send notifications via several different means (SMS, emails, etc.) when an event like this occurs.  And if the service they are using now doesn't support that type of functionality, there are plenty of alternatives that do. 

Oh yes, that would actually be a great idea.

6 hours ago, Charles said:

Yes, we do just that. You can subscribe to the status page and get email updates.

I actually didn't know this was a thing 😅


  • Management

We share your frustration and I assure you that although this is the first major outage event in years and we have a solid 99%+ record, we are taking this very seriously and have already taken both internal and vendor related steps to minimize the risk of this happening again. We have also made improvements to our status and communications system so you remain better informed throughout these unlikely events. 

I'm deeply sorry for the inconvenience this event has caused. Cloud clients with further concerns are welcome to send a support email and we will be happy to discuss. 

Thank you!


With multi-region and multi-AZ redundancy and availability promising upwards of five 9s (assuming the application is built to conform to these requirements), how exactly does this happen without AWS knowing why nearly immediately, never mind many hours later?

I'm baffled by their response, especially given the reliability they claim as a major selling point of not just the cloud but specifically their offerings to mitigate this even reactively. Something is tremendously off with the situation, and I'm hoping to hear some discovery, lessons learned, and mitigations from IPS, as this undoubtedly impacts the selling strength of their own cloud solution.
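For reference, the downtime that a given number of nines allows can be worked out directly. This is plain arithmetic over a 365-day year and says nothing about what AWS or Invision actually commit to; the function name is illustrative.

```python
def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per 365-day year permitted at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)


# Downtime budget for two through five nines.
budgets = {
    level: allowed_downtime_minutes_per_year(level)
    for level in (99.0, 99.9, 99.99, 99.999)
}

for level, minutes in budgets.items():
    print(f"{level}% availability allows about {minutes:.2f} minutes per year")
```

Five nines works out to roughly 5.3 minutes per year and four nines to roughly 52.6 minutes, so a 6-hour outage (360 minutes) is well outside either annual budget.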


  • Management
10 hours ago, Clover13 said:

With multi-region and multi-AZ redundancy and availability promising upwards of five 9s (assuming the application is built to conform to these requirements), how exactly does this happen without AWS knowing why nearly immediately, never mind many hours later?

I'm baffled by their response, especially given the reliability they claim as a major selling point of not just the cloud but specifically their offerings to mitigate this even reactively. Something is tremendously off with the situation, and I'm hoping to hear some discovery, lessons learned, and mitigations from IPS, as this undoubtedly impacts the selling strength of their own cloud solution.

There is much more to a SaaS platform than just servers and services. There are microservices that tie these services together. We believe it was a microservice that we have used for over four years that failed. Our investigation is ongoing with AWS.

As Lindy mentioned, please get in touch if you want to lodge a formal complaint and we will do our best to resolve it.

I want to underline again that we are proud of our platform stability, with 99.99% uptime over several years. Incidents like this are rare, but we do not take them lightly and have implemented an audit of alert systems, uptime monitoring, service statuses, etc. to ensure this event is not repeated.


On 3/22/2023 at 4:30 PM, David N. said:

6 hours of downtime is simply unacceptable. I'm a huge fan of Invision, love the amazing software they develop and maintain for us. But when I pay 150/month for a hosted service, I expect more. In 20 years of being self hosted on inexpensive shared hosting services, I've never had more than a few minutes of downtime at a time. 

The fact that Invision relies on AWS and that the issue may lie with AWS is irrelevant to us, the customers. As a professional, if I can't deliver a service to my customer because Invision is down, I can't charge my customers. If my forum is down, I can't earn revenue from ads. I can't just tell the customers (or the ad network) "it's not my fault, the software cloud service I'm using was down" and continue to charge them the same amount. I have no choice but to take responsibility and suffer the loss in revenue and reputation. 

Invision should take responsibility here, offer $ compensation to paying customers, and deal with whatever service providers they use on their own terms. As Invision customers, that's not our problem.

Is there somewhere in the terms of the cloud service that states 100% uptime? Do you call your internet provider and ask for compensation when it goes down as well?

If this were a consistent pattern of downtime, I would be upset. However, this is the first time my community has gone down, and I understand it happens. The way I see it, I am paying for a service, and if it continues to go down I have the option to look elsewhere. For my internet provider, I don't have any other option 🥲.

Hard to foresee unexpected downtime...

 
