Posted

There were more issues this morning, where I could no longer log in to my site for several minutes. 

Just now I had some slow responses. Submitting an answer took over 15 seconds. I checked with downforeveryone and my site was down.

[Screenshot]

  • 1 month later...
Posted
2 hours ago, David N. said:

Both my site and invisioncommunity.com were down again just now: 

[Screenshots]

Sorry for the inconvenience. There was a short issue, which our Cloud team resolved immediately. Do not worry, our Cloud team is always around on holidays to ensure that our own and our clients' communities are up.

Posted (edited)

More downtime again this morning for my site, and this time for longer periods. For more than 40 minutes, from 9:42am to 10:23am CEST, my site was mostly down, with brief periods of excruciatingly slow uptime throwing all kinds of errors.

The issue is still ongoing at this time. 

[Screenshots]

Edited by David N.
Posted (edited)
1 hour ago, Matt said:

We are experiencing a significant DDoS attack (4.5 million requests) which we are taking steps to mitigate. 

Thanks Matt. Doesn't AWS protect against DDoS attacks? https://aws.amazon.com/shield/

1 hour ago, Matt said:

Any significant changes will be added here: https://status.invisioncommunity.com

I don't understand why, as with the downtime on December 23rd, "History & Incidents" tracks the downtime for invisioncommunity.com but not the downtime for the U.S. Cloud service (which I'm using). It reports 100% uptime when the real figure is far from it.

Edited by David N.
Posted

Remember there are multiple servers and components involved. You’re thinking in terms of single servers where it’s binary… up or down and there is nothing in between. 

In enterprise architectures, there might be an outage that affects only a percentage of users. It could literally be anywhere between 0 and 100. Only users in one geographic location might have problems, or only users who happened to connect to one portion of the network might be impacted.

Most monitoring services (if they're worth anything) check from multiple locations and report a failure when enough reporting stations agree there is a problem. So they might not catch isolated or regional issues that do not have widespread impact.
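As a rough illustration of the "enough reporting stations agree" idea, here is a minimal sketch. The location names, the `quorum_outage` function and the 50% threshold are all hypothetical; real monitoring services use their own rules, which this thread does not describe.

```python
from typing import Mapping


def failure_rate(probe_ok: Mapping[str, bool]) -> float:
    """Fraction of probe locations that reported a failed check."""
    if not probe_ok:
        raise ValueError("no probe results supplied")
    failures = sum(1 for ok in probe_ok.values() if not ok)
    return failures / len(probe_ok)


def quorum_outage(probe_ok: Mapping[str, bool], quorum: float = 0.5) -> bool:
    """Report an outage only when more than `quorum` of the probes failed.

    The 0.5 default is an illustrative assumption, not any vendor's setting.
    """
    return failure_rate(probe_ok) > quorum


# Hypothetical probe results: one region fails, the rest succeed.
regional = {"frankfurt": False, "london": True, "new-york": True, "sydney": True}
# Hypothetical probe results: most regions fail.
widespread = {"frankfurt": False, "london": False, "new-york": False, "sydney": True}

print(quorum_outage(regional))    # one of four probes failed: below the quorum
print(quorum_outage(widespread))  # three of four probes failed: above the quorum
```

With a rule like this, a problem seen from a single location would not be reported as an outage, while one seen from most locations would.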

Posted
15 minutes ago, Randy Calvert said:

So they might not catch isolated or regional issues that do not have widespread impact.

This was not an isolated or regional issue. Here's what uptrends.com measured during the downtime: 

[Screenshot]

Posted

So has this problem been fixed since? While I noticed no "slowing" of our cloud community, users suddenly get random 403 "Request could not be satisfied" errors while using our site's REST API.

Posted
2 hours ago, InfinityRazz said:

So has this problem been fixed since? While I noticed no "slowing" of our cloud community, users suddenly get random 403 "Request could not be satisfied" errors while using our site's REST API.

A 403 is different from an outage error. It typically means you do not have permission. You will want to check how you're using the API and ensure that you're not sending requests excessively, that you always include a user agent, and that you follow other typical best practices.
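A minimal sketch of those practices on the client side: always send a User-Agent header, and back off exponentially instead of retrying immediately. The `send` callable, the header value and the set of retryable status codes are hypothetical. In particular, treating 403 as retryable assumes it comes from request throttling; a genuine permission error will not clear up on retry.

```python
import random
import time
from typing import Callable, Dict

# Assumption: these statuses may indicate throttling or a temporary problem.
RETRYABLE = {403, 429, 503}


def call_with_backoff(
    send: Callable[[Dict[str, str]], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send(headers)` until it returns a non-retryable status.

    `send` is a placeholder for whatever performs the HTTP request and
    returns its status code. Waits base_delay * 2**attempt seconds plus
    random jitter between attempts.
    """
    # Hypothetical identifier; use something that names your application.
    headers = {"User-Agent": "ExampleClient/1.0 (admin@example.com)"}
    status = 0
    for attempt in range(max_attempts):
        status = send(headers)
        if status not in RETRYABLE:
            return status
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return status
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.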

Posted
8 hours ago, Jim M said:

A 403 is different from an outage error. It typically means you do not have permission. You will want to check how you're using the API and ensure that you're not sending requests excessively, that you always include a user agent, and that you follow other typical best practices.

I'll admit one use case was my fault. I was trying to use 'ExecuteAsync' instead of 'PostAsync' to generate an OAuth token (whoops).

However (and we have tested with multiple users), ever since December 25/26, once a user connects to our site on their 3rd to 5th account, ALL of their connected tokens start throwing 403 exceptions and end up crashing their active sessions.

Upon login we should only be calling oAuth/token for the token -> /core/me for the user ID/email -> nexus/purchases for the user's purchases. We then query every active license (1-10 licenses) for its custom fields plus its corresponding records DB entry, and send a periodic heartbeat to /core/hello to verify token validity and the site connection. That doesn't seem like an "excessive" number of REST API calls to me.
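To put a number on the login sequence described above, here is a rough count. It assumes one request each for oAuth/token, /core/me and nexus/purchases, and two requests per license (custom fields plus the records entry). Those per-step counts are assumptions read from the description, not measured values, and the heartbeat is left out because its interval is not stated.

```python
FIXED_LOGIN_REQUESTS = 3   # oAuth/token, /core/me, nexus/purchases (assumed one call each)
REQUESTS_PER_LICENSE = 2   # custom fields + records DB entry (assumed one call each)


def login_request_count(licenses: int) -> int:
    """Estimated number of REST API requests for one account login."""
    if licenses < 0:
        raise ValueError("licenses must be non-negative")
    return FIXED_LOGIN_REQUESTS + REQUESTS_PER_LICENSE * licenses


def burst_request_count(accounts: int, licenses_per_account: int) -> int:
    """Estimated requests when several accounts log in from one client."""
    return accounts * login_request_count(licenses_per_account)


print(login_request_count(1))      # 5 requests for one license
print(login_request_count(10))     # 23 requests for ten licenses
print(burst_request_count(5, 10))  # 115 requests for five accounts with ten licenses each
```

Under these assumptions a single login is small, but several accounts connecting from the same client in quick succession add up. Whether that matters depends on how requests are counted on the server side, which this thread does not state.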

Keep in mind that we haven't changed the client side of the API calls in over a year.

Posted

More server errors again today (just a few minutes ago). Again, geographically widespread, and both on my site and this site. 

The downtime is tracked by uptrends.com and downforeveryoneorjustme.com, but not by status.invisioncommunity.com.

[Screenshots]

Posted

Also getting this issue: tons of 4XX errors on my REST API calls in the last hour or so. I couldn't connect to this site via PC but had no issue via mobile, however 🤷‍♂️ Some of my users have been having issues since December 24/25 at this point, and at least one of the sites mentioned above by @David has shown multiple servers offline the whole week, while the Invision status tracker shows no problem.

Posted
2 hours ago, Stuart Silvester said:

We're still actively mitigating the DDOS attack on some sections of our network. This isn't a network-wide issue and does not affect all customers.

4xx errors are likely to be WAF related, such as making too many requests in a short time.

So again, we haven't changed how we handle REST API requests in a long while. The library performing those requests hasn't been recompiled in nearly a year, as it had been stable until 7 days ago.

This seems to be the most common error our users get (again, we have not changed the type or frequency of calls):
[Screenshot]

Sure, we've had a small influx of new users this month, but they're just replacing users who have already left our community. I would like some clarification on what classifies as "too many requests" and a "short time", please.
We hardly ever have 65 unique users online on the website or making requests via OAuth at any given moment, nor do I see any immediate indication in the admin panel of our site hitting our user cap.
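For reference, "too many requests in a short time" is commonly expressed as a maximum number of requests per time window. The thread does not give the actual limits, so the numbers below are placeholders. This is a minimal client-side sliding-window sketch that could be used to keep a client under whatever limit applies; it is not a description of how the WAF itself works.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls within any `window_seconds` span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def allow(self) -> bool:
        """Return True and record the request if it fits in the window."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) < self._max:
            self._stamps.append(now)
            return True
        return False


# Placeholder limit: 100 requests per 60 seconds (not a documented value).
limiter = SlidingWindowLimiter(max_requests=100, window_seconds=60.0)
if limiter.allow():
    pass  # send the request here
```

A client would call `allow()` before each request and wait or queue the request when it returns False.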

According to the admin panel, the "Active user" graph:
[Chart screenshot]
Bandwidth usage:
[Screenshot]
I don't see anything in the system or error logs to indicate that my users are having connection issues, nor are REST API requests made with an OAuth token actually logged (they're supposed to be when the option is checked, no?), so I can't see what the users are actually doing or experiencing.

Could it be that they're hitting a node that's under attack and therefore can't connect to the site properly? (They do report slow connection speeds via browser as well.) The majority of users are fine, but it's the ones affected who complain the loudest 🤷‍♀️

  • 3 weeks later...
Guest
This topic is now closed to further replies.