[RESOLVED] SaaS - EU outage on 2019-05-22 - #20190522[A]
Incident Description
On 2019-05-22 at 03:08 AM, our SaaS-EU datacenter became unreachable.
From the outside world, it looked like a complete outage.
- Incident Start Time: 2019-05-22 - 03:08 AM.
- Restoration Time:
  - LRC traffic: 2019-05-22 - 10:00 AM.
  - GUI/API: 2019-05-22 - 06:45 PM.
- Service(s) Impact Duration:
  - LRC traffic: 412 minutes (6 hours, 52 minutes).
  - GUI/API: 937 minutes (15 hours, 37 minutes).
- Severe service impact: OSS (GUI and API) and LRC were unresponsive. From the outside world, it looked like a complete outage.
- Non-impacted services: N/A.
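
Each impact duration is the elapsed time between the incident start and the corresponding restoration time. A minimal sketch of that arithmetic in Python, assuming all timestamps share one time zone and follow the format used in this report (the function names are illustrative and not part of any tool mentioned here):

```python
from datetime import datetime

# Timestamp layout used in this report, e.g. "2019-05-22 - 03:08 AM".
FMT = "%Y-%m-%d - %I:%M %p"


def outage_minutes(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two report timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as 'H hours, M minutes'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours} hours, {rest} minutes"


start = "2019-05-22 - 03:08 AM"
lrc = outage_minutes(start, "2019-05-22 - 10:00 AM")   # LRC traffic restored
gui = outage_minutes(start, "2019-05-22 - 06:45 PM")   # GUI/API restored

print("LRC traffic:", lrc, "minutes,", as_hours_minutes(lrc))
print("GUI/API:", gui, "minutes,", as_hours_minutes(gui))
```

Running it prints the minute count and the hours/minutes breakdown for each service.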
Timeline (UTC)
- [2019-05-22 - 03:08 AM] : EQX availability alert is raised.
- [2019-05-22 - 03:35 AM] : Operations team confirms that the service is down.
- [2019-05-22 - 03:39 AM] : Infrastructure team escalates to providers; it is a network issue.
- [2019-05-22 - 08:06 AM] : The issue is still ongoing. We will have to put a recovery plan in place. This will take a while.
- [2019-05-22 - 10:00 AM] : Recovery is happening. LRC Network Server is back. GUI is coming back up again as well.
- [2019-05-22 - 11:00 AM] : Service has been running in degraded mode for approx. 1 hour, until further notice.
- [2019-05-22 - 01:45 PM] : Degraded mode, with the platform running on a single site. Traffic is flowing on the LRC Network Server. GUI & API are unstable.
- [2019-05-22 - 06:45 PM] : Site B recovered and online. LRC traffic and GUI recovered; status under observation.
Root Cause Analysis
The main root cause is a faulty switch; why it affected both sites so severely is still under investigation.
The fault was not obvious, and diagnosis was very difficult because admin access also went through this switch.
When the network came back, the system had to handle a large burst of messages as gateway connections were reestablished and the gateways' buffered messages were dequeued.
It was also flooded with alarms and statistics.
The short-term action is to replace the faulty switch and to create an access for each site over dedicated links.
The long-term action is to improve how the system handles load after a severe network issue.
To that end, we went on site to gather traces, which we will analyse.
This section will be updated when we know more.
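
For illustration only, one common way to absorb a reconnection burst like the one described above is to drain buffered messages through a token bucket, so that the backlog is processed at a bounded rate instead of all at once. The sketch below is an assumption for discussion, not the remediation chosen for this platform; the class, the function, and the rate and capacity values are all hypothetical.

```python
import time
from collections import deque


class TokenBucket:
    """Allow about `rate` messages per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_take(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def drain(backlog, bucket, handle):
    """Handle backlog messages while tokens remain; return how many were handled."""
    handled = 0
    while backlog and bucket.try_take():
        handle(backlog.popleft())
        handled += 1
    return handled


# Hypothetical demo with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=100, capacity=50, clock=lambda: fake_now[0])
backlog = deque(range(1000))      # stand-in for buffered gateway messages
processed = []

first = drain(backlog, bucket, processed.append)    # limited by the initial burst
fake_now[0] += 1.0                                  # one second passes
second = drain(backlog, bucket, processed.append)   # limited by the refilled tokens
print(first, second, len(backlog))
```

Messages that do not get a token simply stay in the backlog for the next pass, which trades recovery latency for a bounded load on the system.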
Actility Support