[RESOLVED] SaaS - EU outage on 2019-05-22 - #20190522A
Incident Description
On 2019-05-22 - 03:08 AM (UTC), our SaaS-EU datacenter became unreachable.
From the outside world, it looked like a complete outage.
- Incident Start Time: 2019-05-22 - 03:08 AM.
- Restoration Time:
- LRC traffic: 2019-05-22 - 10:00 AM.
- GUI/API restoration time: 2019-05-22 - 06:45 PM.
- Service(s) Impact Duration:
- LRC Traffic: 412 minutes (6 hours, 52 minutes).
- GUI/API: 937 minutes (15 hours, 37 minutes).
- Severe service impact: OSS (GUI and API) and LRC were unresponsive; from the outside world, this appeared to be a complete outage.
- Non-impacted services: N/A.
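For reference, the impact durations above follow directly from the start and restoration timestamps. A minimal Python check (illustrative only, not part of Actility tooling):

    from datetime import datetime

    # Timestamps from this report (UTC).
    start = datetime(2019, 5, 22, 3, 8)                # incident start
    restored = {
        "LRC Traffic": datetime(2019, 5, 22, 10, 0),   # LRC traffic restored
        "GUI/API": datetime(2019, 5, 22, 18, 45),      # GUI/API restored
    }

    for service, end in restored.items():
        minutes = int((end - start).total_seconds() // 60)
        print(f"{service}: {minutes} minutes ({minutes // 60} hours, {minutes % 60} minutes)")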
Timeline (UTC)
- [2019-05-22 - 03:08 AM] : EQX availability alert is raised.
- [2019-05-22 - 03:35 AM] : The operations team confirms the service failure.
- [2019-05-22 - 03:39 AM] : The infrastructure team escalates to the providers; the problem is identified as a network issue.
- [2019-05-22 - 08:06 AM] : The issue is still ongoing. A recovery plan must be put in place, which will take some time.
- [2019-05-22 - 10:00 AM] : Recovery is in progress. The LRC Network Server is back, and the GUI is coming back up as well.
- [2019-05-22 - 11:00 AM] : The platform has been running in degraded mode for approximately one hour and will remain so until further notice.
- [2019-05-22 - 01:45 PM] : Degraded mode on a single-site platform. Traffic is effective on the LRC Network Server; the GUI and API remain unstable.
- [2019-05-22 - 06:45 PM] : Site B recovered and online. LRC traffic and the GUI are restored; status remains under observation.
Root Cause Analysis
The main root cause is confirmed to be a faulty switch, but why it impacted both sites so strongly is still under investigation.
The fault was not obvious and the diagnosis was difficult because administrative access also went through this switch.
When the network came back, the system had to manage a sizable message flow as gateway connections were re-established and their buffered messages were dequeued.
It was also flooded with alarms and statistics.
The short-term action will be to replace the faulty switch and to create administrative access to each site over dedicated links.
The long-term action will be to improve how the system manages the load after a severe network issue; a sketch of one possible approach is given below.
For that purpose, we went on site to gather traces, which we will analyse.
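To illustrate the direction of the long-term action, here is a minimal sketch of throttling the post-recovery replay with a token bucket, so that messages buffered during the outage are drained at a bounded rate instead of all at once. The names, rates, and structure are hypothetical and do not describe the actual LRC implementation:

    import time
    from collections import deque

    class ReplayThrottle:
        # Token bucket: at most `burst` immediate sends, refilled at `rate_per_sec`.
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def acquire(self) -> None:
            # Refill from elapsed time, then block until one token is available.
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    def drain(buffered: deque, throttle: ReplayThrottle, process) -> None:
        # Dequeue messages buffered during the outage at a bounded rate.
        while buffered:
            throttle.acquire()
            process(buffered.popleft())

    # Example: replay at most 200 messages/second with a burst of 50.
    drain(deque(["msg1", "msg2"]), ReplayThrottle(rate_per_sec=200, burst=50), print)

The same kind of bound could also be applied to the alarm and statistics flows that flooded the system during recovery.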
This section will be updated when we know more.
Actility Support
Related Articles
[RESOLVED] SaaS - EU (Paris) outage on 2019/03/05 - #20190305A
[RESOLVED] SaaS - EU (Paris) outage on 2019/01/15 - #20190115B
[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/07/01 for TWA applications - #20190701A
[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/05/30 for TWA applications - #20190530A
[RESOLVED] - LRC outage for SaaS-EU on 2020-04-25