[RESOLVED] SaaS - EU outage on 2019-05-22 - #20190522A

Incident Description

On 2019-05-22 at 03:08 AM (UTC), our SaaS-EU datacenter became unreachable.
From the outside world, it looked like a complete outage.

Incident information and Service impact

Incident Start Time: 2019-05-22 - 03:08 AM.
Restoration Time:

LRC traffic: 2019-05-22 - 10:00 AM.
GUI/API: 2019-05-22 - 06:45 PM.

Service(s) Impact Duration:

LRC traffic: 412 minutes (6 hours, 52 minutes).
GUI/API: 937 minutes (15 hours, 37 minutes).
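
The impact durations can be cross-checked against the start and restoration times listed above. The following is a minimal sketch, assuming all timestamps are in the same time zone and use this report's "YYYY-MM-DD - HH:MM AM/PM" layout; the helper names `impact_minutes` and `as_hours_minutes` are hypothetical and exist only for this illustration.

```python
from datetime import datetime

# Assumed layout, matching timestamps such as "2019-05-22 - 03:08 AM".
FMT = "%Y-%m-%d - %I:%M %p"


def impact_minutes(start: str, restored: str) -> int:
    """Return the whole minutes elapsed between two report timestamps."""
    delta = datetime.strptime(restored, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count the way this report writes durations."""
    hours, rest = divmod(minutes, 60)
    return f"{minutes} minutes ({hours} hours, {rest} minutes)"


start = "2019-05-22 - 03:08 AM"
restorations = {
    "LRC traffic": "2019-05-22 - 10:00 AM",
    "GUI/API": "2019-05-22 - 06:45 PM",
}

for service, restored in restorations.items():
    print(f"{service}: {as_hours_minutes(impact_minutes(start, restored))}")
```

Running it prints one line per service in the same "N minutes (H hours, M minutes)" form used in this report.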

Severe service impact: OSS (GUI and API) and LRC were not responsive. From the outside world, it looked like a complete outage.
Non-impacted services: N/A.

Timeline (UTC)

[2019-05-22 - 03:08 AM] : EQX availability alert is raised.
[2019-05-22 - 03:35 AM] : Operations team confirms that the service is failing.
[2019-05-22 - 03:39 AM] : Infrastructure team escalates to providers: it is a network issue.
[2019-05-22 - 08:06 AM] : The issue is still ongoing. We will have to put a recovery plan in place. This will take a while.
[2019-05-22 - 10:00 AM] : Recovery is happening. LRC Network Server is back. GUI is coming back up again as well.
[2019-05-22 - 11:00 AM] : Degraded mode since approx. 1 hour ago, until further notice.
[2019-05-22 - 01:45 PM] : Degraded mode on a one-site platform. Traffic is effective on the LRC Network Server. GUI and API are unstable.
[2019-05-22 - 06:45 PM] : Site B recovered and online. LRC traffic and GUI recovered; status under observation.

This timeline was updated live on our status page.

Root Cause Analysis

The main root cause is certainly a faulty switch, but why it impacted both sites so strongly is still under investigation.
The fault was not obvious, and the diagnosis was very difficult because admin access also went through this switch.

When the network came back, the system had to handle a sizable message flow as gateway connections were re-established and their buffered messages were dequeued.
It was also flooded with alarms and statistics.

The short-term action will be to replace the faulty switch and to create an access path for each site using dedicated links.
The long-term action will be to improve how the system manages the load after a severe network issue.
For that purpose, we went on site to gather traces, and we will analyse them.
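
This report does not say how load handling will be improved. Purely as an illustration of the kind of mechanism that can smooth a post-outage backlog, here is a minimal token-bucket sketch that drains buffered messages at a bounded rate. It does not describe the actual platform: `TokenBucket`, `drain`, and the rate and capacity values are all hypothetical.

```python
import time
from collections import deque


class TokenBucket:
    """Allow roughly `rate` messages per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so a small burst passes at once
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def drain(backlog: deque, bucket: TokenBucket, handle) -> int:
    """Handle queued messages while tokens remain; return how many were handled."""
    handled = 0
    while backlog and bucket.try_acquire():
        handle(backlog.popleft())
        handled += 1
    return handled


# Hypothetical usage: 10 buffered messages, 2 messages/s, bursts of 3.
backlog = deque(range(10))
bucket = TokenBucket(rate=2.0, capacity=3.0)
processed = []
drain(backlog, bucket, processed.append)  # handles the initial burst only
```

Messages left in `backlog` stay queued until a later call to `drain`, so the downstream components see a capped rate instead of the full reconnect surge.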

This section will be updated when we know more.

Actility Support
