[RESOLVED] SaaS - EU (Paris) outage on 2019/01/15 - #20190115B

On 2019-01-15T13:25Z, following planned hardware maintenance and incident #20190115A, site B went offline.
Actility technical teams investigated and resolved the issue.

Current state

Resolved.

Incident information and Service impact

Incident Start Time: 2019-01-15T13:25Z.
Service Restoration Time: 2019-01-15T15:20Z.
Service(s) Impact Duration: 1 hour 55 minutes on the Network Server.
Severe service impact: OSS (GUI and API) is not responsive. Provisioning and downlinks from external AS fail. Provisioning through DX-API works.
Minor service impact: Uplink/downlink packets are lost for gateways not connected to the primary LRC (site A).
No service impact: End-to-end uplink communication through the primary LRC (site A).
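
The impact duration above can be checked from the two timestamps. The following is a minimal sketch only; the function name `impact_duration` and the output format are assumptions made for illustration, not part of any Actility tooling.

```python
from datetime import datetime, timezone

def impact_duration(start_iso: str, end_iso: str) -> str:
    """Return the elapsed time between two UTC timestamps such as '2019-01-15T13:25Z'."""
    fmt = "%Y-%m-%dT%H:%MZ"  # the trailing 'Z' is matched literally, so UTC is set explicitly below
    start = datetime.strptime(start_iso, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_iso, fmt).replace(tzinfo=timezone.utc)
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"

# Incident start and service restoration times from this report
print(impact_duration("2019-01-15T13:25Z", "2019-01-15T15:20Z"))  # prints "1 h 55 min"
```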

Timeline (2019-01-15, UTC)

The status is updated live on the status page.

[13:25] - Starting planned hardware maintenance on ESX01 virtualisation platform

[13:29] - NetOps team notices some VMs are down (services are unstable)

[13:34] - NetOps team confirms all VMs are up & running on the previously upgraded ESX02 (cf. #20190115A) but the GUIs are still down (diagnosis in progress on ESX02)

[13:38] - NetOps/IS&T teams escalate to the commissioning subcontractor regarding a potential network issue

[13:47] - ESX01 back online; the GUIs are still down

[14:01] - NetOps prepares the DNS switchover (from site B to site A) but does not trigger it yet

[14:11] - Commissioning subcontractor acknowledges a network issue: network debugging starts, with on-site actions by IS&T

[14:17] - Network issue confirmed: the whole wiring needs to be tested

[14:29] - Network issue solved but the GUIs are still down

[14:33] - DNS switch triggered

[14:40] - Rebooting ESX02

[14:46] - VMs are partially back on ESX01 and still booting up

[14:46] - ESX02 is up & running: load-balancing VMs between hosts

[15:13] - Services are up & running; testing the network failover

[15:20] - Incident closed. DNS switchback will be initiated in another three hours.

Root Cause Analysis

The incident was caused by a wiring fault. Network cables (carrying traffic and synchronisation between VMs) were not labelled, which made human error during the hardware maintenance more likely. A rework of the entire network and power wiring will be planned. First, an audit will be performed by the end of this month (2019-01); depending on its results, several on-site maintenance windows will follow.

Actility Support

