[RESOLVED] SaaS - EU (Paris) outage on 2019/01/15 - #20190115B

On 2019-01-15T13:25Z, during planned hardware maintenance and following incident #20190115A, site B went offline.
Actility technical teams worked on this issue until full service restoration.

Current state


Resolved.

Incident information and Service impact



  • Incident Start Time: 2019-01-15T13:25Z.
  • Service Restoration Time: 2019-01-15T15:20Z. 
  • Service(s) Impact Duration: 1 hour 55 minutes on the Network Server
  • Severe service impact: OSS (GUI and API) was not responsive; provisioning and downlinks from external Application Servers (AS) failed. Provisioning through DX-API kept working (a minimal responsiveness probe is sketched after this list). 
  • Limited service impact: missing uplink/downlink packets for gateways not connected to the primary LRC (site A).
  • No service impact: end-to-end uplink communication through the primary LRC (site A).
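
Note: the probe below is an illustrative sketch only, not an Actility tool. The URL is a placeholder; operators monitoring their own tenant would substitute the address of their OSS instance. A single HTTP GET with a timeout (Python, standard library only) is enough to tell an unresponsive OSS GUI/API apart from an ordinary error response.

    #!/usr/bin/env python3
    """Minimal OSS responsiveness probe (illustrative sketch only).
    The base URL is a placeholder -- substitute your own OSS address."""
    import urllib.request
    import urllib.error

    OSS_URL = "https://oss.example-operator.net/"  # hypothetical address
    TIMEOUT_S = 10  # an unresponsive GUI/API usually shows up as a timeout

    def probe(url: str, timeout: float = TIMEOUT_S) -> str:
        """Issue one HTTP GET and summarise the outcome as a short string."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return f"responsive (HTTP {resp.status})"
        except urllib.error.HTTPError as exc:
            return f"reachable but erroring (HTTP {exc.code})"
        except OSError as exc:  # covers URLError, timeouts, refused connections
            return f"not responsive ({exc})"

    if __name__ == "__main__":
        print(OSS_URL, "->", probe(OSS_URL))

The standard-library urllib is used so the check can run from any host without extra dependencies.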

Timeline (2019-01-15, UTC)


The status was updated live on the status page throughout the incident.
[13:25] - Starting planned hardware maintenance on ESX01 virtualisation platform
[13:29] - NetOps team notices that some VMs are down (services are unstable)
[13:34] - NetOps team confirms all VMs are up and running on the previously upgraded ESX02 (cf. #20190115A), but the GUIs are still down (diagnosis in progress on ESX02)
[13:38] - NetOps/IS&T teams escalate to Commissioning subcontractor regarding a potential network issue
[13:47] - ESX01 back online; GUIs are still down
[14:01] - NetOps prepares the DNS switchover (from site B to site A) but does not trigger it yet
[14:11] - Commissioning subcontractor acknowledges a network issue: starting network debug with on-site actions (IS&T)
[14:17] - Network issue confirmed: the entire wiring needs to be tested
[14:29] - Network issue solved, but GUIs are still down
[14:33] - DNS switchover triggered (a resolution-check sketch follows this timeline)
[14:40] - Rebooting ESX02
[14:46] - VMs are partially back on ESX01 and still booting up
[14:46] - ESX02 is up & running: load-balancing VM between hosts
[15:13] - Services are up and running; testing the network failover
[15:20] - Incident closed. DNS switchback (to site B) will be initiated approximately three hours later.
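
For reference, the DNS switchover at 14:33 (and the later switchback) can be verified from the client side with a short resolution check such as the sketch below. The hostname and site addresses are placeholders, not actual Actility values; resolvers may also keep returning the previous record until its TTL expires.

    #!/usr/bin/env python3
    """Sketch of a client-side DNS switchover check (site B -> site A).
    Hostname and addresses are placeholders; adapt them to your deployment."""
    import socket

    SERVICE_HOST = "iot.example-operator.net"  # hypothetical service FQDN
    SITE_A_ADDRS = {"192.0.2.10"}              # hypothetical site A (primary LRC) address
    SITE_B_ADDRS = {"198.51.100.10"}           # hypothetical site B address

    def resolved_addrs(host: str) -> set:
        """Return the set of addresses the local resolver currently returns."""
        return {info[4][0] for info in socket.getaddrinfo(host, None)}

    if __name__ == "__main__":
        addrs = resolved_addrs(SERVICE_HOST)
        if addrs & SITE_A_ADDRS and not addrs & SITE_B_ADDRS:
            print(f"{SERVICE_HOST} resolves to site A only: {sorted(addrs)}")
        elif addrs & SITE_B_ADDRS:
            print(f"{SERVICE_HOST} still resolves to site B: {sorted(addrs)}")
        else:
            print(f"{SERVICE_HOST} resolves to unexpected addresses: {sorted(addrs)}")

Running the same check after the scheduled switchback confirms that the record points at site B again.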

Root Cause Analysis



The incident was due to a wiring fault: labels were missing on the network cables carrying traffic and synchronisation between VMs, which made human error during hardware maintenance more likely. A rework of the entire network and power wiring will be planned: an audit will first be performed by the end of this month (2019-01), and, depending on its results, several on-site maintenance windows will follow.

Actility Support