[RESOLVED]- SaaS - EQX EU (Paris) outage on 2019/03/15 - #20190315A

On 2019/03/15 at 02:10 GMT, an incident occurred on the Equinix network backbone at site PA3. It impacted the ThingPark Wireless GUI, the OSS API (Downlinks, Provisioning) and ThingPark-EXchange.

Current State

Operational service

Incident information and Service impact

Production:
  • Incident Start Time: 2019-03-15 02:10 GMT
  • Service Restoration Time: 2019-03-15 05:00 GMT
  • Service Impact Duration: 2h50 on the impacted services
  • Severe service impact: Production GUI administration, OSS API (Downlinks, Provisioning), ThingPark-EXchange
  • Non-impacted services: Device packet uplinks to Application Servers (AS), Developer Experience (DX)
Pre-production:
  • Incident Start Time: 2019-03-15 02:10 GMT
  • Service Restoration Time: 2019-03-15 08:00 GMT
  • Service Impact Duration: 5h50
  • Severe service impact: Pre-production GUI and OSS API
  • Non-impacted services: Device packet uplinks to Application Servers (AS), Developer Experience (DX)
Please note that the status page (status.thingpark.com) was not reachable at its usual https URL, only over http.
This may be an independent issue and is currently under investigation.
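
For reference, the impact durations listed in this section can be re-derived from the start and restoration times. The short Python sketch below does only that arithmetic; the function name and the timestamp format string are illustrative choices, not part of any ThingPark tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # timestamps are taken as GMT, as in the report


def impact_duration(start: str, restored: str) -> str:
    """Return the elapsed time between two timestamps as 'XhYY'."""
    delta = datetime.strptime(restored, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}"


# Start and restoration times as stated in the report.
production = impact_duration("2019-03-15 02:10", "2019-03-15 05:00")
preproduction = impact_duration("2019-03-15 02:10", "2019-03-15 08:00")
print(production, preproduction)  # 2h50 5h50
```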

Timeline (March 14th-15th)

[2019-03-14 21:00 GMT] : Maintenance operation on the SaaS Equinix network backbone; the Subcontractor estimated only 20 minutes of redundancy loss for Internet access
[2019-03-15 02:01 GMT] : Impact start
[2019-03-15 02:10 GMT] : Incident Start time on Actility side
[2019-03-15 02:31 GMT] : Incident discovered and start time on Subcontractor side
[2019-03-15 02:55 GMT] : PA3 site found unreachable => Actility internal escalation
[2019-03-15 03:20 GMT] : PA3 site temporarily restored
[2019-03-15 03:40 GMT] : PA3 site down again
[2019-03-15 04:39 GMT] : Manual DNS switch from PA3 to PA4
Production service was restored progressively, depending on DNS propagation time (from 0 minutes to ~3 hours)

[2019-03-15 08:00 GMT] : Pre-production manual DNS switch from PA3 to PA4
[2019-03-15 14:30 GMT] : Subcontractor confirmation about PA3 stability.
[2019-03-15 14:50 GMT] : Switch back DNS from PA4 to PA3 
[2019-03-15 14:50 GMT] : End of incident
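
The "0 minutes to ~3 hours" restoration window noted in the timeline is what ordinary DNS caching would produce: a resolver that cached the old record shortly before the switch keeps serving it until its TTL expires, while a resolver with an expired cache sees the new record immediately. The sketch below models this in Python. The 3-hour TTL is an assumption inferred from the upper bound quoted above, not a value stated in this report, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def client_recovery_time(switch: datetime, cached_at: datetime,
                         ttl: timedelta) -> datetime:
    """Earliest time a client behind a caching resolver sees the new record.

    The resolver cached the old record at `cached_at` (before the switch)
    and keeps serving it until `cached_at + ttl`. If that cache entry has
    already expired when the switch happens, the change is seen at once.
    """
    return max(switch, cached_at + ttl)


switch = datetime(2019, 3, 15, 4, 39)  # manual DNS switch time from the timeline
ttl = timedelta(hours=3)               # ASSUMED record TTL, not from the report

# Worst case: the old record was cached one minute before the switch.
late = client_recovery_time(switch, switch - timedelta(minutes=1), ttl)
# Best case: the cached record had already expired before the switch.
early = client_recovery_time(switch, switch - timedelta(hours=4), ttl)

print(late - switch)   # 2:59:00
print(early - switch)  # 0:00:00
```

Under this model, a lower TTL on the service records would shorten the worst-case delay of a manual switch, at the cost of more DNS queries.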

Root Cause Analysis

The Subcontractor had planned a maintenance operation on the Equinix network backbone, with an estimated 20 minutes of redundancy loss for Internet access.
This operation caused site PA3 to become unreachable.

The root cause of this unexpected PA3 unavailability is not yet known. We are waiting for feedback from the Subcontractor and the Equinix Tier 2 team.

On the Actility side, the incident was handled quickly, but the decision to update DNS was slowed down by the instability of site PA3 (PA3 recovered temporarily between 03:20 and 03:40 GMT).

To avoid depending on a manual operation when a site becomes unavailable, it is planned to activate GSLB (Global Server Load Balancing) on Production as soon as possible.
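
As an illustration of the kind of automated decision GSLB makes, the Python sketch below selects the site to advertise from health-probe results, and only changes a site's state after several consecutive probes agree, so that a brief recovery like the one seen on PA3 does not trigger a premature switch back. This is a simplified sketch under stated assumptions, not how any particular GSLB product works: the class name, the priority order and the threshold of 3 probes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverSelector:
    """Pick the site to advertise, with hysteresis against flapping."""
    sites: list          # priority order, e.g. ["PA3", "PA4"] (assumed)
    threshold: int = 3   # consecutive contradicting probes needed (assumed)
    healthy: dict = field(default_factory=dict)
    streak: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every site starts out healthy with no pending state change.
        for site in self.sites:
            self.healthy.setdefault(site, True)
            self.streak.setdefault(site, 0)

    def record_probe(self, site: str, ok: bool) -> None:
        """Feed one health-probe result for `site`."""
        if ok == self.healthy[site]:
            self.streak[site] = 0  # probe agrees with current state
            return
        self.streak[site] += 1     # probe contradicts current state
        if self.streak[site] >= self.threshold:
            self.healthy[site] = ok
            self.streak[site] = 0

    def active_site(self):
        """Return the highest-priority healthy site, or None if all are down."""
        for site in self.sites:
            if self.healthy[site]:
                return site
        return None


selector = FailoverSelector(["PA3", "PA4"])
for _ in range(3):
    selector.record_probe("PA3", False)  # three failed probes in a row
print(selector.active_site())            # PA4
selector.record_probe("PA3", True)       # a single good probe is not enough
print(selector.active_site())            # PA4
```

The threshold trades detection speed for stability: a higher value tolerates more flapping but delays both failover and switch-back.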