[RESOLVED] - SaaS - EQX EU (Paris) outage on 2019/03/15 - #20190315A

On 2019/03/15 at 02:10 GMT, an incident occurred on the Equinix network backbone at site PA3. It impacted the ThingPark Wireless GUI, the OSS API (Downlinks, Provisioning), and ThingPark-EXchange.

Current State

Operational service

Incident information and Service impact

Incident Start Time: 2019-03-15 02:10 GMT
Service Restoration Time: 2019-03-15 05:00 GMT
Service Impact Duration: 2 hours 50 minutes on the impacted services
Severe service impact: Production GUI administration, OSS API (Downlinks, Provisioning), ThingPark-EXchange
Non-impacted services: Device packet uplinks to Application Servers (AS), Developer Experience (DX)

Preproduction

Incident Start Time: 2019-03-15 02:10 GMT
Service Restoration Time: 2019-03-15 08:00 GMT
Service Impact Duration: 5 hours 50 minutes
Severe service impact: Pre-production GUI and OSS API
Non-impacted services: Device packet uplinks to Application Servers (AS), Developer Experience (DX)
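
The impact durations above follow directly from the start and restoration times. As a minimal sketch (the helper name `impact_duration` is hypothetical and only serves this illustration), the arithmetic can be checked like this:

```python
from datetime import datetime

# Timestamp layout used for the GMT times in this report.
FMT = "%Y-%m-%d %H:%M"


def impact_duration(start: str, restored: str) -> str:
    """Return the elapsed time between two GMT timestamps as 'XhYY'."""
    delta = datetime.strptime(restored, FMT) - datetime.strptime(start, FMT)
    hours, remainder = divmod(int(delta.total_seconds()), 3600)
    return f"{hours}h{remainder // 60:02d}"


# Production: 02:10 GMT to 05:00 GMT.
production = impact_duration("2019-03-15 02:10", "2019-03-15 05:00")
# Preproduction: 02:10 GMT to 08:00 GMT.
preproduction = impact_duration("2019-03-15 02:10", "2019-03-15 08:00")
print(production, preproduction)
```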

Please note that the status page (status.thingpark.com) was not accessible through its usual https URL, only through http.
This may be an independent issue and is currently under investigation.

Timeline (March 15th)

[2019-03-14 21:00 GMT] : Maintenance operation on the SaaS Equinix network backbone; the subcontractor estimated only 20 minutes of redundancy loss for Internet access

[2019-03-15 02:01 GMT] : Impact start

[2019-03-15 02:10 GMT] : Incident Start time on Actility side

[2019-03-15 02:31 GMT] : Incident discovered and start time on Subcontractor side

[2019-03-15 02:55 GMT] : PA3 site found unreachable => Actility internal escalation

[2019-03-15 03:20 GMT] : PA3 site temporarily restored

[2019-03-15 03:40 GMT] : PA3 site down again

[2019-03-15 04:39 GMT] : Manual switch of Production DNS from PA3 to PA4

Production service restored for each client as soon as the DNS change propagated to it (from 0 minutes to ~3 hours)
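
To make the propagation window concrete, here is a small sketch that derives the earliest and latest moments at which a client could have picked up the new DNS record. It assumes the "~3 hours" above is an upper bound on how long resolvers kept the old record cached; the function name `propagation_window` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def propagation_window(switch_time: datetime,
                       max_cache: timedelta) -> Tuple[datetime, datetime]:
    """Return (earliest, latest) time at which a client sees the new record.

    Assumption: a client with no cached entry sees the change immediately,
    while a client that cached the old record just before the switch keeps
    it for up to max_cache.
    """
    return switch_time, switch_time + max_cache


# Manual DNS switch at 04:39 GMT, with up to ~3 hours of propagation.
switch = datetime(2019, 3, 15, 4, 39)
earliest, latest = propagation_window(switch, timedelta(hours=3))
print(earliest.strftime("%H:%M"), latest.strftime("%H:%M"))
```

Under that assumption, some clients may have kept resolving to the old address until roughly 07:39 GMT, even though the service itself was reachable again much earlier.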

[2019-03-15 08:00 GMT] : Preproduction manual DNS switch from PA3 to PA4

[2019-03-15 14:30 GMT] : Subcontractor confirmation about PA3 stability.

[2019-03-15 14:50 GMT] : DNS switched back from PA4 to PA3

[2019-03-15 14:50 GMT] : End of incident

Root Cause Analysis

A maintenance operation was planned on the Equinix network backbone by the Subcontractor, who estimated 20 minutes of redundancy loss for Internet access.
This operation caused site PA3 to become unreachable.

The root cause of this unexpected PA3 unavailability is not yet known; feedback from the Subcontractor and the Equinix Tier 2 team is pending.

On the Actility side, the incident was handled quickly, but the decision to update DNS was slowed down by PA3 site instability (PA3 recovered temporarily between 03:20 and 03:40 GMT).

To avoid depending on a manual operation in case of site unavailability, it is planned to activate GSLB (Global Server Load Balancing) on Production as soon as possible.
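
As an illustration of the kind of automated decision that could replace the manual DNS switch, the sketch below picks the serving site from a sequence of health checks on the primary site. It is not the GSLB product or its configuration: the function `choose_site` and the thresholds `fail_after` and `recover_after` are assumptions made for this example. Requiring several consecutive results before switching is one way to cope with a flapping site, such as PA3 recovering briefly during this incident.

```python
from typing import Iterable, List


def choose_site(checks: Iterable[bool],
                primary: str = "PA3",
                backup: str = "PA4",
                fail_after: int = 3,
                recover_after: int = 5) -> List[str]:
    """Return the site that serves traffic after each primary health check.

    Fail over to the backup only after `fail_after` consecutive failed
    checks, and fail back only after `recover_after` consecutive successful
    checks, so that a brief recovery does not trigger a switch.
    """
    active = primary
    failures = successes = 0
    decisions = []
    for healthy in checks:
        if healthy:
            successes += 1
            failures = 0
        else:
            failures += 1
            successes = 0
        if active == primary and failures >= fail_after:
            active = backup
        elif active == backup and successes >= recover_after:
            active = primary
        decisions.append(active)
    return decisions


# Hypothetical check history: outage, short recovery, relapse, stable again.
history = [True, False, False, False, True, True,
           False, True, True, True, True, True]
print(choose_site(history))
```

With these example thresholds, the short recovery in the middle of the history does not move traffic back to the primary site; fail-back only happens once the primary has passed five checks in a row.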

