SaaS - EU (Paris) outage on 2019/03/05 - 20190305A

[RESOLVED] SaaS - EU (Paris) outage on 2019/03/05 - #20190305A

Incident Description

On 3/5/2019 at 11:30:00 PM GMT, an incident alarm was raised on the Actility EU SaaS platform.
The following components were impacted: TWA, SMP and DX-API.

Incident information and Service impact

  • Incident Start Time : 3/5/2019 11:30:00 PM GMT
  • Service Restoration Time: 3/6/2019 02:10:00 AM GMT
  • Service Impact Duration : 160 minutes of severe database performance degradation, with a failure rate above 50% on requests to the services listed below.
  • Severe service impact : Production GUI administration, DX-API, OSS API (Downlinks, Provisioning), ThingPark-EXchange.
  • Non-impacted services : Device packet uplinks to Application Servers.

Timeline

  • [3/5/2019 11:30:00 PM GMT] : Alarm Triggered
  • [3/5/2019 11:40:00 PM GMT] : Restart attempts of impacted services
  • [3/6/2019 12:40:00 AM GMT] : Identification of the faulty users
  • [3/6/2019 12:50:00 AM GMT] : Further investigation of the logs
  • [3/6/2019 02:10:00 AM GMT] : Back to normal load

Root Cause Analysis

The load was caused by one of our customers sending a large number of downlinks during this period.
Under this load, the database became overloaded and many requests timed out.

Although the platform is correctly scaled, overload control is very basic, especially at the DX level.
The preventive action will be to add a per-IP request limit at the DX level. The long-term action is to refactor the DX architecture (to be validated by the Product team).
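
For illustration only, a per-IP request limit of the kind described above could be sketched as a token bucket keyed by client IP. This is not the actual DX implementation: the class name, the rate and burst values, and the in-memory storage are all assumptions made for the example.

```python
import time


class PerIpRateLimiter:
    """Token-bucket limiter keyed by client IP (hypothetical, in-memory only)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)  # tokens refilled per second, per IP
        self.burst = float(burst)        # maximum tokens an IP can accumulate
        self.clock = clock               # injectable clock, useful for testing
        self._buckets = {}               # ip -> (tokens, time of last update)

    def allow(self, ip):
        """Return True if the request from `ip` may proceed, False if it should be rejected."""
        now = self.clock()
        # A previously unseen IP starts with a full bucket.
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    # Assumed values: 5 requests per second sustained, bursts of up to 10.
    limiter = PerIpRateLimiter(rate_per_sec=5, burst=10)
    accepted = sum(limiter.allow("203.0.113.7") for _ in range(50))
    print(f"accepted {accepted} of 50 back-to-back requests")
```

With such a limit, a single client sending a burst of downlink requests would be rejected once its own bucket is empty (for example with an HTTP 429 response), while requests from other IPs would continue to be served. A production version would also need to evict idle entries and, if several DX instances run in parallel, share state between them.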