[RESOLVED] SaaS - EU (Paris) outage on 2019/03/05 - #20190305A
Incident Description
On 3/5/2019 11:30:00 PM GMT, an incident alarm was raised on the Actility EU SaaS platform.
The following components were impacted: TWA, SMP and DX-API.
- Incident Start Time : 3/5/2019 11:30:00 PM GMT
- Service Restoration Time: 3/6/2019 02:10:00 AM GMT
- Service Impact Duration : 160 minutes of severe database performance degradation, with a failure rate above 50% on requests to the services listed below.
- Severe service impact : Production GUI administration, DX-API, OSS API (Downlinks, Provisioning), ThingPark-EXchange.
- Non-impacted services : Device packet uplinks to Application Servers.
Timeline
- [3/5/2019 11:30:00 PM GMT] : Alarm Triggered
- [3/5/2019 11:40:00 PM GMT] : Restart attempts of impacted services
- [3/6/2019 12:40:00 AM GMT] : Identification of the faulty users
- [3/6/2019 12:50:00 AM GMT] : Further investigation of the logs
- [3/6/2019 02:10:00 AM GMT] : Back to normal load
Root Cause Analysis
The load was caused by one of our customers sending a large number of downlinks during this period.
Under this load, the database became overloaded and many requests timed out.
Although the platform is correctly scaled, overload control is very basic, especially at the DX level.
The preventive action will be to add a per-IP request limit at the DX level; the long-term action is to refactor the DX architecture (to be validated by the Product team).
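As an illustration of the per-IP request limit mentioned above, here is a minimal token-bucket sketch in Python. It is not the DX implementation: the class name `PerIpRateLimiter`, the `capacity` and `refill_rate` parameters, and the in-memory bucket store are all assumptions chosen for the example. A production limiter would likely need shared state across DX instances and eviction of idle buckets.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    tokens: float       # tokens currently available
    last_refill: float  # clock reading at the last refill


class PerIpRateLimiter:
    """Hypothetical per-IP token-bucket limiter (illustrative only)."""

    def __init__(self, capacity: int, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = float(capacity)   # maximum burst size per IP
        self.refill_rate = refill_rate    # tokens added per second
        self.clock = clock                # injectable for testing
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_ip: str) -> bool:
        """Return True if the request from client_ip may proceed."""
        now = self.clock()
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            # First request from this IP: start with a full bucket.
            bucket = TokenBucket(tokens=self.capacity, last_refill=now)
            self.buckets[client_ip] = bucket

        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(self.capacity,
                            bucket.tokens + elapsed * self.refill_rate)
        bucket.last_refill = now

        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        # Over the limit: the caller would typically reject the request.
        return False
```

With such a limiter in front of the API, a single client sending a burst of downlink requests would be throttled once its bucket is empty, while requests from other IP addresses continue to be served.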