[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/06/20 for TWA applications - #20190620A
Incident Description
On 2019-06-20 at 11:50 GMT, SaaS-EU experienced an issue affecting access to the GUI, the OSS API and the DX API.
- Incident Start Time: 2019-06-20 - 11:50 GMT
- Restoration Time: 2019-06-20 - 12:40 GMT
- Service(s) Impact Duration: 43 minutes.
- Severe service impact: OSS (GUI and API), DX.
- Partial service impact: Wlogger. Some packets were lost and may be partially recovered manually.
- Non-impacted services: Network Server (LRC).
Timeline (GMT)
- [ 2019-06-20 - 11:50 ]: Monitoring probes detected timeouts on ThingPark GUI/API
- [ 2019-06-20 - 11:55 ]: On-call engineer connected to the platform to investigate the issue.
- The GUI and OSS problems were traced to MongoDB timeouts caused by an index build that locked the database
- [ 2019-06-20 - 12:19 ]: MongoDB rebooted, which reset all current operations
- TWA and SMP were still overloaded
- [ 2019-06-20 - 12:22 ]: Disabled traffic from LRC-1 to TWA
- [ 2019-06-20 - 12:25 ]: Updated the number of open files allowed on the SMP and TWA VMs
- [ 2019-06-20 - 12:31 ]: Rebooted both TWA VMs
- [ 2019-06-20 - 12:33 ]: Re-enabled traffic from LRC to TWA
- [ 2019-06-20 - 12:40 ]: Status back to nominal - All alarms cleared.
Root Cause Analysis
The root cause was the addition of a MongoDB index on the platform, which was intended to run in background mode. The official MongoDB documentation shows two syntaxes for background mode, but it appears that only one of them actually builds the index in the background. The index build therefore locked the database, which increased MongoDB load and led to timeouts and OSS unavailability.
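As an illustration only, the sketch below builds the document for MongoDB's createIndexes command with background set as an index option rather than among the key fields. The report does not say which two syntaxes were involved, so the misplacement rejected here is an assumed example of how the flag could end up not being applied. The helper name, the collection name and the field names are hypothetical.

```python
def build_create_indexes_command(collection, keys, background=True):
    """Return a createIndexes command document for a single index.

    keys is a list of (field, direction) pairs, e.g. [("devEUI", 1)].
    The helper and all names passed to it are hypothetical.
    """
    if not keys:
        raise ValueError("at least one key field is required")
    for field, direction in keys:
        # Assumption: a key field called "background" is a misplaced option,
        # not a real document field.
        if field == "background":
            raise ValueError("'background' is an index option, not a key field")
        if direction not in (1, -1):
            raise ValueError(f"unsupported direction for {field!r}: {direction!r}")

    # Mirrors MongoDB's default index naming, e.g. "devEUI_1_timestamp_-1".
    name = "_".join(f"{field}_{direction}" for field, direction in keys)
    index_spec = {"key": dict(keys), "name": name, "background": background}
    return {"createIndexes": collection, "indexes": [index_spec]}


command = build_create_indexes_command("packets", [("devEUI", 1), ("timestamp", -1)])
print(command)
```

With PyMongo, a document of this shape could be passed to Database.command. Note that MongoDB 4.2 and later deprecate the background option and ignore it if specified.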
Incident Followup
- Restore lost Wlogger packets before 2019-06-26 17:00 GMT
- Improve MongoDB load monitoring
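One possible starting point for the monitoring follow-up is to flag long-running operations in the output of MongoDB's currentOp. The sketch below assumes the inprog list has already been fetched. The 30-second threshold, the function name and the sample entries are illustrative assumptions, not values taken from the platform.

```python
def find_long_running_ops(inprog, threshold_secs=30):
    """Return active operations running for at least threshold_secs, longest first."""
    slow = [
        op for op in inprog
        if op.get("active") and op.get("secs_running", 0) >= threshold_secs
    ]
    return sorted(slow, key=lambda op: op["secs_running"], reverse=True)


# Hypothetical sample shaped like the "inprog" entries returned by currentOp.
sample_inprog = [
    {"opid": 101, "active": True, "op": "command", "ns": "twa.$cmd", "secs_running": 1740},
    {"opid": 102, "active": True, "op": "query", "ns": "twa.packets", "secs_running": 2},
    {"opid": 103, "active": False, "op": "none", "ns": ""},
]

for op in find_long_running_ops(sample_inprog):
    # Each flagged operation could raise an alarm before clients start timing out.
    print(op["opid"], op["ns"], op["secs_running"])
```

Only active operations are considered, so idle connections do not trigger false alarms. A lower threshold would catch lock contention earlier at the cost of more noise.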
Actility Support