[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/06/20 for TWA applications - #20190620A


Incident Description  

On 2019-06-20 at 11:50 GMT, the SaaS-EU (Paris) platform experienced a degradation affecting GUI access, the OSS API, and the DX API.

Incident Information and Service impact    

  • Incident Start Time:  2019-06-20  - 11:50 GMT
  • Restoration Time:  2019-06-20 - 12:40 GMT
  • Service(s) Impact Duration: 50 minutes.
  • Severe service impact: OSS (GUI and API), DX.
  • Impacted service: Wlogger (some packets were lost and may be partially recovered manually).
  • Non-impacted services: Network Server (LRC).

Timeline (GMT) 

  • [ 2019-06-20 - 11:50 ]: Monitoring probes detected timeouts on ThingPark GUI/API.
  • [ 2019-06-20 - 11:55 ]: On-call engineer connected to the platform to investigate. The GUI and OSS problems were traced to MongoDB timeouts caused by an index build that was locking the database.
  • [ 2019-06-20 - 12:19 ]: MongoDB reboot, resetting all current operations. TWA and SMP were still overloaded.
  • [ 2019-06-20 - 12:22 ]: Disabled traffic from LRC-1 to TWA.
  • [ 2019-06-20 - 12:25 ]: Increased the open-file limit on the SMP and TWA VMs (see the sketch after this timeline).
  • [ 2019-06-20 - 12:31 ]: Rebooted both TWA instances.
  • [ 2019-06-20 - 12:33 ]: Re-enabled traffic from LRC to TWA.
  • [ 2019-06-20 - 12:40 ]: Status back to nominal; all alarms cleared.
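
The open-file limit on the SMP and TWA VMs is normally raised through the operating system configuration; purely as an illustration of the check involved, the sketch below inspects and raises the per-process limit in Python on Linux, with 65536 as a hypothetical target value rather than the value actually applied.

    import resource

    # Inspect the current soft/hard limits on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft}, hard={hard}")

    # Raise the soft limit for this process, capped by the hard limit.
    # 65536 is a hypothetical target, not the value used on the VMs.
    target = min(65536, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print("new soft limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])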

Live updates were available on the status page.

Root Cause Analysis

The root cause was the addition of a MongoDB index on the platform in background mode. The official MongoDB documentation shows two syntaxes for background index creation, and it appears that only one of them actually builds the index in the background. The syntax used here locked the database during the build, which increased MongoDB load and led to the timeouts and OSS unavailability.
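
For illustration only, the following is a minimal sketch of a background index build using pymongo; the connection string, database, collection, and field names are hypothetical, and this is not the exact command or shell syntax that was run on the platform.

    from pymongo import MongoClient, ASCENDING

    # Hypothetical connection string, database, collection, and field names.
    client = MongoClient("mongodb://localhost:27017")
    collection = client["twa"]["devices"]

    # Request a background build so the collection is not locked for the
    # whole duration of the index creation (pre-4.2 MongoDB behaviour).
    collection.create_index([("devEUI", ASCENDING)], background=True)

Checking the server's current operations while the build runs is one way to confirm that it is actually running in the background rather than holding locks.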

Incident Follow-up

  • Restore lost Wlogger packets before 2019-06-26 17:00 GMT
  • Improve MongoDB load monitoring (see the monitoring sketch below)
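
As a starting point for the monitoring follow-up, the sketch below polls MongoDB lock and connection metrics with pymongo; the connection string, metric selection, and threshold are assumptions, not the platform's actual monitoring configuration.

    from pymongo import MongoClient

    # Hypothetical connection string; adjust for the actual deployment.
    client = MongoClient("mongodb://localhost:27017")

    # serverStatus exposes connection and global-lock metrics that can feed alerts.
    status = client.admin.command("serverStatus")
    print("current connections:", status["connections"]["current"])
    print("global lock queue  :", status["globalLock"]["currentQueue"])

    # currentOp lists in-flight operations, e.g. a long-running index build.
    ops = client.admin.command("currentOp")
    for op in ops.get("inprog", []):
        if op.get("secs_running", 0) > 10:  # hypothetical threshold (seconds)
            print("slow op:", op.get("op"), op.get("ns"), op.get("secs_running"))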

Actility Support