SaaS-EU GUI Access degradation on 2019/05/30 for TWA applications

[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/05/30 for TWA applications - #20190530A


Incident Description  

On 2019-05-30 at 03:29 AM (CEST), the SaaS-EU platform experienced intermittent failures when accessing the GUI, using OSS, and using the DX API.
Because the issue was intermittent and no server malfunctioned or stopped, there was no switchover between sites.

Incident information and Service impact

  • Incident Start Time: 2019-05-30 - 03:29 AM (CEST).
  • Restoration Time: 2019-05-30 - 04:04 AM (CEST).
  • Service(s) Impact Duration: 10 minutes.
  • Severe service impact: OSS (GUI and API), DX.
  • Non-impacted services: Network Server (LRC).

Timeline (CEST) 

  • 2019-05-30 - 03:29 AM: Monitoring probes detected timeouts on monitoring requests.
  • 2019-05-30 - 03:35 AM: On-call engineer connected to the platform to investigate the issue.
  • 2019-05-30 - 03:39 AM: New alarms on DX API server (A-DX-STATELESS-02) related to memory usage.
  • 2019-05-30 - 03:47 AM: DX service was restarted and manually executed probes were OK; this cleared the alarms at 03:48 AM.
  • 2019-05-30 - 03:49 AM: Investigations continued to check whether there was a load issue, but no evidence was found.
  • 2019-05-30 - 03:56 AM: New set of alarms on TWA, SMP and DX. Probes executed manually on a-SMP-01 were KO.
  • 2019-05-30 - 04:01 AM: Restart of SMP on a-smp-01.
  • 2019-05-30 - 04:01 AM: All alarms cleared.


This timeline was updated live on our status page.

Root Cause Analysis

The main root cause is a load issue on SMP and Proxy.
We are currently investigating the cause of this overload by:
  1. analyzing the logs gathered from both servers;
  2. analyzing the traffic captured by our traffic monitoring probes.


We updated a limit in the WildFly server to better handle the load. This action was performed on May 31st.

Actility Support