SaaS-EU (Paris) GUI Access degradation on 2019/07/01 for TWA applications

[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/07/01 for TWA applications - #20190701A




Incident Description  

On 2019-07-01 at 04:35 AM (GMT), the SaaS-EU platform experienced degraded access to the GUI, the OSS API, and the DX API.
Because the issue was intermittent and no server malfunctioned or stopped, no switch-over between sites was triggered.

Incident information and Service impact

  • Incident Start Time: 2019-07-01 - 04:35 AM (GMT).
  • Restoration Time: 2019-07-01 - 05:06 AM (GMT). 
  • Service(s) Impact Duration: 31 minutes.
  • Severe service impact: OSS (GUI and API), DX. 
  • Non-impacted services: Network Server (LRC).
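
The impact duration above follows directly from the start and restoration times. A minimal Python sketch of that arithmetic is shown here; the function name and the timestamp format string are illustrative assumptions, not part of any Actility tooling.

```python
from datetime import datetime, timezone

# Assumed timestamp layout for this sketch: "YYYY-MM-DD HH:MM AM/PM", in GMT.
FMT = "%Y-%m-%d %I:%M %p"

def impact_minutes(start, end):
    """Return the whole minutes elapsed between two GMT timestamps."""
    s = datetime.strptime(start, FMT).replace(tzinfo=timezone.utc)
    e = datetime.strptime(end, FMT).replace(tzinfo=timezone.utc)
    return int((e - s).total_seconds() // 60)

# Incident start time and restoration time taken from the report.
duration = impact_minutes("2019-07-01 04:35 AM", "2019-07-01 05:06 AM")
print(duration)  # 31
```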

Timeline (GMT) 

  • 2019-07-01 - 04:35 AM : Monitoring probes detected a connection issue on the GUI.
  • 2019-07-01 - 04:36 AM : On-call engineer acknowledged the alarm.
  • 2019-07-01 - 04:36 AM : New alarms on DX API server (B-DXAPI-E2E Roundtrip).
  • 2019-07-01 - 04:40 AM : New alarms on DX API server (A-DXAPI-E2E Roundtrip). 
  • 2019-07-01 - 04:51 AM : New alarm on SMP server (SMP FD COUNT). Investigations pointed to SMP.
  • 2019-07-01 - 04:56 AM : DX servers were re-started. The issue remained. 
  • 2019-07-01 - 05:01 AM : Restart of SMP (A then B). 
  • 2019-07-01 - 05:06 AM : All alarms cleared.  

We apologize for not having posted an article on the status portal this time.
As part of our quality improvement, we will update our on-call procedure so that the status portal is updated before we start our investigations.

Root Cause Analysis

The main root cause is a load issue on the A-SMP server.
We are currently investigating the cause of this overload by:
  1. analyzing the logs we gathered from both DX servers,
  2. analyzing the traffic captured by our traffic monitoring probes.
Our current hypothesis is that extra load from DX-API to SMP caused this overload.
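
The SMP FD COUNT alarm in the timeline was the signal that pointed to SMP. Assuming "FD COUNT" refers to the number of open file descriptors of a process (the report does not say so explicitly), a check of that kind could be sketched in Python as follows. The function names, the 80% ratio, and the limit of 1024 are hypothetical and do not describe the actual monitoring probes.

```python
import os

def count_open_fds(pid="self", proc_root="/proc"):
    """Count entries under <proc_root>/<pid>/fd (Linux procfs layout only)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))

def fd_alarm(fd_count, limit, ratio=0.8):
    """Return True when fd_count reaches the given fraction of the limit.

    The 0.8 default is an illustrative threshold, not a documented value.
    """
    return fd_count >= ratio * limit

# Hypothetical example: 900 open descriptors against a limit of 1024.
alarm_raised = fd_alarm(900, 1024)
```

A process that keeps accumulating connections under extra load would cross such a threshold before it exhausts its descriptors, which is consistent with the alarm appearing while the servers were still running.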

Actility Support