[RESOLVED] - SaaS-EU (Paris) GUI access degradation on 2019/05/30 for TWA applications - #20190530A
Incident Description
On 2019-05-30 at 03:29 AM (CEST), the SaaS-EU platform experienced intermittent failures when accessing the GUI, using OSS, and using the DX API.
Because the issue was intermittent and no server malfunctioned or stopped, no switchover between sites was triggered.
- Incident Start Time: 2019-05-30 - 03:29 AM (CEST).
- Restoration Time: 2019-05-30 - 04:04 AM (CEST).
- Service(s) Impact Duration: 10 minutes.
- Severe service impact: OSS (GUI and API), DX.
- Non-impacted services: Network Server (LRC).
Timeline (CEST)
- [ 2019-05-30 - 03:29 AM ] : Monitoring probes detected timeouts on monitoring requests.
- [ 2019-05-30 - 03:35 AM ] : On-call engineer connected to the platform to investigate the issue.
- [ 2019-05-30 - 03:39 AM ] : New alarms on DX API server (A-DX-STATELESS-02) related to memory usage.
- [ 2019-05-30 - 03:47 AM ] : DX service was restarted and manually executed probes returned OK; the alarms cleared at 03:48 AM.
- [ 2019-05-30 - 03:49 AM ] : Investigation continued to check for a load issue, but no evidence was found.
- [ 2019-05-30 - 03:56 AM ] : New set of alarms on TWA, SMP and DX. Manually executed probes on a-SMP-01 failed (KO).
- [ 2019-05-30 - 04:01 AM ] : Restart of SMP on a-smp-01.
- [ 2019-05-30 - 04:01 AM ] : All alarms cleared.
Root Cause Analysis
The main root cause is an overload on the SMP and Proxy servers.
We are currently investigating the cause of this overload by:
- analyzing the logs gathered from both servers.
- analyzing the traffic captured by our traffic monitoring probes.
We updated a limit on the WildFly server to better handle the load. This action was performed on May 31st.
Actility Support