[RESOLVED] - SaaS-EU (Paris) GUI Access degradation on 2019/07/01 for TWA applications - #20190701A
Incident Description
On 2019-07-01 at 04:35 AM (GMT), users of the SaaS-EU platform experienced issues when accessing the GUI and when using the OSS API and the DX API.
Because the issue was intermittent and no server malfunctioned or stopped, no switch-over between sites was triggered.
- Incident Start Time: 2019-07-01 - 04:35 AM (GMT).
- Restoration Time: 2019-07-01 - 05:06 AM (GMT).
- Service(s) Impact Duration: 31 minutes.
- Severe service impact: OSS (GUI and API), DX.
- Non-impacted services: Network Server (LRC).
Timeline (GMT)
- [ 2019-07-01 - 04:35 AM ] : Monitoring probes detected a connection issue on the GUI.
- [ 2019-07-01 - 04:36 AM ] : On-call engineer acknowledged the alarm.
- [ 2019-07-01 - 04:36 AM ] : New alarms on DX API server (B-DXAPI-E2E Roundtrip).
- [ 2019-07-01 - 04:40 AM ] : New alarms on DX API server (A-DXAPI-E2E Roundtrip).
- [ 2019-07-01 - 04:51 AM ] : New alarm on SMP server (SMP FD COUNT). Investigations pointed to SMP.
- [ 2019-07-01 - 04:56 AM ] : DX servers were re-started. The issue remained.
- [ 2019-07-01 - 05:01 AM ] : Restart of SMP (A then B).
- [ 2019-07-01 - 05:06 AM ] : All alarms cleared.
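For readers who want to check the elapsed times, the timeline above can be turned into offsets from the first alarm. The following is only a minimal sketch: the timestamps are copied from the timeline, the labels are shortened, and the names `TIMELINE` and `minutes_since_start` are illustrative and not part of any Actility tooling.

```python
from datetime import datetime

# Timestamps (GMT) copied from the timeline; labels are shortened for display.
TIMELINE = [
    ("2019-07-01 04:35", "Monitoring probes detected connection issue on GUI"),
    ("2019-07-01 04:36", "On-call engineer acknowledged the alarm"),
    ("2019-07-01 04:51", "SMP FD COUNT alarm"),
    ("2019-07-01 04:56", "DX servers restarted"),
    ("2019-07-01 05:01", "Restart of SMP (A then B)"),
    ("2019-07-01 05:06", "All alarms cleared"),
]


def minutes_since_start(timeline):
    """Return (minutes elapsed since the first event, label) for each event."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    return [
        (int((datetime.strptime(ts, fmt) - start).total_seconds() // 60), label)
        for ts, label in timeline
    ]


if __name__ == "__main__":
    for offset, label in minutes_since_start(TIMELINE):
        print(f"+{offset:2d} min  {label}")
```

The last offset matches the 31-minute impact duration reported in the incident description.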
We apologize for not having posted an article on the status portal this time.
As part of our quality improvement, we will update our on-call procedure so that
status.thingpark.com is updated before we start our investigations.
Root Cause Analysis
The main root cause is a load issue on the A-SMP server.
We are currently investigating the cause of this overload by:
- analyzing the logs gathered from both DX servers,
- analyzing the traffic captured by our traffic monitoring probes.
We suspect that extra load from the DX API to SMP caused this overload.
Actility Support