[RESOLVED] - LRC outage for SaaS-EU on 2020-04-25
On 2020-04-25, our LRC end-to-end probe detected a traffic issue. The alarm cleared automatically within five minutes, once the LRC process had been restarted after a catastrophic memory failure.
- [summary] The problem was first found by the on-call engineer on 2020-04-25T13:07Z.
- Degraded mode for approximately one hour, until full LRC failure.
- Impact from 12:00 to 13:09 (UTC).
- A large customer on the SaaS platform experienced a complete backhaul outage of their LRR gateways, on their side, for about 2.5 hours, from 09:15 to 11:45 (UTC).
- [2020-04-25T11:45Z] When their backhaul was restored, all their LRR gateways reconnected to the LRCs and started dequeuing the late packets they had buffered while their network was unavailable.
- [2020-04-25T11:50Z] This caused a traffic surge on both the LRC and the customer's external AS, making both less responsive. Automated rate-limiting measures kicked in quickly to protect the rest of the traffic, but service degraded steadily until memory consumption reached abnormal levels.
- [2020-04-25T13:09Z] The LRC process was then restarted, and normal operation resumed.
- Actility is working on a solution to prevent this packet processing issue if the same situation occurs again.
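For readers unfamiliar with the mechanism in the timeline above, the following is a minimal, generic sketch of rate limiting a reconnect surge while keeping memory bounded. It is not Actility's implementation or planned fix: the names `TokenBucket`, `BoundedBacklog` and `handle_surge`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import time
from collections import deque


class TokenBucket:
    """Allow at most `rate` packets per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class BoundedBacklog:
    """Hold rate-limited packets, capping memory by evicting the oldest."""

    def __init__(self, max_packets):
        self.queue = deque(maxlen=max_packets)
        self.dropped = 0

    def push(self, packet):
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1  # the deque evicts its oldest entry on append
        self.queue.append(packet)


def handle_surge(packets, bucket, backlog, now):
    """Forward what the bucket allows at time `now`; backlog the rest."""
    forwarded = []
    for packet in packets:
        if bucket.allow(now):
            forwarded.append(packet)
        else:
            backlog.push(packet)
    return forwarded


# Hypothetical surge: 20 late packets arrive at once after a reconnect.
bucket = TokenBucket(rate=10, capacity=5, now=0.0)
backlog = BoundedBacklog(max_packets=3)
forwarded = handle_surge(list(range(20)), bucket, backlog, now=0.0)
```

In this sketch the burst is forwarded only up to the bucket capacity, and the backlog never grows beyond `max_packets`, so a long surge costs dropped late packets rather than unbounded memory. Whether dropping late packets is acceptable depends on the deployment and is an assumption here.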
Service Impact
- Severely Impacted Services: Packet routing services at the LRC level. Packets coming from devices and AS were no longer routed by the LRC.
Root Cause Analysis
This section will be updated as soon as we know more about the LRC's behaviour under such circumstances.