[RESOLVED] - LRC outage for SaaS-EU on 2020-04-25
On 2020-04-25, our LRC end-to-end probe detected a traffic issue. The alarm cleared automatically within five minutes, once the LRC process had been restarted following a catastrophic memory failure.
- [summary] The problem was first found by the on-call engineer on 2020-04-25T13:07Z. 
 
- The service ran in degraded mode for approximately one hour, until the LRC failed completely.
 
- Service was impacted from 12:00 to 13:09 UTC.
 
- A large customer on the SaaS platform experienced a complete backhaul outage on their side, affecting their LRR gateways for about 2.5 hours, from 09:15 to 11:45 UTC.
 
- [2020-04-25T11:45Z] When the customer's backhaul was restored, all their LRR gateways reconnected to the LRCs and started dequeuing the late packets they had buffered while the network was unavailable.
 
- [2020-04-25T11:50Z] This caused a traffic surge on both the LRC and the customer's external AS, making both less responsive. Automated rate-limiting measures kicked in quickly to protect the rest of the traffic, but service became increasingly degraded until memory consumption reached abnormal levels.
 
- [2020-04-25T13:09Z] The LRC process was restarted, and normal operation resumed.
- Actility is working on a solution to prevent this packet-processing issue if the same situation occurs again.
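To illustrate the kind of rate limiting mentioned in the timeline, here is a minimal token-bucket sketch in Python. It is not the LRC's actual mechanism, which this report does not describe: the `TokenBucket` class, the `replay_backlog` helper, and the rate and capacity values are all hypothetical, chosen only to show how a burst of buffered packets arriving at once is mostly deferred while steady traffic keeps flowing.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical token-bucket limiter (illustrative, not the LRC's code)."""

    rate: float          # tokens refilled per second (assumed value)
    capacity: float      # maximum burst size in packets (assumed value)
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if one packet may be processed at time `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def replay_backlog(bucket: TokenBucket, arrival_times: list[float]) -> tuple[int, int]:
    """Count packets accepted and deferred for the given arrival timestamps."""
    accepted = deferred = 0
    for t in arrival_times:
        if bucket.allow(t):
            accepted += 1
        else:
            deferred += 1
    return accepted, deferred
```

For example, with `TokenBucket(rate=10.0, capacity=5.0, tokens=5.0)`, replaying 100 packets that all arrive at `t = 0.0` accepts only the first 5 and defers the remaining 95. A limiter of this kind bounds the processing rate, but it does not by itself bound the memory used to hold deferred packets.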
Service Impact
- Severely Impacted Services: Packet routing services at the LRC level. Packets coming from devices and from AS were no longer routed by the LRC.
Root Cause Analysis   
This section will be updated as we learn more about the LRC's behaviour under such circumstances.