[RESOLVED] - LRC outage for SaaS-EU on 2020-04-25

On 2020-04-25, our LRC end-to-end probe detected a traffic issue. The alarm was automatically cleared within five minutes: the LRC process had been restarted after catastrophic memory exhaustion.

Incident information and Timeline (UTC)

  • [summary] The problem was first detected by the on-call engineer on 2020-04-25T13:07Z.
    • Service ran in degraded mode for approximately one hour, until full LRC failure.
    • Impact lasted from 12:00 to 13:09.

  • A major customer on the SaaS platform experienced a complete ~2-hour backhaul outage of their LRR gateways, on their side, from 09:15 to 11:45.
    • [2020-04-25T11:45Z] When their backhaul was restored, all of their LRR gateways reconnected to the LRCs and began dequeueing the late packets buffered during the network unavailability.
    • [2020-04-25T11:50Z] This caused a traffic surge on both the LRC and their external AS, making them less responsive. Automated rate-limiting measures were triggered quickly in an attempt to protect the rest of the traffic, but service degraded further until abnormal memory consumption levels were reached.
    • [2020-04-25T13:09Z] The LRC process was restarted at 13:09, resuming normal operation.
  • Actility is working on a solution to prevent this packet-processing issue should the same situation happen again.
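
The report does not describe the internals of the automated rate-limiting mentioned in the timeline. As a hedged illustration only, the surge-protection pattern involved (admitting a bounded rate of dequeued backlog packets and dropping the excess, rather than letting an unbounded backlog exhaust memory) can be sketched with a hypothetical token bucket; the names `TokenBucket` and `drain_backlog` are illustrative, not Actility's actual implementation:

```python
import time

class TokenBucket:
    """Token-bucket limiter: admits on average `rate` packets per second,
    with bursts of up to `capacity` packets."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Replenish tokens for the time elapsed since the last check.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # traffic beyond the budget is rejected

def drain_backlog(backlog, limiter):
    """Push buffered packets through the limiter, counting drops instead
    of letting a reconnect surge consume unbounded memory downstream."""
    admitted, dropped = [], 0
    for pkt in backlog:
        if limiter.allow():
            admitted.append(pkt)
        else:
            dropped += 1
    return admitted, dropped
```

In this sketch, a gateway reconnecting after a two-hour outage and replaying its buffer would see only the burst budget admitted immediately, with the remainder shed, trading packet loss for process survival.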

Service Impact

  • Severely impacted services: packet routing at the LRC level. Packets coming from devices and from the AS were no longer routed by the LRC.

Root Cause Analysis   

This section will be updated as we learn more about the LRC's behaviour under these circumstances.