LRC outage for Equinix SaaS from 2018/04/12 - 2018/05/02

[RESOLVED] - LRC outage for SaaS-EU from 2018/04/12 - 2018/05

Since April 12, 2018, our LRC probe has intermittently detected a lack of traffic: packets coming from gateways were no longer forwarded to the AS.
The initial incidents were detected by the on-call engineer, who gathered enough information to find the root cause.
To limit the impact of these incidents, a script was developed to automatically restart the LRC and recover from the problem.
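
The detect-and-restart behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, not Actility's actual script: the function name `watch_traffic`, the three callbacks, and the thresholds are all hypothetical, and it assumes that a forwarded-packet counter can be polled.

```python
import time
from typing import Callable, Optional


def watch_traffic(
    read_packet_count: Callable[[], int],   # hypothetical: returns forwarded-packet counter
    collect_traces: Callable[[], None],     # hypothetical: gathers debug traces
    restart_service: Callable[[], None],    # hypothetical: restarts the routing service
    stall_checks: int = 3,                  # assumed threshold, not from the report
    interval_s: float = 60.0,               # assumed polling period, not from the report
    max_checks: Optional[int] = None,       # None means poll forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the counter and restart the service when traffic stops growing.

    Returns the number of restarts performed.
    """
    last = read_packet_count()
    stalled = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval_s)
        checks += 1
        current = read_packet_count()
        if current > last:
            stalled = 0          # traffic is flowing again
        else:
            stalled += 1         # no new packets since the previous poll
        last = current
        if stalled >= stall_checks:
            collect_traces()     # keep evidence before the state is lost
            restart_service()
            restarts += 1
            stalled = 0
            last = read_packet_count()  # re-baseline after the restart
    return restarts
```

Collecting traces before the restart keeps the evidence needed for root cause analysis, while the restart itself bounds the impact to a few polling intervals.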

Incident information and Timeline

  • The problem was first found by the on-call engineer on April 12, 2018. Impact from 02:39 AM to 03:29 AM CEST: 50 minutes
  • Actility developed a script which detects the incident and gathers debug traces.
  • An incident occurred on Apr 15, 2018, after which the LRC had to be restarted. Impact from 02:55 AM to 03:12 AM CEST: 17 minutes
  • Actility updated the LRC script so that, on failure, the service was restarted automatically, which reduced the average impact duration to approximately 3 minutes.
  • Incidents after which the LRC was automatically restarted by the script:
    • Wed Apr 18 04:56:46 CEST 2018
    • Sat Apr 21 12:47:46 CEST 2018
    • Mon Apr 23 03:57:46 CEST 2018
    • Fri Apr 27 10:15:47 CEST 2018
    • Fri Apr 27 14:02:46 CEST 2018 
    • Sat Apr 28 04:55:22 CEST 2018  
    • Mon Apr 30 21:31:23 CEST 2018
    • Tue May 01 00:54:18 CEST 2018
    • Thu May  3 21:58:47 CEST 2018
    • Sat May  5 21:14:46 CEST 2018
    • Tue May  8 03:58:46 CEST 2018
    • Thu May 10 02:49:47 CEST 2018
    • Thu May 10 05:12:46 CEST 2018
    • Thu May 10 16:07:46 CEST 2018
    • Fri May 11 05:17:46 CEST 2018
    • Fri May 11 18:46:46 CEST 2018
    • Sun May 13 13:38:46 CEST 2018
    • Mon May 14 04:49:47 CEST 2018
    • Mon May 14 06:50:46 CEST 2018
    • Tue May 15 12:09:47 CEST 2018
    • Tue May 15 15:28:46 CEST 2018
    • Wed May 16 11:28:47 CEST 2018
    • Fri May 18 14:31:46 CEST 2018
    • Sat May 19 03:00:46 CEST 2018
    • Thu May 24 17:52:46 CEST 2018
    • Thu May 24 22:45:46 CEST 2018
    • Sun May 27 07:42:47 CEST 2018
    • Sun May 27 19:28:46 CEST 2018
    • Sun May 27 22:57:46 CEST 2018
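
For anyone analysing the restart timestamps listed above, the following minimal sketch parses them and computes the time between consecutive restarts. The helper names `parse_restart_time` and `gaps_between` are hypothetical. It assumes an English locale for month abbreviations and treats all timestamps as being in the same zone (CEST), so the zone token is ignored.

```python
from datetime import datetime, timedelta
from typing import List


def parse_restart_time(line: str) -> datetime:
    """Parse a line such as 'Wed Apr 18 04:56:46 CEST 2018'.

    split() tolerates the double spaces that appear before single-digit days.
    The zone token is dropped because every entry uses the same zone.
    """
    _weekday, month, day, clock, _zone, year = line.split()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def gaps_between(lines: List[str]) -> List[timedelta]:
    """Return the elapsed time between each pair of consecutive restarts."""
    times = sorted(parse_restart_time(line) for line in lines)
    return [later - earlier for earlier, later in zip(times, times[1:])]


# First three automatic restarts from the timeline above.
sample = [
    "Wed Apr 18 04:56:46 CEST 2018",
    "Sat Apr 21 12:47:46 CEST 2018",
    "Mon Apr 23 03:57:46 CEST 2018",
]
for gap in gaps_between(sample):
    print(gap)
```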

Service Impact

  • Severely Impacted Services: Packet routing service at LRC level. Packets coming from devices and from the AS were no longer routed by the LRC.
  • Low Impacted Services: Not applicable
  • Non-Impacted Services: All services except packet routing at the LRC.

Root Cause Analysis  


A bug was discovered in the LRC component that handles communication with the AS.
This bug causes high CPU consumption, which leads to a packet routing failure at LRC level.
R&D managed to reproduce the issue with the same effects.
The investigation showed that one of the functions degrades and enters an internal loop. The solution was to bypass this function and use a new custom one.
The fix is planned for LRC version 1.10.37, which will be released in the coming weeks, subject to test results.