[RESOLVED] - OVH (our datacenter) outage on 2017/11/09 - #201701109A

Since 07:43 AM (CET) on 2017/11/09, our datacenter at OVH has been unreachable, and this impacts a large part of Actility services.
Actility services remain accessible from certain locations/networks. The issues are located on OVH global routers (outside of our infrastructure).

Actility services impacted are:
The ThingPark Wireless (SaaS offer) service is NOT affected by this outage, because it is hosted in a different datacenter.


We will keep you informed about the incident resolution as soon as we get information from OVH.

Actility Support



Timeline:



Additional information can also be found on the OVH Twitter account and on the OVH support site.



Root Cause Analysis (RCA):

OVH experienced several severe issues at two different locations: Strasbourg (eastern France) and Roubaix (northern France). The RCA provided by OVH can be found below.


Roubaix:
We experienced an incident on the optical network which connects our Roubaix (RBX) site with 6 of the 33 points of presence (POPs) on our network: Paris (TH2 and GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU). The Roubaix site is connected via 6 fibre optic cables to these 6 POPs: 2x RBX<>BRU, 2x RBX<>LDN, 2x RBX<>Paris (1x RBX<>TH2 and 1x RBX<>GSW). These 6 fibre optic cables are connected to a system of optical nodes, which means each fibre optic cable can carry 80x 100 Gbps. For each 100G connected to the routers, we use two optical paths which are in distinct geographic locations. If any fibre optic link is cut, the system reconfigures within 50 ms and all the links stay UP.
To connect RBX to our POPs, we have 4.4 Tbps of capacity (44x 100G): 12x 100G to Paris, 8x 100G to London, 2x 100G to Brussels, 8x 100G to Amsterdam, 10x 100G to Frankfurt, 2x 100G to the GRA DC and 2x 100G to the SBG DC.
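The capacity figures above can be cross-checked with a quick sketch; the per-destination link counts below are taken directly from the OVH report, and the script itself is purely illustrative:

```python
# Illustrative sanity check of the RBX uplink capacity quoted in the report.
# Each entry is the number of 100 Gbps links to that destination.
links_100g = {
    "Paris": 12,
    "London": 8,
    "Brussels": 2,
    "Amsterdam": 8,
    "Frankfurt": 10,
    "GRA DC": 2,
    "SBG DC": 2,
}

total_links = sum(links_100g.values())   # number of 100G links
total_tbps = total_links * 100 / 1000    # 100 Gbps per link, converted to Tbps

print(total_links)  # 44
print(total_tbps)   # 4.4
```

This matches the "4.4 Tbps capacity, 44x 100G" stated in the RCA, and shows why losing all of them at once pointed to a common cause rather than independent fibre cuts.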

At 8:01, all 44x 100G links were lost in one go. Given the redundancy we have in place, the root of the problem could not be the simultaneous physical cut of 6 optical fibres. We could not perform a remote diagnostic of the chassis because the management interfaces were not working. We had to intervene directly in the routing rooms themselves to sort out the chassis: disconnect the cables between the chassis, restart the system, and finally run diagnostics with the equipment manufacturer. Attempts to reboot the system took a long time because each chassis needs 10 to 12 minutes to boot. This is the main reason the incident lasted so long.

Diagnostic: all the interface cards that we use, ncs2k-400g-lk9 and ncs2k-200g-cklc, went into "standby" mode. This could have been due to a loss of configuration. We therefore recovered the backup and reset the configuration, which allowed the system to reconfigure all the interface cards. The 100G links in the routers came back up on their own, and the RBX connection to the 6 POPs was restored at 10:34.
There is clearly a software bug in the optical equipment. The configuration database is saved 3 times and copied to 2 monitoring cards. Despite all these safeguards, the database disappeared. We will work with the OEM to find the source of the problem and help fix the bug. We do not doubt the equipment manufacturer, even though this type of bug is particularly critical. Uptime is a matter of design that must take every eventuality into account, including when nothing else works. OVH must be even more paranoid than it already is in every system that it designs.

Strasbourg:
This morning at 7:23 am, we had a major outage at our Strasbourg site (SBG): a power outage that left three datacenters without power for 3.5 hours. SBG1, SBG2 and SBG4 were impacted. This is probably the worst-case scenario that could have happened to us.

The SBG site is powered by a 20 kV power line consisting of 2 cables, each delivering 10 MVA. The 2 cables work together and are connected to the same source and the same circuit breaker at ELD (Strasbourg Electricity Networks). This morning, one of the two cables was damaged and the circuit breaker cut power to the datacenter.

The SBG site is designed to operate on generators without a time limit. For SBG1 and SBG4, we have set up a first backup system of 2 generators of 2 MVA each, configured in N+1 at 20 kV. For SBG2, we have set up 3 generators of 1.4 MVA each in an N+1 configuration. In the event of an external power failure, the high-voltage cells are automatically reconfigured by a motorised failover system. In less than 30 seconds, the SBG1, SBG2 and SBG4 datacenters can have power restored at 20 kV. To make this switch-over without cutting power to the servers, we have Uninterruptible Power Supplies (UPS) in place that can maintain power for up to 8 minutes.
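The timing margin in this design can be sketched with a small, purely illustrative check; the function name is hypothetical, and the two durations are the figures quoted in the report (up to 8 minutes of UPS runtime versus a sub-30-second automatic failover):

```python
# Back-of-the-envelope check that the UPS runtime covers the automatic
# 20 kV failover window described in the report. Values from the RCA;
# ups_covers_failover is a hypothetical helper for illustration only.
UPS_RUNTIME_S = 8 * 60   # UPS can hold the load for up to 8 minutes
FAILOVER_S = 30          # motorised failover completes in under 30 seconds

def ups_covers_failover(ups_runtime_s: float, failover_s: float) -> bool:
    """True if the UPS can bridge the gap until generator power is available."""
    return ups_runtime_s >= failover_s

print(ups_covers_failover(UPS_RUNTIME_S, FAILOVER_S))  # True
```

In the nominal case the margin is comfortable (8 minutes versus 30 seconds); on 2017/11/09, however, the failover command was never issued, so the actual gap far exceeded the UPS runtime and the site lost power.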

This morning, the motorised failover system did not work as expected. The command to start the backup generators was not given by the NSM (motorised normal/emergency switch), which is provided by the supplier of the 20 kV high-voltage cells. We are in contact with the manufacturer/supplier to understand the origin of this issue. However, this is a defect that should have been detected during the periodic fault simulation tests on the external source. SBG's latest backup recovery test was at the end of May 2017. During that test, we powered SBG from the generators alone for 8 hours without any issues, and every month we test the backup generators with no load. Despite everything, this was not enough to avoid today's outage.

Around 10 am, we managed to switch the cells manually and started powering the datacenter from the generators again. We asked ELD to disconnect the faulty cable from the high-voltage cells and switch the circuit breaker back on with only 1 of the 2 cables, limiting us to 10 MVA. This was carried out by ELD, and power was restored at approximately 10:30 am. SBG's routers were back online from 10:58 am onwards.

Since then, we have been working on restarting services. Restoring power allows the servers to be restarted, but the services running on those servers must also be brought back up. That is why each service has been coming back gradually since 10:30 am. Our monitoring system gives us the list of servers that have successfully started up and those that still have a problem, and we intervene on each of the latter to identify and solve the problem preventing it from restarting.

At 7:50 am, we set up a crisis unit in RBX, where we centralised the information and actions of all the teams involved. A truck from RBX was loaded with spare parts for SBG and arrived at its destination around 5:30 pm. To help our local teams, we sent teams from the LIM datacenter located in Germany and personnel from the RBX datacenter, all of whom have been mobilised on site since 4 pm. Currently, more than 50 technicians are working at SBG to bring all services back online.

Ultimately, the reason for such a failure is that SBG's power grid inherited all the design flaws resulting from the limited ambitions initially set for that location.



Solution and/or improvement

In the coming weeks, we will study new kinds of architecture to handle this kind of situation. Because these two critical issues were located on the datacenter provider side, we will probably look for another kind of geographical redundancy, with at least two different hosting providers.