We experienced three incidents in November 2013 that affected our Web Hosting services. All three occurred at our Gallo Manor Data Centre. The incidents were as follows:
15 November - A Cloud storage firmware update resulted in slow or unresponsive Cloud Servers
26 November - A DOS (Denial of Service) attack resulted in a loss of connectivity, and network issues delayed the restoration of normal connectivity and services
27 November - A Broadcast Storm within the Data Centre resulted in slow connectivity and VM countermeasures being partially deployed
In the interests of transparency, our intention is to give a detailed description of these incidents and how they were resolved.
Friday 15 November 2013
On the vendor's recommendation, a firmware update was run on one of the storage infrastructures in our Cloud environment to avoid possible errors or complications. Afrihost’s team took every precaution to ensure that downtime would be minimised or avoided. However, the update was followed by unusually high load on the CPU controller. A call was logged with the vendor immediately to troubleshoot the firmware, and the vendor's developers made several suggestions, all of which were implemented.
Ultimately, to minimise the impact to our clients, our team implemented a workaround which neutralised the problem. We are still awaiting a full incident report from the vendor, but the hardware is currently stable and performing optimally.
Tuesday 26 November 2013
A “DOS” (Denial of Service) attack on a host within our Data Centre was staged in the early hours of the morning, resulting in a flood of UDP (User Datagram Protocol) packets on our data centre network. DOS attacks overwhelm a target with connection requests in order to render it inaccessible. In effect, thousands of requests per second flood the network for a single server or service, leaving general traffic slow or unresponsive for other servers within the data centre cluster. DOS attacks are often launched using “zombie machines in a botnet” (malware-infected servers communicating with each other in a coordinated attack), which makes it difficult to identify or neutralise the attackers. Ultimately, there is no way for firewalls to distinguish legitimate traffic from DOS traffic without specific intervention (i.e. removing either the target or the source from the network).
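To illustrate how a flood aimed at a single server stands out from normal traffic, here is a minimal Python sketch. It assumes packet records are available as (timestamp in seconds, destination address) pairs. The function name, the threshold and the sample addresses are hypothetical, and this is not a description of the tooling used during the incident.

```python
from collections import Counter


def find_flood_targets(packets, threshold_pps):
    """Return destinations that received more than threshold_pps
    packets within any single one-second bucket.

    packets: iterable of (timestamp_seconds, destination) tuples.
    threshold_pps: hypothetical packets-per-second limit.
    """
    per_second = Counter()
    for timestamp, destination in packets:
        # Group packets by whole second and by destination.
        per_second[(int(timestamp), destination)] += 1
    return {
        destination
        for (_, destination), count in per_second.items()
        if count > threshold_pps
    }


# Example: 5000 packets to one host within half a second, one packet to another.
sample = [(i * 0.0001, "10.0.0.5") for i in range(5000)] + [(0.5, "10.0.0.6")]
flooded = find_flood_targets(sample, threshold_pps=1000)
```

A count like this can point to the target of a flood, but it cannot say which individual packets are legitimate, which is why removing the target or the source from the network is the usual intervention.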
Our team investigated, with the assistance of MTN, to isolate the source of the traffic. Even after the affected server was removed from the network, the load on the core router remained very high, causing extremely slow or unresponsive traffic in and out of the data centre. We reported intermittent packet drops to MTN, who immediately teleconferenced with Cisco for advice. In light of the severity of the impact on clients, the incident was also urgently escalated to MTN management.
After further consultation with MTN’s senior network engineers and our team onsite, a problem on the core router was identified and normal traffic to and from the data centre was restored.
It is important to note that no servers were down during this time. The symptoms experienced were connectivity errors caused by the network issue.
Wednesday 27 November 2013
Between 1am and 2am, a “Broadcast Storm” occurred on the storage network within our virtual environment. A Broadcast Storm is very different to a DOS attack, although it may ultimately yield similar results. It essentially means that the storage network in question (which is separate from the virtual machine network) started sending out massive amounts of traffic from within the data centre, resulting in security countermeasures being deployed. One of these measures, Storm Control, automatically switches off storage ports on the virtual environment to safeguard valuable client data from corruption.
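The behaviour of a countermeasure like Storm Control can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the switch vendor's implementation: the class name, the per-interval frame limit and the manual re-enable step are hypothetical, and real switches typically measure broadcast traffic as a share of bandwidth or in packets per second.

```python
class StormControlPort:
    """Toy model of a switch port protected by a broadcast limit."""

    def __init__(self, name, max_broadcasts_per_interval):
        self.name = name
        self.limit = max_broadcasts_per_interval  # hypothetical threshold
        self.enabled = True
        self.count = 0

    def receive_broadcast(self):
        """Count one broadcast frame; return True if it was forwarded."""
        if not self.enabled:
            return False
        self.count += 1
        if self.count > self.limit:
            # Limit exceeded: switch the port off to protect what is behind it.
            self.enabled = False
            return False
        return True

    def end_interval(self):
        """Start a new measurement interval."""
        self.count = 0

    def re_enable(self):
        """Bring the port back up (assumed here to be a manual step)."""
        self.count = 0
        self.enabled = True


port = StormControlPort("storage-1", max_broadcasts_per_interval=3)
forwarded = [port.receive_broadcast() for _ in range(5)]
```

In this model the port stays off after the limit is exceeded until someone brings it back up, which is why anything that depends on the port remains cut off in the meantime.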
When our team was alerted to the issue, they attended to the problem immediately, and storage was brought back up within 20 minutes. However, some of the redundancy measures had already activated. Some ESX hosts that were seen as isolated on the network began shutting down VMs (virtual machines), which is how VMware moves affected VMs to alternate hosts as part of an HA (High Availability) failover. This occurred within a matter of minutes. However, since the storage ports were still offline, the VMs were unable to restart on the new hosts. Of the approximately 700 virtual servers affected, 65% powered back up automatically without incident.
The remaining 35% were unable to restart. When these servers were migrated to new hosts, they could not reach the storage ports, which were still offline, and so were not able to come online again. By that point, however, they had already been assigned to the new hosts by the HA failover.
Resolving this entailed manually reviewing all offline VM servers, reassigning them to the original hosts and restoring them individually, which required all available human resources. The entire process took approximately 4 hours to complete.
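As a rough indication of scale, the figures above can be worked through in Python. The inputs are the approximate numbers quoted in this report, so the results are estimates only, and the per-server figure is an aggregate rate across the whole team rather than the time any one engineer spent on a server.

```python
affected_vms = 700          # approximate figure from this report
auto_restart_percent = 65   # share that powered back up automatically
manual_hours = 4            # approximate duration of the manual restore

# Integer arithmetic avoids floating-point rounding on the percentages.
auto_restarted = affected_vms * auto_restart_percent // 100
manually_restored = affected_vms - auto_restarted

# Aggregate pace of the manual restore, in minutes per server.
minutes_per_vm = manual_hours * 60 / manually_restored

print(auto_restarted, manually_restored, round(minutes_per_vm, 2))
```

On these numbers, roughly 455 servers recovered on their own and roughly 245 were restored by hand, at an overall pace of about one server per minute.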
Communications were sent via SMS from Afrihost management to alert affected clients to the issue and potential impact to their services.