We experienced an incident at our Gallo Manor Data Centre between 20 and 25 January 2014 affecting our Cloud Hosting services.
Our intention is to give a detailed description of the incident in question and how it was resolved.
Sunday 19 January
An Afrihost senior engineer escalated the decreased storage performance to the storage vendors at approximately 22h00.
At approximately midnight a fibre cable was replaced, as it was believed to be the cause of the increased latency to and from the storage volume.
The corrective measure was believed to be a success as the storage performance had normalised.
Monday 20 January
At the start of business, Afrihost engineers found the storage performance to be inconsistent and the call was escalated to the storage vendors.
A further cloud internal network change was made in an effort to resolve the inconsistent storage performance.
The affected storage performance appeared stable and was monitored closely.
Wednesday 22 January
At approximately 06h00 an Afrihost engineer reported decreased storage performance and the open call was again escalated to our storage vendors.
High load on the disk controllers was believed to be the cause of the decreased storage performance. The affected volume was moved to the secondary active controller and the primary controller was rebooted.
Unfortunately this did not resolve the issue, as the problem appeared to move with the affected volume.
The cloud servers on the affected volume experienced reduced disk performance, which manifested as slow response times.
In an effort to decrease load on the storage volume, Afrihost engineers began migrating cloud servers to unaffected volumes through the night.
Thursday 23 January
An automatic trigger took the affected storage file system offline to protect the file system.
In order to bring the affected cloud servers back online, the volume required a file system check which was estimated to take upwards of 50 hours to complete.
In an attempt to restore client services as soon as possible, the affected volume was mounted in Read Only mode while Afrihost engineers continued to migrate affected cloud servers off the volume.
Early Thursday evening, the storage vendors found a workaround to bring the file system up in Read/Write mode without performing a file system check.
The workaround was applied and all cloud servers were brought back online. Afrihost engineers checked all affected cloud servers to confirm all services were restored.
Friday 24 January
At approximately 01h30 the affected cloud environment was believed to be stable as all reports indicated disk performance had normalised.
The call remained open with the storage vendor in anticipation of a permanent resolution.
Although all services were restored, storage performance on the affected volume decreased through most of the day.
Additional storage capacity was introduced to the cloud environment to spread and balance the workload.
At approximately 14h30, storage performance on the affected volume had degraded severely.
After thoroughly assessing all avenues, all affected cloud servers were gracefully shut down so that the remaining migrations to unaffected volumes could be completed as quickly as possible.
Saturday 25 January
Our engineers monitored the cloud migrations through the night and resumed all services on Saturday morning.
Afrihost engineers continued to migrate all cloud servers off the affected volumes to prevent further outages, and the affected volumes are being closely monitored for reduced performance.
Affected cloud servers were migrated to unaffected storage volumes, and additional capacity was introduced to the affected cloud environment.
Root cause
The vendor's diagnosis indicates the root cause to be poorly performing disks.