The Matrix cluster is currently unavailable due to a problem with its data storage.

Situation

As of 10/4/13, 4pm:

All our efforts are focused on recovering data from the hard disks.
We have developed a plan, and are now clarifying timelines.
- As we learn more, our plans may need adjusting.
  - For example, what purchase decisions are required by when?

As of 10/4/13, 10:30am:

We tried copying the data from the Matrix disks to ChemIT disks to create a backup which resides independently from your hardware.
However, when we arrived this morning, we found the copy did not complete (176GB of 3.1TB). And worse, we now can't see much of the original data.
We have called in additional expertise to further help characterize the problem, especially now that we can't even see the original data.

As of 10/3/13, 3pm:

Addressing this issue is expected to take days, not hours.

Just copying the data takes a day or more, not counting the work required to diagnose and solve the problem.

We still hope that no data that was on the data storage system has been irretrievably lost.

However, the situation is precarious since two fail-safes have failed.
Since the system can only accommodate a loss of 2 hard drives (of 6), we are now at high risk: 2 of the hard drives seem to have failed, and a third is now showing warning signs.

Status

10/4/13, Friday afternoon:

Creating space for recovered "scheraga" files:

2 server-class 3TB drives authorized for use from Collum cluster.
2 more server-class 3TB drives purchased. (Return ones we bought yesterday?)
Consider: Buy 4-6 more server-class 3TB drives. Timing, if needed?

We have created space for recovered "OS" files and have started that copy.

10/3/13, afternoon: ChemIT has placed an order for a 3TB drive for Matrix.

ChemIT has also ordered, for its own general use, a 4TB consumer drive. This can be used for a short-term backup in this situation, to further decrease the risk of data loss by enabling yet one more copy of the data.

10/3/13, 2:45pm: 3TB hard drive approved by Harold Scheraga, for under $300. ChemIT has placed an order.

10/3/13, noonish: Using RAID 6 and the data on 4 hard disks to reconstitute the data on 2 drives, which test OK separately with a "quick" test.

2% done after more than 3 hours (and no further progress as of ~5pm).

Plan

1) Make a copy of what we can currently see.

A) Copy the vulnerable data before doing anything else.

Provide Yi He a copy of the ~127GB of data we were able to copy the night of 10/3/13. Provide this on a USB drive.

Yi can determine what is included to help inform how researchers need to approach reviewing an incomplete data set.
Currently being copied to a holding location. Then we'll provide yet another copy to Yi.

Get this copy of the data completely off the system.

Any further activity may put the data at more risk, so don't do anything other than what is required to get this done, until this has been done.
Data was not accessible after one of our reboots. Thus, we are trying to avoid rebooting the system while the data is accessible.
This process may take days, not hours.
- It's ~3.1TB of data, with an enormous number of small files.
Confirm that the data copy is complete.
- Q: How will we know we got all the data?
Further duplicate that data on yet other hard drive(s), especially before deleting any original data.
Use ChemIT's hardware and Collum's cluster's hard drives for this temporary storage, as a "loan".
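One possible way to answer "How will we know we got all the data?" is to compare the copy against the source file by file. The sketch below is only an illustration, not the procedure used here: the function names are hypothetical, it skips symbolic links, and it reads every source file a second time, which adds load to a fragile array and may not be acceptable in this situation.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each regular file's path (relative to root) to (size, digest)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # assumption: symbolic links are checked separately
            rel = os.path.relpath(full, root)
            result[rel] = (os.path.getsize(full), file_digest(full))
    return result


def compare_trees(source_root, copy_root):
    """Return (missing, different): files absent from the copy, and files whose size or content differs."""
    source = snapshot(source_root)
    copy = snapshot(copy_root)
    missing = sorted(rel for rel in source if rel not in copy)
    different = sorted(
        rel for rel in source if rel in copy and source[rel] != copy[rel]
    )
    return missing, different
```

A copy would count as complete only when both returned lists are empty. Comparing file counts and total sizes first is a cheaper, weaker check that avoids re-reading the source.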

Plan to create four 6TB storage locations:

Storage device (all ChemIT's)	Hard drive(s) ChemIT's	Hard drive(s) Scheraga's	Borrowed from Collum's cluster	Total storage Confirm: Actual needed?	Purpose	Notes
Synology storage device	3TB + 3TB			6TB	Store ddrescue of "Backup" "Image 1" (read-only)
"Dell 1"			Two 3TB's	6TB	Store ddrescue of "Scheraga" "Image 2" (read-only)
"Dell 2"		Two 3TB's (on order) Arrive ~Tues, 10/8		6TB	Restoration of "Image 1"
"Dell 3"		Two 3TB's (to order: when?)		6TB	Restoration of "Image 2"
"Dell 4"		Two 3TB's (to order: when?)		6TB	To complete new RAID 6 array with brand new disks.	Necessary?

Server-class 3TB drives cost $250 each (vs. $140 each).
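To help with the open purchase questions, the arithmetic can be sketched as follows. It assumes that the drives still to be bought are the two-drive sets listed for "Dell 2", "Dell 3" and "Dell 4" in the table above, and it uses the two per-drive prices quoted in the text. The names are hypothetical.

```python
SERVER_CLASS_PRICE = 250  # dollars per server-class 3TB drive (from the text)
ALTERNATIVE_PRICE = 140   # dollars per drive, the "vs." figure in the text

# Assumption: drives that must be purchased, per planned storage location.
drives_to_buy = {"Dell 2": 2, "Dell 3": 2, "Dell 4": 2}


def purchase_cost(drive_count, unit_price):
    """Total cost in dollars for drive_count drives at unit_price each."""
    return drive_count * unit_price


total_drives = sum(drives_to_buy.values())
print(total_drives)                                     # 6
print(purchase_cost(total_drives, SERVER_CLASS_PRICE))  # 1500
print(purchase_cost(total_drives, ALTERNATIVE_PRICE))   # 840
```

If "Dell 4" turns out not to be necessary, the count drops to 4 drives, or $1000 at the server-class price.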

B) Copy the OS data, as a precaution.

This process takes hours. The total is 160GB, which matters because "dd" copies the whole device rather than only the space in use; the actual data is ~77GB.

Storage device (all ChemIT's)	Hard drive(s) ChemIT's	Hard drive(s) Scheraga's	Total storage Confirm: Actual needed?	Purpose	Notes
Synology storage device	250GB			Just in case.	Not bootable
USB "toaster"	250GB			Bootable
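Why the 160GB figure matters more than the ~77GB of actual data can be shown with a rough time estimate. "dd" copies a device block by block, including unused space, so the copy time follows the device size. The throughput below is a made-up value for illustration only; the real rate on this system was not recorded here.

```python
ASSUMED_MB_PER_S = 30  # hypothetical sustained throughput, not a measured value


def copy_hours(size_gb, throughput_mb_per_s):
    """Hours needed to copy size_gb gigabytes (decimal units) at a steady rate."""
    seconds = size_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


whole_device = copy_hours(160, ASSUMED_MB_PER_S)  # what "dd" has to read
used_data = copy_hours(77, ASSUMED_MB_PER_S)      # what a file-level copy would read

print(round(whole_device, 1))  # 1.5
print(round(used_data, 1))     # 0.7
```

Under this assumed rate, the block-level copy takes roughly twice as long as copying only the data in use.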

2) Work the problem

Install new hard drive. Reconstitute RAID 6 with this drive (removing one of the suspected drives).

Further analyze one of the two suspect hard drives to try to isolate the source of the data corruption. Is it the drive? Or should we instead be looking at the hardware controller?

May need to purchase a second 3TB hard drive. Or, make another investment to get everything working properly again.

3) Debrief and consider investments to reduce future risk

How can the problem be prevented? Is that worth doing?*

What can be done ahead of time to reduce down-time following such a failure in the future? Is that worth doing?*

What can be done to reduce the risk of losing the data due to local failures, such as this one? Is that worth doing?*

* If not worth investing in prevention and risk reduction, clarify the risks being taken and adjust associated service expectations so as not to put an undue strain on IT support resources.

Notes

On 10/2/13 (Wed), Matrix became unavailable.

System has 8 drives on a single hardware controller.

2 (150GB) hard drives for the OS, RAID 1 ("6" and "7").
6 (3 TB) hard drives for the data storage ("0" - "5").

Drives for OS:

Disk number	6	7
Notes	OK	OK

Status: All seems fine
RAID 1 over two 150GB drives. Thus, have access to just under 150GB.

Drives for "Data":

Disk number	0	1	2	3	4	5
Notes	degraded?	OK	degraded?	ECC_Error	OK	OK

Status: 2-3 drives are suspect. Something wrong with hardware controller instead?!
RAID 6 over 3TB drives. Thus, have access to 12 TB (of the 18TB total).
Two partitions, 5.4TB (of 6TB theoretical) each:
- One with the data itself (using 3.1TB of 5.4TB space)
- The other with versioned copies of the data (full: all 5.4TB of space in use)
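The capacity and risk figures above can be reproduced with a short calculation. This is a sketch with hypothetical function names. It treats every drive whose note is not "OK" as suspect, and it assumes that the 5.4TB partition size reflects reporting in binary units (TiB), which is one plausible explanation and has not been confirmed on this system.

```python
RAID6_PARITY_DRIVES = 2  # RAID 6 spends two drives' worth of space on parity


def raid6_usable_tb(drive_count, drive_tb):
    """Usable capacity of a RAID 6 array of equal-sized drives."""
    if drive_count < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drive_count - RAID6_PARITY_DRIVES) * drive_tb


def raid6_margin(suspect_count):
    """Further drive losses the array could absorb; negative means data is at risk."""
    return RAID6_PARITY_DRIVES - suspect_count


def tb_to_tib(terabytes):
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return terabytes * 10**12 / 2**40


# Drive notes as listed in the "Data" table above.
data_drives = {0: "degraded?", 1: "OK", 2: "degraded?", 3: "ECC_Error", 4: "OK", 5: "OK"}
suspect = [disk for disk, note in data_drives.items() if note != "OK"]

print(raid6_usable_tb(6, 3))       # 12 (TB usable, of 18TB raw)
print(suspect)                     # [0, 2, 3]
print(raid6_margin(len(suspect)))  # -1
print(round(tb_to_tib(6), 2))      # 5.46
```

A margin of 0 (two bad drives) means any further failure loses data; a margin of -1 (three bad drives) means the array alone can no longer be relied on to reconstruct everything.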

Fedora 13: Old OS. This means some tools we want to use that were created for use with a contemporary OS (some of which we've successfully used elsewhere) may not work. Ex:

iSCSI to more quickly move data to a different hard disk array, and to give us more flexible options to pull the vulnerable data.
Tool to better monitor the hardware disk controller, without requiring system reboots.
- Reboots are to be avoided when we can see the data because the data was not consistently visible after every reboot.

Scheraga - Data corruption problem on Matrix cluster
