The Matrix cluster is currently unavailable due to a problem with its data storage.

12/2013: EZ-Backup data:

Situation

See the recovery project's punch list.

Update

11/15/13, Friday

=> Researchers to confirm they have all their data from their clean-up. We recognize this may require running some test jobs. Do this as soon as possible!

Deadline for getting older files back is:


ChemIT's hard drive testing status.

NOTE: Testing takes ~24 hours per drive using drive diagnostic tools (example commands appear after the table).

Hard drive number | Size  | Purpose | Test status | Result, notes
0                 | 3TB   | data    | N/A         | Physically broken connector. File system suspect from initial failure.
1                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
2                 | 3TB   | data    | PASSED      | File system suspect from initial failure. Drive passed testing and zeroing.
3                 | 3TB   | data    | FAILED      | Reported ECC error during initial recovery/failure. Drive being replaced via Seagate ASAP; replacement may take a while as Seagate is having supply issues.
4                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
5                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
6                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.
7                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.

OS = Operating System
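The drive tools and test steps aren't spelled out above. As a hedged illustration of why each drive ties up roughly a day, one common combination is the drive's built-in SMART extended self-test followed by a full zero-fill; device names like /dev/sdX below are placeholders, not the actual devices used.

    # Start the drive's built-in extended (long) self-test; several hours on a 3TB disk:
    smartctl -t long /dev/sdX

    # When it finishes, review the self-test log and overall health verdict:
    smartctl -l selftest /dev/sdX
    smartctl -H /dev/sdX

    # Zero-fill a drive that passes (destroys all data on /dev/sdX):
    dd if=/dev/zero of=/dev/sdX bs=1M

The zero-fill doubles as a whole-surface write test: any unwritable sector shows up as an I/O error.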


As of 10/15/13, Tues: Hard drive testing in full swing (see above table).

As of 10/11/13, Fri: Oliver met with Harold and Yi to review next steps. Backup discussion planned for next week.

As of 10/8/13, 10am: ChemIT staff briefed Yi He in person.

As of 10/7/13, 5pm: Oliver briefed Yi He in person.

As of 10/7/13, 10:30am: Oliver and Michael Hint briefed Harold and Yi in person.

As of 10/4/13, 3:30pm: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/4/13, 11am: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/3/13, 4pm:

As of 10/3/13, 10:30am:

As of 10/2/13, 3pm:

Addressing this issue is expected to take days, not hours.

We still hope that no data on the data storage system has been irretrievably lost.

Status

As of 10/8/13 (Tues), 11am:

As of 10/4/13, 4pm:

As of 10/4/13, AM:

10/4/13, Friday afternoon:

We have created space for recovered "OS" files and have started that copy. (dd of 160GB; ~77GB of actual data)
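The "dd of 160GB" is a raw, block-for-block copy of the whole OS drive, so it transfers the full 160GB even though only ~77GB of it is actual data. A minimal sketch, with hypothetical device and file names:

    # Raw image of the 160GB OS drive into a file on the recovery storage:
    dd if=/dev/sdX of=/recovery/matrix-os.img bs=1M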

We are working to create space for recovered "scheraga" files:

10/3/13, afternoon: ChemIT has placed an order for 3TB for Matrix.

10/3/13, 2:45pm: 3TB hard drive approved by Harold Scheraga, for under $300. ChemIT has placed an order.

10/3/13, noonish: Using the data on 4 of the hard disks, plus RAID 6 parity, to reconstitute the data on the 2 suspect drives (each of which tests OK separately with a "quick" test).
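For context: RAID 6 keeps two independent parity blocks per stripe, so the contents of any two members can be recomputed from the remaining members. Matrix's array sits on a hardware RAID controller, but the same principle expressed in Linux software RAID terms (device names hypothetical) looks roughly like this:

    # Start a 6-member RAID 6 array with two members missing; the missing
    # data is recomputed on the fly from the 4 good members plus parity:
    mdadm --assemble --run /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    cat /proc/mdstat    # shows the array running degraded (4 of 6 devices)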

Plan

1) Make a copy of what we can currently see.

A) Copy the vulnerable data before doing anything else.

Provide Yi He a copy of the ~127GB of data we were able to copy the night of 10/3/13. Provide this on a USB drive.

Get this copy of the data completely off the system.
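To move the ~127GB onto the USB drive and off the system, something along these lines would do it; the source and mount-point paths are placeholders, not the actual ones used:

    # Copy the recovered files to the mounted USB drive, preserving
    # permissions, timestamps, and hard links:
    rsync -aHv /recovery/copy-2013-10-03/ /mnt/usb/
    sync    # flush writes before unplugging the drive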

Plan to create four 6TB storage locations:

Storage device (all ChemIT's) | Hard drive(s): ChemIT's | Hard drive(s): Scheraga's | Borrowed from Collum's cluster | Total storage (confirm: actual needed?) | Purpose | Notes
Synology storage device | 3TB + 3TB | | | 6TB | Store ddrescue of "Backup": "Image 1" (read-only) |
"Dell 1" | | | Two 3TB's | 6TB | Store ddrescue of "Scheraga": "Image 2" (read-only) |
"Dell 2" | | Two 3TB's (on order; arrive ~Tues, 10/8) | | 6TB | Restoration of "Image 1" |
"Dell 3" | | Two 3TB's (to order: when?) | | 6TB | Restoration of "Image 2" |
"Dell 4" | | Two 3TB's (to order: when?) | | 6TB | To complete new RAID 6 array with brand new disks | Necessary?
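The ddrescue images in the table ("Image 1", "Image 2") are resumable, read-only copies of the failing volumes. A hedged sketch of how such an image is typically made (device names and paths are placeholders):

    # Pass 1: copy everything readable quickly, skipping problem areas;
    # the map file records progress so later runs can resume:
    ddrescue -n /dev/sdX /mnt/synology/image1.img /mnt/synology/image1.map

    # Pass 2: go back and retry the problem areas a few times:
    ddrescue -r3 /dev/sdX /mnt/synology/image1.img /mnt/synology/image1.map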

B) Copy the OS data, as a precaution.

Storage device (all ChemIT's) | Hard drive(s): ChemIT's | Hard drive(s): Scheraga's | Total storage (confirm: actual needed?) | Purpose | Notes
Synology storage device | 250GB | | | Just in case. | Not bootable
USB "toaster" | 250GB | | | | Bootable

2) Work the problem

Install new hard drive. Reconstitute RAID 6 with this drive (removing one of the suspected drives).

Analyze further one of the two suspected hard drives to try to isolate source of data corruption. Is it the drive? Or, should we instead be looking at the hardware controller?
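One way to approach that question (an illustration, not necessarily the exact procedure used): attach the suspect drive to a separate test machine and look at its SMART data and a read-only surface scan. If both come back clean, suspicion shifts toward the controller, cabling, or backplane.

    # Full SMART report: reallocated/pending sector counts and the error log
    # are the usual signs that the drive itself is failing:
    smartctl -a /dev/sdX

    # Non-destructive read test of the entire surface (slow on a 3TB drive):
    badblocks -sv /dev/sdX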

May need to purchase a second 3TB hard drive, or make another investment to get everything working properly again.

3) Debrief and consider investments to reduce future risk

How can the problem be prevented? Is that worth doing?*

What can be done ahead of time to reduce down-time following such a failure in the future? Is that worth doing?*

What can be done to reduce the risk of losing the data due to local failures, such as this one? Is that worth doing?*

* If not worth investing in prevention and risk reduction, clarify the risks being taken and adjust associated service expectations so as not to put an undue strain on IT support resources.

Notes

On 10/2/13 (Wed), Matrix became unavailable.

System has 8 drives on a single hardware controller.

Drives for OS:

Disk number | Notes
6           | OK
7           | OK

Drives for "Data":

Disk number | Notes
0           | degraded?
1           | OK
2           | degraded?
3           | ECC_Error
4           | OK
5           | OK

Fedora 13: Old OS. This means some tools we want to use that were created for a contemporary OS (some of which we've successfully used elsewhere) may not work. Ex:
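Whichever tools end up being needed, a quick sanity check before relying on one is to compare what is installed on Matrix against what the tool requires; a minimal sketch, using ddrescue (already in use in this recovery) as the example:

    cat /etc/fedora-release    # confirms the 2010-era release
    ldd --version              # glibc version; newer prebuilt binaries may refuse to run
    which ddrescue && ddrescue --version    # is the tool present at all, and how old is it?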