
Excerpt

The Matrix cluster is currently unavailable due to a problem with its data storage.

12/2013: EZ-Backup data:

  • Duration of backup: 60 minutes.
  • Total: 5.35 million files backed up, for a total of 1.57TB of data.
    • This represents 52% compression (including versions?).
  • Backups are incremental; a sketch of checking a nightly incremental from the client follows this list.
    • The most recent backup backed up 453K files, transferring a total of 5GB of data.
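
EZ-Backup is built on IBM's TSM client, so a nightly incremental can be run or verified with the standard dsmc tool. A minimal sketch, assuming a hypothetical /home/scheraga filespace (the actual node and paths may differ):

```
# Run (or re-run) the scheduled incremental backup by hand:
dsmc incremental /home/scheraga

# Confirm what made it into EZ-Backup; -subdir=yes recurses,
# -inactive also lists older (inactive) versions:
dsmc query backup "/home/scheraga/*" -subdir=yes -inactive
```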

Situation

See the recovery project's punch list.

Update

11/15/13, Friday

  • Matrix is now open to all researchers.

=> Researchers to confirm they have all their data from their clean-up. We recognize this may require running some test jobs. Do this as soon as possible!

Deadline for getting older files back is:

  • Monday, 11/25 (but may be sooner if spare HDs needed earlier!): After this date, we will be unable to restore any data not on EZ-Backup.

...

ChemIT's hard drive testing status.

NOTE: Testing takes roughly 24 hours per drive, using drive tools as follows:

  • For all drives, we conduct a read-only drive check. This is reasonably fast: a few hours for the 3TB drives.
  • For the 3TB drives, we conduct a second test that writes 0's to every sector. This takes a long time, measured in tens of hours. (A sketch of both passes follows this list.)
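
The page doesn't name the drive tools; a minimal sketch of the two passes using standard Linux utilities (smartctl, badblocks, dd), with /dev/sdX standing in for the drive under test:

```
# Pass 1 (all drives): read-only checks; a few hours on a 3TB drive.
smartctl -t long /dev/sdX    # start the drive's built-in long self-test
smartctl -a /dev/sdX         # later: review self-test results and SMART attributes
badblocks -sv /dev/sdX       # read-only surface scan of every sector

# Pass 2 (3TB data drives only): write 0's to every sector.
# DESTRUCTIVE, and takes tens of hours on a 3TB drive.
dd if=/dev/zero of=/dev/sdX bs=4M conv=fsync
```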

Hard drive number | Size  | Purpose | Test status | Result, notes
------------------|-------|---------|-------------|--------------
0                 | 3TB   | data    | N/A         | Physically broken connector. File system suspect from initial failure.
1                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
2                 | 3TB   | data    | PASSED      | File system suspect from initial failure. Drive passed testing and zeroing.
3                 | 3TB   | data    | FAILED      | Reported ECC error during initial recovery/failure. Drive being replaced via Seagate ASAP; the replacement may take a while, as Seagate is having supply issues.
4                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
5                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
6                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.
7                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.

OS = Operating System

...

As of 10/15/13, Tues: Hard drive testing in full swing (see above table).

As of 10/11/13, Fri: Oliver met with Harold and Yi to review next steps. Backup discussion planned for next week.

As of 10/8/13, 10am: ChemIT staff briefed Yi He in person.

As of 10/7/13, 5pm: Oliver briefed Yi He in person.

As of 10/7/13, 10:30am: Oliver and Michael Hint briefed Harold and Yi in person.

As of 10/4/13, 3:30pm: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/4/13, 11AM: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/3/13, 4pm:

  • All our efforts are focused on recovering data from the hard disks. Restoring function to Matrix is currently secondary.
  • We have developed a plan and reviewed it with Yi He. However, we must now clarify timelines.
    • As we learn more, such as how long copies take, our plans may need adjusting.
      • For example, what purchase decisions are required, and by when?

As of 10/3/13, 10:30am:

  • We tried copying the data from the Matrix disks to ChemIT disks to create a backup that resides independently from your hardware.
  • However, when we arrived this morning, we found the copy did not complete (176GB of 3.1TB). And worse, we now can't see much of the original data.
  • We have called in additional expertise to help further characterize the problem, especially now that we can't even see the original data.

As of 10/2/13, 3pm:

Addressing this issue is expected to take days, not hours.

...

  • However, the situation is precarious, since two fail-safes have failed.
  • The system can only accommodate the loss of 2 hard drives (of 6), so we are now at high risk: 2 of the hard drives appear to have failed, and a third is now issuing warning signs. (A sketch of checking array health is below.)
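
The array appears to sit behind a 3ware RAID controller (a test 3ware card is mentioned under "Status"), whose unit and per-drive health can be checked with the vendor CLI. A sketch, assuming controller 0 and unit 0 (numbers hypothetical):

```
tw_cli /c0 show       # controller 0: unit (array) status and per-port drive status
tw_cli /c0/u0 show    # detail for unit 0 (e.g. OK, DEGRADED, REBUILDING)
```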

Status

As of 10/8/13 (Tues), 11am:

  • Last night we had some success restoring from the ddrescue'd copy from <scheraga> to an NFS-mounted 6TB volume.
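
For reference, files can be restored from the rescued image via a read-only loopback mount, without touching the original drives. A sketch, assuming the image holds a single filesystem (all paths hypothetical):

```
mkdir -p /mnt/restore
mount -o ro,loop /mnt/nfs6tb/scheraga.img /mnt/restore   # read-only loopback mount
rsync -av /mnt/restore/path/to/files/ /destination/      # copy recovered files out
```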

As of 10/4/13, 4pm:

  • We will shortly start a ddrescue copy from <scheraga> to an NFS-mounted 6TB volume. It will run over the weekend, and we expect it to take at least 2 days, maybe more. (A sketch of such a run follows this list.)
  • A test RAID controller card (3ware, or 3ware LSI) is to arrive ~1pm. We need that on hand for when we're ready to focus on root cause.
  • We are starting preliminary work to get our second 6TB temp. storage device set up. This involves pulling hard drives from Collum's cluster and setting up a separate Linux box.
    • The two 3TB HDs we ordered yesterday are still scheduled for ~1pm Tuesday arrival. These are for our third 6TB temp. storage device.
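
A minimal sketch of that kind of ddrescue run (device and NFS paths hypothetical). The map file records progress, so a multi-day copy can be interrupted and resumed:

```
# First pass: copy everything easily readable, skipping the slow scraping phase.
ddrescue -n /dev/sdX /mnt/nfs6tb/scheraga.img /mnt/nfs6tb/scheraga.map

# Second pass: go back and retry the bad areas a few times.
ddrescue -r3 /dev/sdX /mnt/nfs6tb/scheraga.img /mnt/nfs6tb/scheraga.map
```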

As of 10/4/13, AM:

  • The OS copy running overnight was taking too long, so we aborted it to focus on the <scheraga> recovery today. We must return to this copy after we've worked on the <scheraga> partition some more.
  • We allowed ourselves to reboot Matrix so we could off-load demand on the hardware RAID controller and its hard drives by booting off of a LiveUSB OS. This also gave us the opportunity to try different hard disk configurations to optimize data visibility, and to do so before the multi-day disk imaging effort. We won't know the results of our final configuration choice until after we've copied off the partition(s).

10/3/13, Thursday afternoon:

We have created space for recovered "OS" files and have started that copy. (dd of 160GB; ~77GB of actual data)

  • We weren't ready to do the <scheraga> data copy so we started this other one in the meantime.
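
A sketch of such a raw dd image of the 160GB OS drive (device and paths hypothetical); conv=noerror,sync keeps the copy going past read errors, padding unreadable blocks with zeros:

```
dd if=/dev/sda of=/mnt/nfs6tb/os.img bs=1M conv=noerror,sync
```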

We are working to create space for recovered "scheraga" files:

  • 2 server-class 3TB drives authorized for use from Collum cluster.
  • 2 more server-class 3TB drives purchased. (Return the ones we bought yesterday?)
  • Consider: Buy 4-6 more server-class 3TB drives. Timing, if needed? (A sketch of assembling a pair into one 6TB temp. volume follows this list.)
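
Each "6TB temp. storage device" pairs two 3TB drives. One way to assemble such a volume on a separate Linux box is software RAID-0 (capacity, no redundancy); a sketch with hypothetical device names:

```
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0                 # put a filesystem on the 6TB volume
mount /dev/md0 /mnt/nfs6tb         # then export over NFS via /etc/exports
```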

...

10/3/13, afternoon: ChemIT has placed an order for 3TB drives for Matrix.

...