Excerpt |
---|
The Matrix cluster is currently unavailable due to a problem with its data storage. |
12/2013: EZ-Backup data:
- Duration of backup: 60 minutes
- Total: 5.35 million files backed up, for a total of 1.57TB of data.
- This represents 52% compression (including versions?)
- Incremental backup.
- Most recent backup backed up 453K files, for a total of 5GB of data transferred.
Situation
See recovery project's punch list.
Update
11/15/13, Friday
- Matrix is now open to all researchers.
=> Researchers to confirm they have all their data from their clean-up. We recognize this may require running some test jobs. Do this as soon as possible!
Deadline for getting older files back is:
- Monday, 11/25 (but may be sooner if spare HDs needed earlier!): After this date, we will be unable to restore any data not on EZ-Backup.
...
ChemIT's hard drive testing status.
NOTE: ~24 hours of testing per drive, using drive tools as follows:
- For all drives, we conduct a read-only drive check. This is reasonably fast, taking a few hours for the 3TB drives.
- For the 3TB drives, we conduct a second test that writes 0's to every sector. This takes a long time, measured in tens of hours.
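
As a rough cross-check on the zeroing time, the duration of a full write pass is simply capacity divided by sustained write speed. The sketch below assumes a decimal 3 TB drive; the function name `hours_to_zero` and the throughput values are illustrative assumptions, not measurements taken from the Matrix drives.

```python
def hours_to_zero(capacity_bytes: float, write_mb_per_s: float) -> float:
    """Hours needed to write every byte once at a sustained rate (1 MB = 10**6 bytes)."""
    if write_mb_per_s <= 0:
        raise ValueError("write speed must be positive")
    seconds = capacity_bytes / (write_mb_per_s * 1e6)
    return seconds / 3600


THREE_TB = 3e12  # decimal terabytes, as drive vendors label capacity

# Hypothetical sustained write speeds; not measured on these drives.
for speed in (30, 60, 120):
    print(f"{speed:>4} MB/s -> {hours_to_zero(THREE_TB, speed):.1f} hours")
```

Slower effective speeds (for example, when several drives share one controller, or when a verify pass follows the write) push the estimate further into the tens of hours.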
Hard drive number | Size | Purpose | Test status | Result, notes |
---|---|---|---|---|
0 | 3TB | data | N/A | Physically broken connector. |
1 | 3TB | data | PASSED | Drive passed testing and was successfully zeroed |
2 | 3TB | data | PASSED | File system suspect from initial failure. - Drive passed testing and zeroing |
3 | 3TB | data | FAILED | Reported ECC error during initial recovery/ failure. - Drive being replaced via Seagate ASAP |
4 | 3TB | data | PASSED | Drive passed testing and was successfully zeroed |
5 | 3TB | data | PASSED | Drive passed testing and was successfully zeroed |
6 | 160GB | OS | PASSED | Passed long test on 10/15/2013 |
7 | 160GB | OS | PASSED | Passed long test on 10/15/2013 |
OS = Operating System
...
As of 10/15/13, Tues: Hard drive testing in full swing (see above table).
As of 10/11/13 Fri, Oliver met with Harold and Yi to review next steps. Backup discussion for next week.
As of 10/8/13, 10am: ChemIT staff briefed Yi He in person.
As of 10/7/13, 5pm: Oliver briefed Yi He in person.
As of 10/7/13, 10:30am: Oliver and Michael Hint briefed Harold and Yi in person.
As of 10/4/13, 3:30pm: Oliver briefed Yi He in person. See details in "Status", below.
As of 10/4/13, 11AM: Oliver briefed Yi He in person. See details in "Status", below.
As of 10/3/13, 4pm:
- All our efforts are focused on recovering data from the hard disks. Restoring function to Matrix is currently secondary.
- We have developed a plan and reviewed it with Yi He. We must now clarify timelines.
- For example, what purchase decisions are required by when?
- As we learn more, such as how long copies take, our plans may need adjusting.
As of 10/3/13, 10:30am:
- We tried copying the data from the Matrix disks to ChemIT disks to create a backup which resides independently from your hardware.
- However, when we arrived this morning, we found the copy did not complete (176GB of 3.1TB). And worse, we now can't see much of the original data.
- We have called in additional expertise to further help characterize the problem, especially now that we can't even see the original data.
As of 10/2/13, 3pm:
Addressing this issue is expected to take days, not hours.
...
- However, the situation is precarious since two fail-safes have failed.
- Since the system can only accommodate the loss of 2 hard drives (of 6), we are now at high risk: 2 of the hard drives seem to have failed, and a third is now issuing warning signs.
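
The risk reasoning above can be sketched as a small margin calculation: an array that tolerates a fixed number of drive losses has no protection left once that many drives have failed. The function name `array_risk` and the category labels are hypothetical, chosen only for illustration. The figures in the usage line come from the note above.

```python
def array_risk(tolerated: int, failed: int, warning: int = 0) -> str:
    """Classify risk for an array that survives up to `tolerated` drive losses."""
    margin = tolerated - failed  # further failures the array can still absorb
    if margin < 0:
        return "lost"
    if margin == 0:
        # No redundancy left; a drive showing warnings makes the next loss more likely.
        return "critical" if warning > 0 else "unprotected"
    return "degraded" if failed > 0 else "healthy"


# 2 losses tolerated (of 6 drives), 2 apparently failed, 1 issuing warnings.
print(array_risk(tolerated=2, failed=2, warning=1))  # critical
```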
Status
As of 10/8/13 (Tues), 11am:
- Last night we had some success restoring from the ddrescue'd copy from <scheraga> to an NFS-mounted 6TB volume.
As of 10/4/13, 4pm:
- We will shortly start a ddrescue copy from <scheraga> to an NFS-mounted 6TB volume. It will run over the weekend; we expect it to take at least 2 days, maybe more.
- A test RAID controller card (3ware, or 3ware LSI) is to arrive ~1pm. We need that on hand for when we're ready to focus on root cause.
- We are starting preliminary work to get our second 6TB temp. storage device set up. This involves pulling hard drives from Collum's cluster and setting up a separate Linux box.
- The two 3TB HDs we ordered yesterday are still scheduled for ~1pm Tuesday arrival. This is for our third 6TB temp. storage device.
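
For reference, a ddrescue copy of this kind is often run in two passes: a fast first pass that skips the slow handling of bad areas, then a retry pass over whatever could not be read. The sketch below only builds the command lines and does not execute anything. The device path, image path, mapfile path, and retry count are hypothetical placeholders, and the options actually used for the <scheraga> copy are not recorded here.

```python
import shlex
from typing import List


def ddrescue_passes(source: str, image: str, mapfile: str, retries: int = 3) -> List[List[str]]:
    """Build two ddrescue invocations (fast pass, then retry pass); nothing is run."""
    # Pass 1: copy everything readable quickly, skipping the slow scraping of bad areas.
    first = ["ddrescue", "-n", source, image, mapfile]
    # Pass 2: revisit the bad areas with direct disc access, retrying `retries` times.
    second = ["ddrescue", "-d", f"-r{retries}", source, image, mapfile]
    return [first, second]


# Hypothetical paths; substitute the real source partition and the NFS-mounted target.
for cmd in ddrescue_passes("/dev/sdX1", "/mnt/temp6tb/scheraga.img", "/mnt/temp6tb/scheraga.map"):
    print(shlex.join(cmd))
```

Keeping the mapfile on the destination volume lets an interrupted multi-day copy resume where it stopped instead of starting over.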
As of 10/4/13, AM:
- The OS copy running overnight was taking too long so we aborted it so we could focus on the <scheraga> recovery today. We must get back to this copy after we've worked on the <scheraga> partition some more.
- We allowed ourselves to reboot Matrix so we could off-load demand on the hardware RAID controller and its hard drives by booting off of a LiveUSB OS. This also gave us the opportunity to try different hard disk configurations to optimize data visibility before the multi-day disk imaging effort. We won't know the results of our final configuration choice until after we've copied off the partition(s).
10/4/13, Thursday afternoon:
We have created space for recovered "OS" files and have started that copy. (dd of 160GB; ~77GB of actual data)
- We weren't ready to do the <scheraga> data copy so we started this other one in the meantime.
We are working to create space for recovered "scheraga" files:
- 2 server-class 3TB drives authorized for use from Collum cluster.
- 2 more server-class 3TB drives purchased. (Return ones we bought yesterday?)
- Consider: Buy 4-6 more server-class 3TB drives. Timing, if needed?
...
10/3/13, afternoon: ChemIT has placed an order for 3TB for Matrix.
...