
Excerpt

The Matrix cluster is currently unavailable due to a problem with its data storage.

12/2013: EZ-Backup data:

  • Duration of backup: 60 minutes.
  • Total: 5.35 million files backed up, for a total of 1.57TB of data.
    • This represents 52% compression (including versions?).
  • Backups are incremental; a sketch of checking a nightly incremental from the client follows this list.
    • The most recent backup backed up 453K files, transferring a total of 5GB of data.
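
EZ-Backup is built on IBM's TSM client, so a nightly incremental can be run or verified with the standard dsmc tool. A minimal sketch, assuming a hypothetical /home/scheraga filespace (the actual node and paths may differ):

```
# Run (or re-run) the scheduled incremental backup by hand:
dsmc incremental /home/scheraga

# Confirm what made it into EZ-Backup; -subdir=yes recurses,
# -inactive also lists older (inactive) versions:
dsmc query backup "/home/scheraga/*" -subdir=yes -inactive
```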

Situation

See the recovery project's punch list.

Update

11/15/13, Friday

  • Matrix is now open to all researchers.

=> Researchers to confirm they have all their data from their clean-up. We recognize this may require running some test jobs. Do this as soon as possible!

Deadline for getting older files back is:

  • Monday, 11/25 (but may be sooner if spare HDs needed earlier!): After this date, we will be unable to restore any data not on EZ-Backup.

...

ChemIT's hard drive testing status.

NOTE: Testing takes roughly 24 hours per drive, using drive tools as follows:

  • For all drives, we conduct a read-only drive check. This is reasonably fast: a few hours for the 3TB drives.
  • For the 3TB drives, we conduct a second test that writes 0's to every sector. This takes a long time, measured in tens of hours. (A sketch of both passes follows this list.)
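
The page doesn't name the drive tools; a minimal sketch of the two passes using standard Linux utilities (smartctl, badblocks, dd), with /dev/sdX standing in for the drive under test:

```
# Pass 1 (all drives): read-only checks; a few hours on a 3TB drive.
smartctl -t long /dev/sdX    # start the drive's built-in long self-test
smartctl -a /dev/sdX         # later: review self-test results and SMART attributes
badblocks -sv /dev/sdX       # read-only surface scan of every sector

# Pass 2 (3TB data drives only): write 0's to every sector.
# DESTRUCTIVE, and takes tens of hours on a 3TB drive.
dd if=/dev/zero of=/dev/sdX bs=4M conv=fsync
```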

Hard drive number | Size  | Purpose | Test status | Result, notes
------------------|-------|---------|-------------|--------------
0                 | 3TB   | data    | N/A         | Physically broken connector. File system suspect from initial failure.
1                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
2                 | 3TB   | data    | PASSED      | File system suspect from initial failure. Drive passed testing and zeroing.
3                 | 3TB   | data    | FAILED      | Reported ECC error during initial recovery/failure. Drive being replaced via Seagate ASAP; the replacement may take a while, as Seagate is having supply issues.
4                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
5                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
6                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.
7                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.

OS = Operating System

...

As of 10/15/13, Tues: Hard drive testing in full swing (see above table).

As of 10/11/13, Fri: Oliver met with Harold and Yi to review next steps. Backup discussion planned for next week.

As of 10/8/13, 10am: ChemIT staff briefed Yi He in person.

As of 10/7/13, 5pm: Oliver briefed Yi He in person.

As of 10/7/13, 10:30am: Oliver and Michael Hint briefed Harold and Yi in person.

As of 10/4/13, 3:30pm: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/4/13, 11AM: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/3/13, 4pm:

  • All our efforts are focused on recovering data from the hard disks. Restoring function to Matrix is currently secondary.
  • We have developed a plan and reviewed it with Yi He. However, we must now clarify timelines.
    • As we learn more, such as how long copies take, our plans may need adjusting.
      • For example, what purchase decisions are required, and by when?

As of 10/3/13, 10:30am:

  • We tried copying the data from the Matrix disks to ChemIT disks to create a backup that resides independently from your hardware.
  • However, when we arrived this morning, we found the copy did not complete (176GB of 3.1TB). And worse, we now can't see much of the original data.
  • We have called in additional expertise to help further characterize the problem, especially now that we can't even see the original data.

As of 10/2/13, 3pm:

Addressing this issue is expected to take days, not hours.

...

  • However, the situation is precarious, since two fail-safes have failed.
  • The system can only accommodate the loss of 2 hard drives (of 6), so we are now at high risk: 2 of the hard drives appear to have failed, and a third is now issuing warning signs. (A sketch of checking array health is below.)
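
The array appears to sit behind a 3ware RAID controller (a test 3ware card is mentioned under "Status"), whose unit and per-drive health can be checked with the vendor CLI. A sketch, assuming controller 0 and unit 0 (numbers hypothetical):

```
tw_cli /c0 show       # controller 0: unit (array) status and per-port drive status
tw_cli /c0/u0 show    # detail for unit 0 (e.g. OK, DEGRADED, REBUILDING)
```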

Status

As of 10/8/13 (Tues), 11am:

  • Last night we had some success restoring from the ddrescue'd copy from <scheraga> to an NFS-mounted 6TB volume.
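
For reference, files can be restored from the rescued image via a read-only loopback mount, without touching the original drives. A sketch, assuming the image holds a single filesystem (all paths hypothetical):

```
mkdir -p /mnt/restore
mount -o ro,loop /mnt/nfs6tb/scheraga.img /mnt/restore   # read-only loopback mount
rsync -av /mnt/restore/path/to/files/ /destination/      # copy recovered files out
```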

As of 10/4/13, 4pm:

  • We will shortly start a ddrescue copy from <scheraga> to an NFS-mounted 6TB volume. It will run over the weekend, and we expect it to take at least 2 days, maybe more. (A sketch of such a run follows this list.)
  • A test RAID controller card (3ware, or 3ware LSI) is to arrive ~1pm. We need that on hand for when we're ready to focus on root cause.
  • We are starting preliminary work to get our second 6TB temp. storage device set up. This involves pulling hard drives from Collum's cluster and setting up a separate Linux box.
    • The two 3TB HDs we ordered yesterday are still scheduled for ~1pm Tuesday arrival. These are for our third 6TB temp. storage device.
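
A minimal sketch of that kind of ddrescue run (device and NFS paths hypothetical). The map file records progress, so a multi-day copy can be interrupted and resumed:

```
# First pass: copy everything easily readable, skipping the slow scraping phase.
ddrescue -n /dev/sdX /mnt/nfs6tb/scheraga.img /mnt/nfs6tb/scheraga.map

# Second pass: go back and retry the bad areas a few times.
ddrescue -r3 /dev/sdX /mnt/nfs6tb/scheraga.img /mnt/nfs6tb/scheraga.map
```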

As of 10/4/13, AM:

  • The OS copy running overnight was taking too long, so we aborted it to focus on the <scheraga> recovery today. We must return to this copy after we've worked on the <scheraga> partition some more.
  • We allowed ourselves to reboot Matrix so we could off-load demand on the hardware RAID controller and its hard drives by booting off of a LiveUSB OS. This also gave us the opportunity to try different hard disk configurations to optimize data visibility, and to do so before the multi-day disk imaging effort. We won't know the results of our final configuration choice until after we've copied off the partition(s).

10/3/13, Thursday afternoon:

We have created space for recovered "OS" files and have started that copy. (dd of 160GB; ~77GB of actual data)

  • We weren't ready to do the <scheraga> data copy so we started this other one in the meantime.
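
A sketch of such a raw dd image of the 160GB OS drive (device and paths hypothetical); conv=noerror,sync keeps the copy going past read errors, padding unreadable blocks with zeros:

```
dd if=/dev/sda of=/mnt/nfs6tb/os.img bs=1M conv=noerror,sync
```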

We are working to create space for recovered "scheraga" files:

  • 2 server-class 3TB drives authorized for use from Collum cluster.
  • 2 more server-class 3TB drives purchased. (Return the ones we bought yesterday?)
  • Consider: Buy 4-6 more server-class 3TB drives. Timing, if needed? (A sketch of assembling a pair into one 6TB temp. volume follows this list.)
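
Each "6TB temp. storage device" pairs two 3TB drives. One way to assemble such a volume on a separate Linux box is software RAID-0 (capacity, no redundancy); a sketch with hypothetical device names:

```
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0                 # put a filesystem on the 6TB volume
mount /dev/md0 /mnt/nfs6tb         # then export over NFS via /etc/exports
```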

...

10/3/13, afternoon: ChemIT has placed an order for 3TB drives for Matrix.

...