The Matrix cluster is currently unavailable due to a problem with its data storage.

12/2013: EZ-Backup data:

Situation

See the recovery project's punch list.

Update

11/15/13, Friday

=> Researchers to confirm they have all their data from their clean-up. We recognize this may require running some test jobs. Do this as soon as possible!

Deadline for getting older files back is:


ChemIT's hard drive testing status.

NOTE: Testing takes ~24 hours per drive using drive diagnostic tools (example commands appear after the table).

Hard drive number | Size  | Purpose | Test status | Result, notes
0                 | 3TB   | data    | N/A         | Physically broken connector. File system suspect from initial failure.
1                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
2                 | 3TB   | data    | PASSED      | File system suspect from initial failure. Drive passed testing and zeroing.
3                 | 3TB   | data    | FAILED      | Reported ECC error during initial recovery/failure. Drive being replaced via Seagate ASAP; replacement may take a while as Seagate is having supply issues.
4                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
5                 | 3TB   | data    | PASSED      | Drive passed testing and was successfully zeroed.
6                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.
7                 | 160GB | OS      | PASSED      | Passed long test on 10/15/2013.

OS = Operating System
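The drive tools and test steps aren't spelled out above. As a hedged illustration of why each drive ties up roughly a day, one common combination is the drive's built-in SMART extended self-test followed by a full zero-fill; device names like /dev/sdX below are placeholders, not the actual devices used.

    # Start the drive's built-in extended (long) self-test; several hours on a 3TB disk:
    smartctl -t long /dev/sdX

    # When it finishes, review the self-test log and overall health verdict:
    smartctl -l selftest /dev/sdX
    smartctl -H /dev/sdX

    # Zero-fill a drive that passes (destroys all data on /dev/sdX):
    dd if=/dev/zero of=/dev/sdX bs=1M

The zero-fill doubles as a whole-surface write test: any unwritable sector shows up as an I/O error.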


As of 10/15/13, Tues: Hard drive testing in full swing (see above table).

As of 10/11/13, Fri: Oliver met with Harold and Yi to review next steps. Backup discussion planned for next week.

As of 10/8/13, 10am: ChemIT staff briefed Yi He in person.

As of 10/7/13, 5pm: Oliver briefed Yi He in person.

As of 10/7/13, 10:30am: Oliver and Michael Hint briefed Harold and Yi in person.

As of 10/4/13, 3:30pm: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/4/13, 11am: Oliver briefed Yi He in person. See details in "Status", below.

As of 10/3/13, 4pm:

As of 10/3/13, 10:30am:

As of 10/2/13, 3pm:

Addressing this issue is expected to take days, not hours.

We still hope that no data on the data storage system has been irretrievably lost.

Status

As of 10/8/13 (Tues), 11am:

As of 10/4/13, 4pm:

As of 10/4/13, AM:

10/4/13, Friday afternoon:

We have created space for recovered "OS" files and have started that copy. (dd of 160GB; ~77GB of actual data)
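The "dd of 160GB" is a raw, block-for-block copy of the whole OS drive, so it transfers the full 160GB even though only ~77GB of it is actual data. A minimal sketch, with hypothetical device and file names:

    # Raw image of the 160GB OS drive into a file on the recovery storage:
    dd if=/dev/sdX of=/recovery/matrix-os.img bs=1M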

We are working to create space for recovered "scheraga" files:

10/3/13, afternoon: ChemIT has placed an order for 3TB for Matrix.

10/3/13, 2:45pm: 3TB hard drive approved by Harold Scheraga, for under $300. ChemIT has placed an order.

10/3/13, noonish: Using the data on 4 of the hard disks, plus RAID 6 parity, to reconstitute the data on the 2 suspect drives (each of which tests OK separately with a "quick" test).
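For context: RAID 6 keeps two independent parity blocks per stripe, so the contents of any two members can be recomputed from the remaining members. Matrix's array sits on a hardware RAID controller, but the same principle expressed in Linux software RAID terms (device names hypothetical) looks roughly like this:

    # Start a 6-member RAID 6 array with two members missing; the missing
    # data is recomputed on the fly from the 4 good members plus parity:
    mdadm --assemble --run /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    cat /proc/mdstat    # shows the array running degraded (4 of 6 devices)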

Plan

1) Make a copy of what we can currently see.

A) Copy the vulnerable data before doing anything else.

Provide Yi He a copy of the ~127GB of data we were able to copy the night of 10/3/13. Provide this on a USB drive.

Get this copy of the data completely off the system.
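To move the ~127GB onto the USB drive and off the system, something along these lines would do it; the source and mount-point paths are placeholders, not the actual ones used:

    # Copy the recovered files to the mounted USB drive, preserving
    # permissions, timestamps, and hard links:
    rsync -aHv /recovery/copy-2013-10-03/ /mnt/usb/
    sync    # flush writes before unplugging the drive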

Plan to create four 6TB storage locations:

Storage device (all ChemIT's) | Hard drive(s): ChemIT's | Hard drive(s): Scheraga's | Borrowed from Collum's cluster | Total storage (confirm: actual needed?) | Purpose | Notes
Synology storage device | 3TB + 3TB | | | 6TB | Store ddrescue of "Backup": "Image 1" (read-only) |
"Dell 1" | | | Two 3TB's | 6TB | Store ddrescue of "Scheraga": "Image 2" (read-only) |
"Dell 2" | | Two 3TB's (on order; arrive ~Tues, 10/8) | | 6TB | Restoration of "Image 1" |
"Dell 3" | | Two 3TB's (to order: when?) | | 6TB | Restoration of "Image 2" |
"Dell 4" | | Two 3TB's (to order: when?) | | 6TB | To complete new RAID 6 array with brand new disks | Necessary?
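The ddrescue images in the table ("Image 1", "Image 2") are resumable, read-only copies of the failing volumes. A hedged sketch of how such an image is typically made (device names and paths are placeholders):

    # Pass 1: copy everything readable quickly, skipping problem areas;
    # the map file records progress so later runs can resume:
    ddrescue -n /dev/sdX /mnt/synology/image1.img /mnt/synology/image1.map

    # Pass 2: go back and retry the problem areas a few times:
    ddrescue -r3 /dev/sdX /mnt/synology/image1.img /mnt/synology/image1.map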

B) Copy the OS data, as a precaution.

Storage device (all ChemIT's) | Hard drive(s): ChemIT's | Hard drive(s): Scheraga's | Total storage (confirm: actual needed?) | Purpose | Notes
Synology storage device | 250GB | | | Just in case. | Not bootable
USB "toaster" | 250GB | | | | Bootable

2) Work the problem

Install new hard drive. Reconstitute RAID 6 with this drive (removing one of the suspected drives).

Analyze further one of the two suspected hard drives to try to isolate source of data corruption. Is it the drive? Or, should we instead be looking at the hardware controller?
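One way to approach that question (an illustration, not necessarily the exact procedure used): attach the suspect drive to a separate test machine and look at its SMART data and a read-only surface scan. If both come back clean, suspicion shifts toward the controller, cabling, or backplane.

    # Full SMART report: reallocated/pending sector counts and the error log
    # are the usual signs that the drive itself is failing:
    smartctl -a /dev/sdX

    # Non-destructive read test of the entire surface (slow on a 3TB drive):
    badblocks -sv /dev/sdX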

May need to purchase a second 3TB hard drive, or make another investment to get everything working properly again.

3) Debrief and consider investments to reduce future risk

How can the problem be prevented? Is that worth doing?*

What can be done ahead of time to reduce down-time following such a failure in the future? Is that worth doing?*

What can be done to reduce the risk of losing the data due to local failures, such as this one? Is that worth doing?*

* If not worth investing in prevention and risk reduction, clarify the risks being taken and adjust associated service expectations so as not to put an undue strain on IT support resources.

Notes

On 10/2/13 (Wed), Matrix became unavailable.

System has 8 drives on a single hardware controller.

Drives for OS:

Disk number | Notes
6           | OK
7           | OK

Drives for "Data":

Disk number | Notes
0           | degraded?
1           | OK
2           | degraded?
3           | ECC_Error
4           | OK
5           | OK

Fedora 13: Old OS. This means some tools we want to use that were created for a contemporary OS (some of which we've successfully used elsewhere) may not work. Ex:
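Whichever tools end up being needed, a quick sanity check before relying on one is to compare what is installed on Matrix against what the tool requires; a minimal sketch, using ddrescue (already in use in this recovery) as the example:

    cat /etc/fedora-release    # confirms the 2010-era release
    ldd --version              # glibc version; newer prebuilt binaries may refuse to run
    which ddrescue && ddrescue --version    # is the tool present at all, and how old is it?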