The Matrix cluster is currently unavailable due to a problem with its data storage.

Situation

As of 10/4/13, 10:30am:

  • We tried copying the data from your disks to ours to create a backup that is independent of your hardware.
  • However, when we arrived this morning, we found that the copy had not completed (176GB of 3.1TB). Worse, we now can't see much of the original data.
  • We have called in additional expertise to help characterize the problem, especially now that we can't see much of the original data.

As of 10/3/13, 3pm:

Addressing this issue is expected to take days, not hours.

  • Just copying the data takes a day or more, before counting the work required to diagnose and solve the problem.

We still hope that no data on the data storage system has been irretrievably lost.

  • However, the situation is precarious since two fail-safes have failed.
  • Since the system can only tolerate the loss of 2 hard drives (of 6), we are now at high risk: 2 of the hard drives seem to have failed, and a third is now showing warning signs.

Status

10/3/13, afternoon: ChemIT has placed an order for a 3TB hard drive for Matrix.

  • ChemIT has also ordered, for its own general use, a 4TB consumer drive. This can be used for a short-term backup in this situation, further decreasing the risk of data loss by enabling one more copy of the data.

10/3/13, 2:45pm: 3TB hard drive approved by Harold Scheraga, for under $300. ChemIT has placed an order.

10/3/13, noonish: Using the data on 4 hard disks, RAID 6 is reconstituting the data on 2 drives, each of which tested OK separately with a "quick" test.

  • 2% done after over 3 hours...
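For a rough sense of scale, the progress figure above can be extrapolated linearly. The sketch below assumes the rebuild continues at a constant rate, which is an assumption and not something the controller guarantees; the function name is illustrative only.

```python
def estimate_total_hours(percent_done: float, hours_elapsed: float) -> float:
    """Linearly extrapolate the total run time from progress so far.

    Assumes a constant rate, which a rebuild on failing disks may not hold.
    """
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return hours_elapsed * 100.0 / percent_done


# Figures from the status entry above: 2% done after 3 hours.
total = estimate_total_hours(percent_done=2, hours_elapsed=3)
print(f"Estimated total: {total:.0f} hours")
```

Since the entry says "over 3 hours", 150 hours (about six days) is a lower bound under this assumption.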

Plan

1) Make a copy

Copy the vulnerable data before doing anything else.

  • Any further activity may put the data at more risk, so don't do anything other than what is required to get this done, until this has been done.
  • Data was not accessible after one of our reboots. Thus, we are trying to avoid rebooting the system while the data is accessible.

Get this copy of the data completely off the system.

  • This process may take days, not hours.
    • It's ~3.1TB of data, with an enormous number of small files.
  • Confirm that the data copy is complete.
  • Further duplicate that data on yet other hard drive(s), especially before deleting any original data.
  • Use ChemIT's hardware for this temporary storage, as a "loan".
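One way to confirm that a copy is complete is to compare the two directory trees file by file. The sketch below is a minimal illustration, not the procedure ChemIT used: it assumes both the source and the copy are mounted as ordinary directories, and the function names are hypothetical.

```python
import hashlib
import os


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root: str, copy_root: str) -> dict:
    """List files that are missing from, or differ in, the copy."""
    missing, different = [], []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            source_path = os.path.join(dirpath, name)
            relative = os.path.relpath(source_path, source_root)
            copy_path = os.path.join(copy_root, relative)
            if not os.path.isfile(copy_path):
                missing.append(relative)
            elif file_digest(source_path) != file_digest(copy_path):
                different.append(relative)
    return {"missing": sorted(missing), "different": sorted(different)}
```

Hashing re-reads every source file, so on a fragile array it may be safer to run the comparison between two copies rather than against the original disks.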

Copy the OS data, as a precaution.

  • This process takes hours. It's less than 250GB, the size of the partition.

Plan to create four ~(>?)6TB storage locations

| Storage device (all ChemIT's) | Hard drive(s), ChemIT's | Hard drive(s), Scheraga's | Total storage (confirm: actual needed?) | Purpose | Notes |
|---|---|---|---|---|---|
| Synology storage device | 3TB + 3TB + 0.25TB | - | 6.25TB | Store ddrescue of "Backup": "Image 1" (read-only) | |
| "Dell 1" | 4TB (on order) | 3TB (on order) | 7TB | Store ddrescue of "Scherago": "Image 2" (read-only) | |
| "Dell 2" | | | 7TB | Restoration of "Image 1" | |
| "Dell 3" | | | 7TB | Restoration of "Image 1" | |

2) Work the problem

Install the new hard drive. Reconstitute RAID 6 with this drive (removing one of the suspect drives).

Further analyze one of the two suspect hard drives to try to isolate the source of the data corruption. Is it the drive? Or should we instead be looking at the hardware controller?

May need to purchase a second 3TB hard drive. Or, make another investment to get everything working properly again.

3) Debrief and consider investments to reduce future risk

How can the problem be prevented? Is that worth doing?*

What can be done ahead of time to reduce down-time following such a failure in the future? Is that worth doing?*

What can be done to reduce the risk of losing the data due to local failures, such as this one? Is that worth doing?*

* If it is not worth investing in prevention and risk reduction, clarify the risks being taken and adjust the associated service expectations so as not to put an undue strain on IT support resources.

Notes

On 10/2/13 (Wed), Matrix became unavailable.

System has 8 drives on a single hardware controller.

  • 2 (150GB) hard drives for the OS, RAID 1 ("6" and "7").
  • 6 (3 TB) hard drives for the data storage ("0" - "5").

Drives for OS:

| Disk number | 6 | 7 |
|---|---|---|
| Notes | OK | OK |

  • Status: All seems fine
  • RAID 1 over two 150GB drives. Thus, have access to just under 150GB.

Drives for "Data":

| Disk number | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Notes | degraded? | OK | degraded? | ECC_Error | OK | OK |

  • Status: 2-3 drives are suspect. Something wrong with hardware controller instead?!
  • RAID 6 over 3TB drives. Thus, have access to 12 TB (of the 18TB total).
  • Two partitions, 5.4TB (of 6TB theoretical) each:
    • One with the data itself (using 3.1TB of 5.4TB space)
    • The other with versioned copies of the data (full: all 5.4TB of space used)
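The capacity figures above follow from the standard RAID 6 rule that two drives' worth of space is used for parity. The sketch below reproduces the arithmetic; the function names are illustrative, and the decimal-to-binary unit conversion is offered only as a possible explanation for the 5.4TB figure.

```python
def raid6_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity of RAID 6: two drives' worth of space holds parity."""
    if drive_count < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drive_count - 2) * drive_size_tb


def tb_to_tib(terabytes: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return terabytes * 10**12 / 2**40


usable = raid6_usable_tb(drive_count=6, drive_size_tb=3)
print(f"{usable} TB usable of {6 * 3} TB raw")
print(f"6 TB is about {tb_to_tib(6):.2f} TiB")
```

For the six 3TB data drives this gives 12TB usable of 18TB raw, matching the figures above. The 5.4TB reported per 6TB partition is consistent with decimal terabytes being shown in binary units (6TB is about 5.46TiB), though that reading is an assumption and filesystem overhead may also contribute.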

Fedora 13: Old OS. This means some tools we want to use that were created for a contemporary OS (some of which we've successfully used elsewhere) may not work. Ex:

  • iSCSI, to more quickly move data to a different hard disk array and to give us more flexible options for pulling the vulnerable data.
  • Tool to better monitor the hardware disk controller, without requiring system reboots.
    • Reboots are to be avoided when we can see the data because the data was not consistently visible after every reboot.