Matrix currently unavailable due to a problem with its data storage.

Situation

As of 10/3/13, 3pm:

Addressing this issue is expected to take days, not hours. Just copying the data takes a day or more, and that is before any of the work required to diagnose and solve the problem.

We still hope that no data on the data storage system has been irretrievably lost. However, the situation is precarious since two fail-safes have failed.

  • The system can accommodate the loss of only 2 hard drives (of 6). We are now at high risk because 2 of the hard drives seem to have failed, and a third is now showing warning signs.
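To make the risk concrete, here is a minimal sketch of the RAID 6 arithmetic. It assumes the layout listed in the Notes section (6 data drives of 3 TB each) and the standard RAID 6 property of two parity blocks per stripe. The function name `raid6_summary` is made up for illustration and is not part of any tool used on Matrix.

```python
# Illustrative arithmetic only; drive count and size come from the Notes section.
RAID6_PARITY_DRIVES = 2  # RAID 6 keeps two independent parity blocks per stripe


def raid6_summary(total_drives: int, drive_tb: float, failed_drives: int) -> dict:
    """Return usable capacity and remaining failure margin for a RAID 6 array."""
    usable_tb = (total_drives - RAID6_PARITY_DRIVES) * drive_tb
    margin = RAID6_PARITY_DRIVES - failed_drives
    return {
        "usable_tb": usable_tb,
        "failures_tolerated": RAID6_PARITY_DRIVES,
        "remaining_margin": margin,          # 0: the next failure loses the array
        "no_redundancy_left": margin <= 0,
    }


summary = raid6_summary(total_drives=6, drive_tb=3, failed_drives=2)
print(summary)
```

With 2 drives already failed, the remaining margin is zero. If the third drive also fails, the array can no longer be reconstructed from parity.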

Status

10/3/13, afternoon: ChemIT has placed an order for a 3TB hard drive for Matrix.

  • ChemIT has also ordered, for its own general use, a 4TB consumer drive. This can be used for a short-term backup in this situation, further decreasing the risk of data loss by enabling one more copy of the data.

10/3/13, 2:45pm: 3TB hard drive approved by Harold Scheraga. ChemIT has placed an order.

10/3/13, noonish: RAID 6 is using the data on 4 hard disks to reconstitute the data on the 2 suspect drives, each of which tested OK separately with a "quick" test.

  • 2% done after over 3 hours...
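As a rough guide to what that progress rate implies, here is a small sketch that extrapolates linearly from the figures above (2% after 3 hours). It assumes the rebuild continues at a constant rate, which may not hold. Because the elapsed time was "over" 3 hours, the result is closer to a lower bound. The function name is illustrative only.

```python
def estimate_total_hours(percent_done: float, hours_elapsed: float) -> float:
    """Linear extrapolation: total time if progress continues at the same rate."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return hours_elapsed * 100.0 / percent_done


# Figures from the status entry above: 2% done after (at least) 3 hours.
elapsed_hours = 3
total_hours = estimate_total_hours(percent_done=2, hours_elapsed=elapsed_hours)
remaining_hours = total_hours - elapsed_hours

print(f"Estimated total: {total_hours:.0f} hours")
print(f"Estimated remaining: {remaining_hours:.0f} hours")
```

At this rate the reconstitution would need about 150 hours in total, or roughly six days. That is consistent with the expectation that this will take days, not hours.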

Plan

Make a copy

Do this before doing anything else. Any further activity may put the data at more risk.

Get a copy of the data off the system, as a precautionary measure. This process may take days, not hours.

  • Confirm that data copy is complete.
  • Further duplicate that data, especially before deleting any original data.
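One way to confirm that a copy is complete is to compare checksums of every file in the source and in the copy. The sketch below is a hypothetical illustration, not the procedure ChemIT used. The function names are invented for this example and any paths passed to them are placeholders.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root: Path) -> dict:
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(source: Path, copy: Path) -> dict:
    """Report files missing from the copy, extra in the copy, or differing."""
    src, dst = tree_checksums(source), tree_checksums(copy)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "different": sorted(k for k in set(src) & set(dst) if src[k] != dst[k]),
    }
```

Calling `compare_trees(Path(source_dir), Path(copy_dir))` returns three lists. All three are empty when the copy matches. Note that checksumming re-reads every file on the source, which adds load to an already fragile array. Comparing the first copy against the second copy may be the safer place to use this.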

Work the problem

Install the new hard drive. Reconstitute the RAID 6 array with this drive (removing one of the suspect drives).

Further analyze one of the two suspect hard drives to try to isolate the source of the data corruption. Is it the drive? Or should we instead be looking at the hardware controller?

Notes

On 10/2/13 (Wed), Matrix became unavailable.

System has 8 drives on a single hardware controller.

  • 2 (150GB) hard drives for the OS, RAID 1 ("6" and "7").
  • 6 (3 TB) hard drives for the data storage ("0" - "5").

Drives for OS: "6" and "7"

  • RAID 1
  • Status: All seems fine.

Drives for "Data": "0" - "5"

  • RAID 6
  • Status: 2-3 drives are suspect. Something wrong with hardware controller instead?!