What are the cost trade-offs of reducing server hard drive failures? What can be done to mitigate the consequences of drive failures?

See also

Can we reduce the chance of hard drives failing, and at what cost/benefit?

Ideas and our current thoughts

Invest in more reliable disk drives

More reliable disk drives cost more money. Fortunately, we tend to avoid plain consumer-grade drives for the servers and instead buy ones such as WD's "Red" (good) and "Black" (better).

  • On 3/27/13, we bought a WD "Black" 2TB for $160-170.

Buy Solid State Drives (SSDs)

Consider doing this whenever the smaller capacity, and the cost for that capacity, are both acceptable.

From Roger, re: false disk failures:

http://www.techspot.com/news/52047-what-is-false-disk-failure-and-why-is-it-a-problem.html

  • 2/100 drives fail per year (it's actually much higher for the drives ChemIT uses!!)
    You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead of not actually having a failed drive – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive – what $600 or so?
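The arithmetic in the quote above can be sketched as follows. This is a rough estimate only: the 1/100 real and 1/100 false failure rates and the $600 replacement price come from the quoted article, while the fleet size of 50 drives is a made-up number for illustration and not a ChemIT figure. The function name is hypothetical.

```python
def expected_drive_overhead(num_drives, real_rate=0.01, false_rate=0.01,
                            cost_per_replacement=600.0):
    """Estimate yearly drive replacements and their cost.

    real_rate and false_rate are the per-drive yearly probabilities of a
    real failure and of a "false" failure (a drive reported as failed that
    has not actually failed). Both kinds trigger a replacement.
    """
    real = num_drives * real_rate
    false = num_drives * false_rate
    total = real + false
    return {
        "real_failures": real,
        "false_failures": false,
        "total_replacements": total,
        "replacement_cost": total * cost_per_replacement,
    }


if __name__ == "__main__":
    # Assumption: a hypothetical fleet of 50 drives.
    estimate = expected_drive_overhead(50)
    for name, value in estimate.items():
        print(f"{name}: {value:.2f}")
```

Because ChemIT's observed failure rate is said to be much higher, substituting a measured rate for `real_rate` would give a more realistic figure. The estimate also covers only the replacement price, not rebuild time or downtime.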

Can we reduce the consequences of hard drives failing, and at what cost/benefit?

Ideas and our current thoughts

Monitoring tools

Invest in learning how to better deploy and use monitoring tools. Some tools may cost money. They may not all be relevant, but here are some buzzwords Oliver has come across:

  • Nagios
  • NetOps
  • S.M.A.R.T.
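As one possible starting point for S.M.A.R.T. monitoring, here is a minimal sketch that asks the `smartctl` tool (from the smartmontools package) for a drive's overall health verdict. It assumes smartmontools is installed, that the script runs with enough privileges to query the drive, and that the output contains one of the two health lines matched below; the helper names and the `/dev/sda` device path are illustrative only. A tool such as Nagios could run a check like this on a schedule.

```python
import subprocess


def parse_smart_health(output):
    """Return True if healthy, False if not, None if no verdict was found.

    Assumption: ATA drives report a line such as
    "SMART overall-health self-assessment test result: PASSED" and
    SCSI/SAS drives report "SMART Health Status: OK".
    """
    for line in output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.split(":", 1)[1].strip() == "PASSED"
        if "SMART Health Status" in line:
            return line.split(":", 1)[1].strip() == "OK"
    return None


def check_drive(device):
    """Run `smartctl -H` on one device and interpret its output."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return parse_smart_health(result.stdout)


if __name__ == "__main__":
    # Assumption: /dev/sda is the drive to check; adjust for the real server.
    status = check_drive("/dev/sda")
    print({True: "healthy", False: "FAILING", None: "unknown"}[status])
```

A passing verdict does not guarantee a healthy drive, so this is better treated as one early-warning signal than as a replacement for backups or RAID.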

Data from a company using over 40,000 hard drives

Excerpt:

High-Level Summary

With 40,000 hard drives, Backblaze knows a lot about the reliability of hard drives and shares the statistics: 

  • 78% of drives survive more than 4 years.
  • The median hard drive survives 6 years.
  • Drives have 3 distinct failure modes that follow a bathtub curve:
    • Early “Infant Mortality” Failure
    • Constant (Random) Failure
    • Wear Out Failure
  • As long as the temperature is within spec, reliability is not affected by heat.
  • HGST drives are generally reliable; the reliability of Seagate and Western Digital hard drives varies by model. (see above cited web page for graphic)
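To relate the survival figures above to a yearly failure rate, here is a small sketch. It assumes, purely for illustration, that drives fail at a constant rate every year, which the bathtub curve says is not really the case. The function names are hypothetical.

```python
import math


def annualized_failure_rate(survival_fraction, years):
    """Constant yearly failure rate that yields the given survival fraction."""
    return 1.0 - survival_fraction ** (1.0 / years)


def median_lifetime_years(annual_failure_rate):
    """Years until half the drives have failed, at a constant yearly rate."""
    return math.log(0.5) / math.log(1.0 - annual_failure_rate)


if __name__ == "__main__":
    # From the excerpt: 78% of drives survive more than 4 years.
    afr = annualized_failure_rate(0.78, 4)
    print(f"Implied yearly failure rate: {afr:.1%}")
    print(f"Median lifetime at that rate: {median_lifetime_years(afr):.1f} years")
```

Under the constant-rate assumption, 78% survival over 4 years works out to roughly 6% of drives failing per year, and at that rate the median lifetime would be about 11 years. The excerpt reports a 6-year median instead, which fits the wear-out phase of the bathtub curve: failures speed up as drives age, so a rate estimated from the first four years is likely too optimistic for older drives.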