What are the cost trade-offs of reducing server hard drive failures? What can be done to mitigate the consequences of drive failures?
Can we reduce the chance of hard drives failing, and at what cost/benefit?
Ideas and our current thoughts
Invest in more reliable disk drives
More reliable disk drives cost more money. Fortunately, we tend to avoid plain consumer-grade drives for the servers and invest in ones like WD's "Red" (good) and "Black" (better).
- On 3/27/13, we bought a WD "Black" 2TB for $160-170.
Buy Solid State Drives (SSDs).
Consider doing this whenever the smaller capacity is acceptable and the cost for that capacity is also acceptable.
From Roger, re: false disk failures:
http://www.techspot.com/news/52047-what-is-false-disk-failure-and-why-is-it-a-problem.html
- 2/100 drives fail per year (the rate is actually much higher for the drives ChemIT uses!)
You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead of not actually having a failed drive – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive – what $600 or so?
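The arithmetic in the quote above can be sketched as a small calculation. This is only an illustration: the function name and its parameters are hypothetical, the 1/100 rates and the $600 price are taken from the quote, and labor and downtime costs are not included.

```python
def expected_annual_drive_cost(num_drives, real_rate=0.01, false_rate=0.01,
                               cost_per_replacement=600.0):
    """Estimate yearly replacement counts and cost for a pool of drives.

    real_rate and false_rate are the per-drive yearly probabilities quoted
    in the article (assumed here, not measured on ChemIT hardware).
    """
    real_failures = num_drives * real_rate
    false_failures = num_drives * false_rate
    total = real_failures + false_failures
    return {
        "real_failures": real_failures,
        "false_failures": false_failures,
        "total_replacements": total,
        "total_cost": total * cost_per_replacement,
        # Money spent replacing drives that had not actually failed.
        "cost_of_false_failures": false_failures * cost_per_replacement,
    }
```

For example, `expected_annual_drive_cost(100)` gives about 2 replacements and about $1,200 per year, half of it spent on false failures. Raising `real_rate` to match the higher rate seen on ChemIT's drives shows how quickly the total grows.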
Can we reduce the consequences of hard drives failing, and at what cost/benefit?
Ideas and our current thoughts
Monitoring tools
Invest in learning how to better deploy and use monitoring tools. Some tools may cost money. Not all of these may be relevant, but here are some buzzwords Oliver has come across:
- Nagios
- NetOps
- S.M.A.R.T.
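As a starting point for the S.M.A.R.T. item, here is a minimal sketch of a health check that a monitoring tool such as Nagios could call. It assumes the smartmontools package is installed, that `smartctl -H` is run with sufficient privileges, and that the output contains one of the usual health lines ("SMART overall-health self-assessment test result: PASSED" for ATA drives, "SMART Health Status: OK" for SCSI drives). The function names and the `/dev/sda` default are hypothetical.

```python
import subprocess


def parse_smart_health(output):
    """Return True if healthy, False if failing, None if no health line is found."""
    for line in output.splitlines():
        lower = line.lower()
        is_health_line = ("overall-health self-assessment" in lower
                          or "smart health status" in lower)
        if is_health_line and ":" in line:
            status = line.split(":", 1)[1].strip().upper()
            return status in ("PASSED", "OK")
    return None


def check_drive(device="/dev/sda"):
    """Run `smartctl -H` on one device and interpret the result.

    Returns None when smartctl is not installed or prints no health line.
    """
    try:
        result = subprocess.run(["smartctl", "-H", device],
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return None
    return parse_smart_health(result.stdout)
```

A cron job or monitoring plugin could call `check_drive()` for each device and raise an alert on `False`. A passing result is not a guarantee: drives can fail without any prior S.M.A.R.T. warning, so this complements backups and RAID rather than replacing them.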
Data from a company using over 40,000 hard drives
https://www.backblaze.com/hard-drive.html
High-Level Summary
With 40,000 hard drives, Backblaze knows a lot about the reliability of hard drives and shares the statistics:
- 78% of drives survive more than 4 years.
- The median hard drive survives 6 years.
- Drives have 3 distinct failure modes that follow a bathtub curve:
  - Early “Infant Mortality” Failure
  - Constant (Random) Failure
  - Wear Out Failure
- As long as the temperature is within spec, reliability is not affected by heat.
- HGST drives are generally reliable; the reliability of Seagate and Western Digital hard drives varies by model. (See the web page for the graphic.)
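To compare the survival figures above with the "2/100 per year" estimate, they can be converted into rough annual failure rates. The sketch below assumes a constant failure rate over the whole period, which the bathtub curve says is not really the case, so treat the results as ballpark numbers only. The function name is hypothetical.

```python
def annualized_failure_rate(survival_fraction, years):
    """Yearly failure probability implied by a survival fraction after `years`.

    Assumes the same failure probability every year (a simplification).
    """
    return 1.0 - survival_fraction ** (1.0 / years)


# 78% of drives survive 4 years: roughly 6% fail per year.
rate_from_4yr = annualized_failure_rate(0.78, 4)

# Median life of 6 years (50% survive 6 years): roughly 11% fail per year.
rate_from_median = annualized_failure_rate(0.50, 6)

print(f"{rate_from_4yr:.1%} per year from the 4-year survival figure")
print(f"{rate_from_median:.1%} per year from the 6-year median")
```

The two estimates disagree, which is what a non-constant failure rate would produce: wear-out failures pull the long-run average up. Both are also well above 2/100 per year.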