Too many power outages in Baker Lab, and other Chem buildings!
ChemIT's record of recent power outages
Date | Outage duration | Cause | Official link | ChemIT notes
---|---|---|---|---
2/27/2014 | 2 minutes | Procedural error? | No link to info in 3/3/14 email? | Michael led our effort to initially evaluate and restore systems, with Oliver adding to documentation and to-do's. Lulu completed the cluster restoration efforts.
1/27/2014 | 17-19 minutes | ? | | Lulu, Michael, and Oliver shut down head nodes and other systems which were on UPS. (Those systems not on UPS shut down hard, as usual.)
12/23/2013 | 2 minutes | Procedural error? | | Terrible timing, right before the longest staff holiday of the year.
7/17/13 | Half a morning (~2 hours) | | |
Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?
- Because we don't know the answer to this question following any specific power outage, we are reluctant to turn back on servers right away. Instead, we like to wait ~10-25 minutes.
What would it cost to UPS our research systems?
Assuming protection for 1-3 minutes MAXIMUM:
Do all head nodes and stand-alone computers in 248 Baker Lab
- This has started getting done. About $170 per head node every ~4 years, and ~$900 per set of 4 GPU systems every ~4 years (approach to be confirmed). A rough annualized-cost sketch follows.
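A rough annualized view of those figures (a back-of-the-envelope sketch only; the per-unit prices and the ~4-year replacement cycle are the numbers quoted above, everything else is illustrative):

```python
# Back-of-the-envelope annualized UPS cost, using the figures quoted above.
REPLACEMENT_CYCLE_YEARS = 4    # UPS/battery replacement assumed every ~4 years

HEAD_NODE_UPS_COST = 170       # ~$170 per head node
GPU_SET_UPS_COST = 900         # ~$900 per set of 4 GPU systems

def annualized(cost, years=REPLACEMENT_CYCLE_YEARS):
    """Spread a one-time UPS purchase over its expected ~4-year service life."""
    return cost / years

print(f"Head node:            ~${annualized(HEAD_NODE_UPS_COST):.2f}/year")
print(f"Set of 4 GPU systems: ~${annualized(GPU_SET_UPS_COST):.2f}/year")
```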
CCB head nodes' UPS status:
Remaining UPSes to invest in
Most were done in Spring 2014, after the spate of power failures. For details on clusters and non-clusters respectively, see:
- CCB HPCs: ../../../../../../../../../../display/chemit/CCB+Clusters+Information
- CCB non-HPCs
Clusters
Cluster | Done | Not done | Notes
---|---|---|---
Loring | | X | Unique: Need to do ASAP
Abruna | | X | Unique: Need to do ASAP
Non-clusters
Stand-alone computers' UPS status:
Computer | Done | Not done | Notes
---|---|---|---
Scheraga's 4 GPU rack-mounted computational computers | | X | Need to protect?
NMR web-based scheduler | X | |
Coates: MS SQL Server | | X | Unique: Need to do ASAP
Freed: Eldor | | X |
Do all switches: Maybe ~$340 ($170*2), every ~4 years.
- Recommend: Do ASAP.
- Q: Funding?
- Other issues and concerns with actually implementing this approach:
- Rack space. Maybe power. Maybe cord lengths. What other issues?
Do all compute nodes: ~$18K initially, and perhaps ~$4.5K every ~4 years to replace batteries and deal with UPS hardware failures.
- ~20 20-amp UPSes ($900 each) required.
- Replacement batteries are ~$200 each, or ~1/4 of the UPS replacement cost.
- Estimates are simply back-of-the-envelope calculations.
- If we were to actually implement this, there may be smarter ways to do it, but the total cost will likely not be lower.
- In fact, costs may be higher, for example if a different approach brings a sufficiently greater benefit.
- Issues and concerns with actually implementing this approach:
- Costs. Rack space. Maybe power. Maybe cord lengths. What other issues?
Compute node counts, for UPS pricing estimates (head nodes not included). A sketch of the arithmetic follows the table:
- Count source: ChemIT's Computer counts with CCB clusters
Cluster | Compute node count | Power strip equivalents | Cost estimate | Notes
---|---|---|---|---
Collum | 8 | 1 | $900 |
Lancaster, with Crane (new) | 10 | 2 | $1.8K |
Hoffmann | 19 | 2 | $1.8K |
Scheraga | 91 | 13 | $11.7K |
Loring | 4 | 1 | $900 |
Abruna | 9 | 1 | $900 |
C4 head node: pilot | N/A | N/A | N/A | This CCB Community head node pilot has no compute nodes of its own.
Widom | 2 | 1 | ? | Compute nodes are hanging off of "C4" head node, above.
TOTALS | ~140? | 21 | $18K + Widom |
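A small sketch of the arithmetic behind these totals (node and UPS counts are copied from the table above; the only assumptions are that Widom's unknown cost is left out of the dollar total, matching the "$18K + Widom" row, and that battery refresh runs ~1/4 of the UPS cost, per the note earlier):

```python
# Reproduce the back-of-the-envelope totals from the table above.
# Values are (compute node count, "power strip equivalent" UPS count) per cluster.
clusters = {
    "Collum": (8, 1),
    "Lancaster, with Crane (new)": (10, 2),
    "Hoffmann": (19, 2),
    "Scheraga": (91, 13),
    "Loring": (4, 1),
    "Abruna": (9, 1),
    "Widom": (2, 1),   # cost listed as "?", so excluded from the dollar total below
}

UPS_COST = 900           # ~$900 per 20-amp UPS
BATTERY_FRACTION = 0.25  # replacement batteries ~$200 each, ~1/4 of UPS cost

total_nodes = sum(n for n, _ in clusters.values())
total_ups = sum(u for _, u in clusters.values())
costed_ups = total_ups - clusters["Widom"][1]      # table total reads "$18K + Widom"

initial_cost = costed_ups * UPS_COST               # 20 * $900 = $18,000
battery_refresh = initial_cost * BATTERY_FRACTION  # ~$4,500 every ~4 years

print(f"Compute nodes: {total_nodes} (table says ~140?)")
print(f"UPS units: {total_ups}")
print(f"Initial cost (excl. Widom): ${initial_cost:,}")
print(f"Battery refresh per ~4-year cycle: ${battery_refresh:,.0f}")
```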
Procedures and reminders
Reboot switches?
- 8 switches support the clusters (out of 12 in the room).
Start head nodes, if not on already
- Only a few are on UPS. Those can obviously be left on.
- None should be set to auto-start when power comes back on.
Confirm head nodes accessible via SSH
PuTTY on Windows
Use FMPro to get connection info?! (not the right info there, though...)
Menu => List Machine / User Selections => List Group or selected criteria
- Machine function => Server HN
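A minimal reachability sketch, assuming Python is available on a desktop that can reach the head nodes; the hostnames below are placeholders, so substitute the real head node names from FMPro:

```python
# Quick check: can we open a TCP connection to port 22 (SSH) on each head node?
import socket

HEAD_NODES = ["headnode1.example.cornell.edu", "headnode2.example.cornell.edu"]  # placeholders

def ssh_port_open(host, port=22, timeout=5):
    """Return True if the SSH port accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in HEAD_NODES:
    status = "reachable" if ssh_port_open(host) else "NOT reachable"
    print(f"{host}: SSH port 22 {status}")
```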
Start compute nodes
If nodes don't show up, consider the following (a small check sketch follows this list):
- Restart Torque scheduler on problematic nodes.
- Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
- Hook up a monitor while one of the affected nodes boots.
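A small sketch for spotting compute nodes that have not come back, assuming the Torque client tools are available on the head node (`pbsnodes -l` lists nodes reported as down or offline):

```python
# List compute nodes that Torque reports as down/offline, so we know which
# ones to chase after a power outage. Run on (or against) the cluster's head node.
import subprocess

def down_nodes():
    """Return node names that `pbsnodes -l` reports as down or offline."""
    out = subprocess.run(["pbsnodes", "-l"], capture_output=True, text=True, check=True)
    # Each output line looks like: "node07    down" (name, whitespace, state).
    return [line.split()[0] for line in out.stdout.splitlines() if line.strip()]

missing = down_nodes()
if missing:
    print("Nodes not up yet:", ", ".join(missing))
    print("Restart Torque on these nodes, or reboot the switch they share (per the notes above).")
else:
    print("All compute nodes report up.")
```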