Too many power outages in Baker Lab, and other Chem buildings!
Date | Outage duration | Cause | Official link | ChemIT notes |
---|---|---|---|---|
2/27/2014 | 2 minutes | ? | No official info on when the power outage occurred, or its duration. | Michael led our initial effort to evaluate and restore systems, with Oliver adding to documentation and to-do's. Lulu completed the cluster restoration efforts. |
1/27/2014 | 17-19 minutes | ? | Lulu, Michael, and Oliver shut down headnodes and other systems which were on UPS. (Systems not on UPS shut down hard, per usual.) | |
12/23/2013 | 2 minutes | Human error? | Terrible timing, right before the longest staff holiday of the year. | |
7/17/2013 | Half a morning (~2 hours) | | | |

Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?
- Because we can't know the answer to this question right after any given power outage, we are reluctant to turn servers back on right away. Instead, we like to wait ~10-25 minutes.
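
The hold-off rule above can be sketched as a small check. This is only an illustration: the function names and the 10-minute default (the low end of the ~10-25 minute range) are assumptions, not an existing ChemIT tool.

```python
from datetime import datetime, timedelta

# Assumption: default hold-off is the low end of the ~10-25 minute guideline.
DEFAULT_HOLD_OFF = timedelta(minutes=10)

def ok_to_power_on(power_restored_at, now, hold_off=DEFAULT_HOLD_OFF):
    """Return True once power has been back for at least `hold_off`."""
    return now - power_restored_at >= hold_off

def minutes_remaining(power_restored_at, now, hold_off=DEFAULT_HOLD_OFF):
    """Return how many minutes are left to wait (0.0 if the wait is over)."""
    remaining = hold_off - (now - power_restored_at)
    return max(0.0, remaining.total_seconds() / 60)
```

If power drops again during the wait, the clock would restart from the new restoration time.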
Reboot switches
- 8 switches support the clusters (out of 12 in the room).
Start headnodes, if not on already
- Only a few are on UPS. Those can obviously be left on.
- None should be set to auto-start when power returns after an outage.
Confirm headnodes accessible via SSH
PuTTY on Windows
Use FMPro to get connection info?! (not the right info there, though...)
Menu => List Machine / User Selections => List Group or selected criteria
- Machine function => Server HN
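
As an alternative to opening a PuTTY session to each headnode by hand, a short script could confirm that each one accepts connections on the SSH port. This is a minimal sketch: it only tests that TCP port 22 is open (it does not log in), and the helper names are hypothetical.

```python
import socket

def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

def check_headnodes(hosts, probe=ssh_port_open):
    """Map each headnode name to True (port reachable) or False."""
    return {host: probe(host) for host in hosts}
```

Calling `check_headnodes([...])` with the real headnode names (not listed here) returns a dictionary that can be scanned for `False` entries before moving on to the compute nodes.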
Start compute nodes
If nodes don't show up, consider:
- Restart Torque scheduler on problematic nodes.
- Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
- Hook up a monitor as one of the high nodes boots.
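
To decide between the first two options, it can help to group the missing nodes by switch. The sketch below assumes a text listing with one node name and state per line (for example, what Torque's `pbsnodes -l` prints) and a hand-maintained node-to-switch mapping. The mapping, the node and switch names, and the function names are all hypothetical.

```python
from collections import defaultdict

UNKNOWN = "unknown"

def parse_down_nodes(listing):
    """Extract node names from text with one 'name state' pair per line."""
    nodes = []
    for line in listing.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            nodes.append(parts[0])
    return nodes

def group_by_switch(nodes, node_to_switch):
    """Group node names by the switch they are attached to."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node_to_switch.get(node, UNKNOWN)].append(node)
    return dict(groups)

def suggest_action(groups):
    """Suggest a next step based on how the problem nodes are distributed."""
    if not groups:
        return "No problematic nodes listed."
    known = [switch for switch in groups if switch != UNKNOWN]
    if len(groups) == 1 and known:
        return f"All problematic nodes are on {known[0]}: consider rebooting that switch."
    return ("Problematic nodes are not confined to one known switch: "
            "consider restarting the Torque scheduler on them first.")
```

For example, if every node in the listing maps to the same switch, `suggest_action` points at that switch; otherwise it falls back to the per-node scheduler restart.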