Excerpt |
---|
...
In summer 2013 and winter 2014, there was an inordinate number of power outages in Baker Lab and other Chem buildings! |
Date | Outage duration | Cause | Official link | ChemIT notes |
---|---|---|---|---|
1/27/2014 | 17-19 minutes | ? | | Lulu, Michael, and Oliver shut down head nodes and other systems that were on UPS. (Systems not on UPS shut down hard, as usual.) |
12/23/2013 | 2 minutes | Human error? | http://www.it.cornell.edu/services/alert.cfm?id=2982 | Terrible timing, right before the longest staff holiday of the year. |
See also
- UPS inventory and status, mostly within ChemIT's Baker 248 server room
- ChemIT's record of recent power outages
Question
When power is first restored, can we trust it? Or might it cut out again in some circumstances?
- Because we don't know the answer to this question after any specific power outage, we are reluctant to turn servers back on right away. Instead, we like to wait ~10-25 minutes.
...
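The waiting rule above can be sketched as a small helper. This is only an illustration, assuming the clock restarts whenever power drops again; the function and constant names are hypothetical, and the 10-minute minimum is simply the low end of the "~10-25 minutes" range mentioned above.

```python
from datetime import datetime, timedelta

# Assumption: low end of the "~10-25 minutes" waiting range.
MIN_STABLE = timedelta(minutes=10)


def stable_since(restore_times):
    """Return the most recent moment power came back.

    restore_times lists every time power was restored during the
    incident, so a second drop restarts the waiting clock.
    """
    return max(restore_times)


def ok_to_restart(restore_times, now, min_stable=MIN_STABLE):
    """True once power has stayed up for at least min_stable."""
    return now - stable_since(restore_times) >= min_stable
```

For example, with a (made-up) restore at 09:00 and a second restore at 09:04 after a flicker, `ok_to_restart` stays false until 09:14.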
Also consider
- When do we communicate with others, what gets communicated, and for what purpose?
Procedures and reminders
Reboot switches?
- 8 switches support the clusters (out of 12 in the room).
Start
...
head nodes, if not on already
- Only a few are on UPS. Those can obviously be left on.
- None should be set to auto-start when power returns after an outage.
Confirm
...
head nodes accessible via SSH
PuTTY on Windows
Use FMPro to get connection info?! (not the right info there, though...)
Menu => List Machine / User Selections => List Group or selected criteria
- Machine function => Server HN
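As an alternative to opening a PuTTY session per machine, the "accessible via SSH" check could be scripted. The sketch below only tests whether TCP port 22 accepts a connection; it does not prove that logins work. The function names are hypothetical, and the host list would have to come from wherever the head node names are actually recorded.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_head_nodes(hosts, probe=ssh_port_open):
    """Map each host name to True (reachable) or False (not reachable)."""
    return {host: probe(host) for host in hosts}
```

Calling `check_head_nodes(["headnode1.example.edu", "headnode2.example.edu"])` (placeholder names) returns a dictionary that can be scanned for `False` entries before moving on to the compute nodes.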
Start compute nodes
If nodes don't show up, consider:
...