Too many power outages in Baker Lab!

Date	Outage duration	Cause	Official link	ChemIT notes

1/27/2014	17-19 minutes. CU's record: power out at 2:22, restored around 2:41. Oliver's record: power out at 2:22, restored at 2:39.	?	http://www.it.cornell.edu/services/alert.cfm?id=3040	Lulu, Michael, and Oliver shut down headnodes and other systems that were on UPS. (Systems not on UPS shut down hard, as usual.) Lost 3 hours for Lulu, Michael, and Oliver. Roger away on vacation (out of the U.S.).
12/24/2013	Seconds to minutes?	Human error?		Terrible timing, right before the longest staff holiday of the year. Lost a day for Roger and Oliver. Michael Hint and Lulu away on vacation (out of the U.S.).

Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?

Because we can't answer this question after any given power outage, we are reluctant to turn servers back on right away. Instead, we prefer to wait ~10-25 minutes after power is restored.
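The waiting rule above can be written down as a small check. This is only a sketch under assumptions: the event-log format, the function names, and the choice of 25 minutes (the upper end of the range) are all hypothetical, not something ChemIT actually runs. The idea is that the wait is measured from the most recent restoration, so if power drops again, the clock starts over.

```python
from datetime import datetime, timedelta

# Assumption: use the upper end of the ~10-25 minute range as the hold-off.
HOLDOFF = timedelta(minutes=25)

def stable_since(events):
    """Return when the current stretch of power began, or None if power is off.

    events: chronological list of (timestamp, is_on) tuples.
    This log format is hypothetical, for illustration only.
    """
    since = None
    for when, is_on in events:
        if is_on:
            if since is None:
                since = when  # first "on" after an outage starts the clock
        else:
            since = None      # any new outage resets the clock
    return since

def safe_to_restart(events, now, holdoff=HOLDOFF):
    """True once power has stayed on for at least the hold-off period."""
    since = stable_since(events)
    return since is not None and now - since >= holdoff
```

For example, using CU's times for the 1/27/2014 outage (the page does not say AM or PM, so the hour values are placeholders):

```python
events = [
    (datetime(2014, 1, 27, 2, 22), False),  # power out
    (datetime(2014, 1, 27, 2, 41), True),   # power restored
]
safe_to_restart(events, datetime(2014, 1, 27, 2, 50))  # False: only 9 minutes of power
safe_to_restart(events, datetime(2014, 1, 27, 3, 6))   # True: 25 minutes of power
```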

Reboot switches

Start headnodes, if not on already

Menu => List Machine / User Selections => List Group or selected criteria

If nodes don't show up, consider:

Restart Torque scheduler on problematic nodes.
Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
Hook up a monitor and watch as one of the high nodes boots.
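The switch check in the list above can be sketched as a quick grouping step: if every missing node hangs off the same switch, that switch is the first thing to reboot. This is a minimal illustration only. The node names, switch names, and the node-to-switch mapping are hypothetical and would have to be maintained by hand or taken from your own inventory.

```python
from collections import defaultdict

def group_by_switch(missing_nodes, node_to_switch):
    """Group missing nodes by the switch each one is connected to.

    node_to_switch is a hypothetical inventory mapping (node name -> switch name).
    Nodes absent from the mapping are filed under "unknown".
    """
    groups = defaultdict(list)
    for node in missing_nodes:
        groups[node_to_switch.get(node, "unknown")].append(node)
    return dict(groups)

def suspect_switch(missing_nodes, node_to_switch):
    """Return the one switch shared by all missing nodes, or None."""
    groups = group_by_switch(missing_nodes, node_to_switch)
    if len(groups) == 1 and "unknown" not in groups:
        return next(iter(groups))
    return None
```

A usage example with made-up names:

```python
inventory = {"node01": "switchA", "node02": "switchA", "node03": "switchB"}
suspect_switch(["node01", "node02"], inventory)  # "switchA": reboot that switch first
suspect_switch(["node01", "node03"], inventory)  # None: look at the nodes individually
```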