In summer 2013 and winter 2014, there were an inordinate number of power outages in Baker Lab and other Chem buildings!

Question

When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?

  • Because we don't know the answer to this question after any specific power outage, we are reluctant to turn servers back on right away. Instead, we like to wait ~10-25 minutes.

 


Also consider

  • When do we communicate with others, what gets communicated, and for what purpose?

Procedures and reminders

Reboot switches?

  • 8 switches support the clusters (out of 12 in the room).

Start head nodes, if not on already

  • Only a few are on UPS. Those can obviously be left on.
  • None should be set to auto-start after a power loss.

Confirm head nodes accessible via SSH

PuTTY on Windows

Use FMPro to get connection info?! (not the right info there, though...)

Menu => List Machine / User Selections => List Group or selected criteria

  • Machine function => Server HN
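
As an alternative to opening a PuTTY session to each head node in turn, a short script can confirm that each one is at least accepting connections on the SSH port. The following is only a minimal sketch: the host names are hypothetical placeholders, port 22 is assumed, and a successful TCP connection shows only that something is listening there, not that logins work.

```python
import socket


def ssh_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_head_nodes(hosts, port=22, timeout=3.0):
    """Map each host name to True/False depending on SSH port reachability."""
    return {host: ssh_reachable(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    # Hypothetical head node names; replace with the real list.
    head_nodes = ["headnode1.example.edu", "headnode2.example.edu"]
    for host, ok in check_head_nodes(head_nodes).items():
        print(f"{host}: {'reachable' if ok else 'NOT reachable'}")
```

Any head node reported as not reachable is a candidate for a console check before the compute nodes are started.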

Start compute nodes

If nodes don't show up, consider:

  • Restart Torque scheduler on problematic nodes.
  • Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
  • Hook up a monitor as one of the head nodes boots.
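
To judge whether the missing nodes are grouped on a single switch, it can help to tally them per switch. The sketch below assumes the list of down nodes comes from text in the style of Torque's `pbsnodes -l` output (one node name followed by its state per line), and that a node-to-switch mapping is available. The mapping and all names shown are hypothetical, since this page does not record which nodes hang off which switch.

```python
from collections import Counter


def parse_down_nodes(pbsnodes_output):
    """Extract node names from `pbsnodes -l` style text (name, then state)."""
    nodes = []
    for line in pbsnodes_output.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            nodes.append(parts[0])
    return nodes


def down_nodes_per_switch(down_nodes, node_to_switch):
    """Count down nodes per switch; nodes missing from the map go to 'unknown'."""
    return Counter(node_to_switch.get(node, "unknown") for node in down_nodes)


if __name__ == "__main__":
    # Hypothetical sample output and hypothetical node-to-switch mapping.
    sample = "node01   down\nnode02   down,offline\nnode07   down\n"
    mapping = {"node01": "switch-A", "node02": "switch-A", "node07": "switch-B"}
    counts = down_nodes_per_switch(parse_down_nodes(sample), mapping)
    for switch, count in counts.most_common():
        print(f"{switch}: {count} node(s) down")
```

If one switch accounts for most of the down nodes, rebooting that switch is a reasonable first step before restarting services on individual nodes.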