Too many power outages in Baker Lab!

Date: 1/27/2014
Outage duration: 17-19 minutes
  • CU's record: Power outage at 2:22. Restored around 2:41.
  • Oliver's record: Power outage at 2:22. Restored at 2:39.
Cause: ?
Official link: http://www.it.cornell.edu/services/alert.cfm?id=3040
ChemIT notes:
  • Lulu, Michael, and Oliver shut down headnodes and other systems which were on UPS. (Systems not on UPS shut down hard, as usual.)
  • Lost 3 hours, for Lulu, Michael, and Oliver.
  • Roger away on vacation (out of the U.S.).

Date: 12/24/2013
Outage duration: Seconds to minutes?
Cause: Human error?
Official link: (none listed)
ChemIT notes:
  • Terrible timing, right before the longest staff holiday of the year.
  • Lost a day, for Roger and Oliver.
  • Michael Hint and Lulu away on vacation (out of the U.S.).

Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?

  • Because we don't know the answer to this question after any given power outage, we are reluctant to turn servers back on right away. Instead, we prefer to wait about 10-25 minutes.
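The waiting period above is simple clock arithmetic. A minimal sketch follows, assuming the restore time is known to the minute. The function name `restart_window` and its parameters are hypothetical, and the example treats the 2:41 restore time from 1/27/2014 as a 24-hour clock value, which the outage record does not actually specify.

```python
from datetime import datetime, timedelta


def restart_window(restored_at, min_wait=10, max_wait=25):
    """Return (earliest, latest) times to begin powering servers back on.

    min_wait and max_wait are minutes, matching the ~10-25 minute
    waiting period described above.
    """
    earliest = restored_at + timedelta(minutes=min_wait)
    latest = restored_at + timedelta(minutes=max_wait)
    return earliest, latest


# Example: CU's record says power was restored around 2:41 on 1/27/2014.
# Whether that was AM or PM is not recorded; 02:41 is an assumption here.
restored = datetime(2014, 1, 27, 2, 41)
earliest, latest = restart_window(restored)
print(f"Start bringing systems up between {earliest:%H:%M} and {latest:%H:%M}")
```

The two bounds are kept as parameters so the window can be widened if the power still looks unstable.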

Reboot switches

  • 8 switches support the clusters (out of 12 in the room).

Start headnodes, if not on already

  • Only a few are on UPS. Those can obviously be left on.
  • None should be set to auto-start when power is restored after an outage.

Confirm headnodes accessible via SSH

PuTTY on Windows

Use FMPro to get connection info?! (not the right info there, though...)

Menu => List Machine / User Selections => List Group or selected criteria

  • Machine function => Server HN
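As a quick alternative to opening a PuTTY session for each headnode, the SSH check could be scripted. The sketch below only tests whether TCP port 22 accepts a connection; it does not log in, so a passing check does not prove that the node is fully usable. The hostnames in `HEADNODES` are placeholders, not the real headnode names, and the function names are hypothetical.

```python
import socket

# Placeholder hostnames (assumption): replace with the real headnode names.
HEADNODES = ["headnode1.example.edu", "headnode2.example.edu"]


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_headnodes(hosts, probe=ssh_port_open):
    """Map each hostname to True (port reachable) or False (not reachable)."""
    return {host: probe(host) for host in hosts}
```

Calling `check_headnodes(HEADNODES)` returns a dictionary such as `{"headnode1.example.edu": True, ...}`. Any headnode that maps to `False` is a candidate for a manual look. The `probe` argument exists so the check can be swapped out, for example in a dry run.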

Start compute nodes

If nodes don't show up, consider:

  • Restart Torque scheduler on problematic nodes.
  • Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are all connected to a single switch.
  • Hook up a monitor as one of the problematic nodes boots.
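To judge whether the problematic nodes cluster on one switch, the node list from Torque can be grouped by switch. The sketch below assumes `pbsnodes -l` prints one node per line, with the node name followed by its state. The node-to-switch mapping is hypothetical and would have to be maintained by hand, and all node and switch names in the example are made up.

```python
import subprocess
from collections import defaultdict


def list_problem_nodes():
    """Run `pbsnodes -l` (needs the Torque client tools) and return its text."""
    result = subprocess.run(
        ["pbsnodes", "-l"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pbsnodes_list(output):
    """Parse `pbsnodes -l` text into {node_name: state}."""
    nodes = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            nodes[parts[0]] = parts[1]
    return nodes


def group_by_switch(problem_nodes, node_to_switch):
    """Group problem node names by switch; unmapped nodes go under 'unknown'."""
    groups = defaultdict(list)
    for node in problem_nodes:
        groups[node_to_switch.get(node, "unknown")].append(node)
    return dict(groups)


# Illustrative data only: these node and switch names are invented.
sample_output = "node01    down\nnode02    down\nnode09    offline\n"
node_to_switch = {"node01": "switch-a", "node02": "switch-a"}
print(group_by_switch(parse_pbsnodes_list(sample_output), node_to_switch))
```

If most of the down nodes fall under one switch, rebooting that switch is a reasonable first step before looking at individual nodes.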