
Too many power outages in Baker Lab, and other Chem buildings!

Each outage below lists: Date, Outage duration, Cause, Official link, ChemIT notes.

2/27/2014 (Thursday)

  • Outage duration: 2 minutes. Oliver's record: power outage at 9:53p, restored at 9:55p.
  • Cause: ?
  • Official link: No official info on when the power outage occurred, or its duration.
    Some delayed time-stamps, from the IT folks:
    http://www.it.cornell.edu/services/alert.cfm?id=3072
    Very little timing info, from the power folks:
    http://www.cornell.edu/cuinfo/specialconditions/#2050
  • ChemIT notes: Michael led our effort to initially evaluate and restore systems, with Oliver adding to documentation and to-do's. Lulu completed the cluster restoration efforts. Lost 1-2 hours each for Michael, Lulu, and Oliver. Cornell called it a "power blip". In Oliver's book, any outage longer than seconds is not a "blip".

1/27/2014 (Monday)

  • Outage duration: 17-19 minutes. CU's record: power outage at 2:22p, restored around 2:41p. Oliver's record: power outage at 2:22p, restored at 2:39p.
  • Cause: ?
  • Official link:
    http://www.it.cornell.edu/services/alert.cfm?id=3040
  • ChemIT notes: Lulu, Michael, and Oliver shut down headnodes and other systems which were on UPS. (Systems not on UPS shut down hard, as usual.) Lost 3 hours, for Lulu, Michael, and Oliver. Roger away on vacation (out of the U.S.).

12/23/2013 (Monday)

  • Outage duration: 2 minutes. CU's report: 8:36 AM - 8:38 AM. (ChemIT staff not in yet.)
  • Cause: Human error?
  • Official link:
    http://www.it.cornell.edu/services/alert.cfm?id=2982
  • ChemIT notes: Terrible timing, right before the longest staff holiday of the year. ChemIT staff not present during failure. Lost most of the day, for Roger and Oliver. Michael Hint and Lulu away on vacation (out of the U.S.).

7/17/13

  • Outage duration: Half a morning (~2 hours). CU's report: 8:45 AM - 10:45 AM.
  • Cause: (none recorded)
  • Official link:
    http://www.it.cornell.edu/services/alert.cfm?id=2711
  • ChemIT notes: (none recorded)

Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?

  • Because we don't know the answer to this question after any given power outage, we are reluctant to turn servers back on right away. Instead, we like to wait ~10-25 minutes.
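The waiting rule above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and the 10- and 25-minute bounds are taken directly from the "~10-25 minutes" guideline rather than from any measured behavior of the building power.

```python
from datetime import datetime, timedelta

# Bounds taken from the "~10-25 minutes" guideline above (approximate by nature).
MIN_WAIT = timedelta(minutes=10)
MAX_WAIT = timedelta(minutes=25)


def restart_window(power_restored):
    """Return (earliest, latest) times of the suggested wait window."""
    return power_restored + MIN_WAIT, power_restored + MAX_WAIT


def ok_to_restart(power_restored, now):
    """True once at least the minimum wait has passed since power returned."""
    earliest, _ = restart_window(power_restored)
    return now >= earliest


if __name__ == "__main__":
    # Example: the 2/27/2014 outage, with power restored at 9:55p.
    restored = datetime(2014, 2, 27, 21, 55)
    earliest, latest = restart_window(restored)
    print("Start servers between", earliest.time(), "and", latest.time())
```

For the 2/27/2014 outage, this suggests starting servers no earlier than 10:05p.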

Reboot switches

  • 8 switches support the clusters (out of 12 in the room).

Start headnodes, if not on already

  • Only a few are on UPS. Those can obviously be left on.
  • None should be set to auto-start when power returns after an outage.

Confirm headnodes accessible via SSH

PuTTY on Windows

Use FMPro to get connection info?! (not the right info there, though...)

Menu => List Machine / User Selections => List Group or selected criteria

  • Machine function => Server HN
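As an alternative to opening a PuTTY session to each headnode by hand, the "accessible via SSH" check can be roughly approximated by testing whether each headnode accepts a TCP connection on the SSH port. This is a minimal sketch: the hostnames below are placeholders (not the real headnode names), port 22 is assumed, and an open port only shows that something is listening, not that logins work.

```python
import socket

# Placeholder hostnames; substitute the real headnode names.
HEADNODES = ["headnode1.example.edu", "headnode2.example.edu"]


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    for host in HEADNODES:
        status = "reachable" if ssh_port_open(host) else "NOT reachable"
        print(f"{host}: SSH port {status}")
```

Any headnode reported as not reachable still needs a manual look, for example at the console.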

Start compute nodes

If nodes don't show up, consider:

  • Restart Torque scheduler on problematic nodes.
  • Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
  • Hook up a monitor as one of the high nodes boots.
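To decide whether a switch reboot is worth trying, it can help to group the missing nodes by the switch they hang off. The sketch below assumes a node-to-switch mapping is available; the node and switch names are made up for illustration, and no such mapping is documented on this page.

```python
from collections import defaultdict


def group_by_switch(missing_nodes, node_to_switch):
    """Group missing nodes by switch; unmapped nodes go under 'unknown'."""
    groups = defaultdict(list)
    for node in missing_nodes:
        groups[node_to_switch.get(node, "unknown")].append(node)
    return dict(groups)


def suspect_switch(missing_nodes, node_to_switch):
    """Return the switch name if every missing node shares one known switch."""
    groups = group_by_switch(missing_nodes, node_to_switch)
    if len(groups) == 1:
        (switch,) = groups
        return None if switch == "unknown" else switch
    return None


if __name__ == "__main__":
    # Hypothetical mapping, for illustration only.
    mapping = {"node01": "sw-a", "node02": "sw-a", "node03": "sw-b"}
    missing = ["node01", "node02"]
    switch = suspect_switch(missing, mapping)
    if switch:
        print(f"All missing nodes are on {switch}; try rebooting that switch.")
    else:
        print("Missing nodes span several switches:", group_by_switch(missing, mapping))
```

If the missing nodes are spread across several switches, restarting the scheduler on those nodes or watching one boot on a monitor is probably the better first step.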