...

...

Date

...

Outage duration

...

Cause

...

Official link

...

ChemIT notes

...

2/27/2014
Thursday

...

...

...

No link to info in 3/3/14 email?
Some delayed time-stamps, from the IT folks:
http://www.it.cornell.edu/services/alert.cfm?id=3072
Initial timing info, from the power folks:
http://www.cornell.edu/cuinfo/specialconditions/#2050

...

Michael led our initial effort to evaluate and restore systems, with Oliver adding to the documentation and to-dos. Lulu completed the cluster restoration efforts.
Lost 1-2 hours each for Michael, Lulu, and Oliver.
Cornell called it a "power blip". In Oliver's book, any outage longer than a few seconds is not a "blip".
Q: Broke one GPU workstation?

...

1/27/2014
Monday

...

...

?

...

http://www.it.cornell.edu/services/alert.cfm?id=3040

...

...

...

2 minutes
CU's report: 8:36 AM - 8:38 AM
(ChemIT staff not in yet.)

...

Procedural error?

...

http://www.it.cornell.edu/services/alert.cfm?id=2982

...

Terrible timing, right before the longest staff holiday of the year.
ChemIT staff not present during failure.
Lost most of the day, for Roger and Oliver.
Michael Hint and Lulu away on vacation (out of the U.S.).

Question

...

7/17/13

...

Half a morning (~2 hours)
CU's report: 8:45 AM - 10:45 AM

...

 

...

http://www.it.cornell.edu/services/alert.cfm?id=2711

...

 

Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?

  • Because we don't know the answer to this question following any specific power outage, we are reluctant to turn servers back on right away. Instead, we prefer to wait ~10-25 minutes after power is restored.
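
One way to make the waiting rule above concrete is sketched below. This is only an illustration, not an existing ChemIT tool: the function name `earliest_restart`, the 15-minute default (picked from within the ~10-25 minute range), and the event-list format are all assumptions. The idea is that the hold-off clock restarts whenever power drops again, so servers come back only after an uninterrupted stretch of power.

```python
from datetime import datetime, timedelta

# Assumed hold-off; the notes suggest waiting roughly 10-25 minutes.
HOLD_OFF = timedelta(minutes=15)


def earliest_restart(power_events, hold_off=HOLD_OFF):
    """Return the earliest time to power servers back on.

    power_events is a chronological list of (timestamp, is_on) tuples
    (a hypothetical format). Returns None if there are no events or
    if power is currently off.
    """
    if not power_events or not power_events[-1][1]:
        return None
    # Walk backwards to find where the current uninterrupted
    # "power on" streak began.
    stable_since = power_events[-1][0]
    for timestamp, is_on in reversed(power_events):
        if not is_on:
            break
        stable_since = timestamp
    return stable_since + hold_off


# Illustration using the 7/17/13 reported times (8:45 AM - 10:45 AM).
outage = [
    (datetime(2013, 7, 17, 8, 45), False),
    (datetime(2013, 7, 17, 10, 45), True),
]
print(earliest_restart(outage))

# Hypothetical flicker: power drops again at 10:50 and returns at 10:52,
# so the hold-off is counted from 10:52 instead of 10:45.
flicker = outage + [
    (datetime(2013, 7, 17, 10, 50), False),
    (datetime(2013, 7, 17, 10, 52), True),
]
print(earliest_restart(flicker))
```

Counting from the start of the latest stable stretch, rather than from the first restoration, addresses the "might it simply kick back off" concern directly.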

  

...

Also consider

  • When do we communicate with others, what gets communicated, and for what purpose?

Procedures and reminders

Reboot switches?

...