...

Date

Outage duration

Cause

Official link

ChemIT notes

2/27/2014
Thursday

2 minutes
CU's record: Per 3/3/14 email: On Thursday 2/27/14 at 9:50am the campus experienced a disruption of electricity.  The whole campus experienced an approximately 30 second outage.
Oliver's record: Power outage at 9:53a. Restored at 9:55a (more than one minute).

Procedural error?
Per 3/3/14 email: The outage was caused as a result of routine maintenance activities which were being conducted at the Campus' main substation which takes power from the NYSEG transmission system and provides it to campus.  This work has been conducted many times before without incident but in this case caused a major disruption of electricity supply.  Staff from Utilities and external technical resources are investigating the root cause of this unexpected event.

No link to info in 3/3/14 email?
Some delayed time-stamps, from the IT folks:
http://www.it.cornell.edu/services/alert.cfm?id=3072
Initial timing info, from the power folks:
http://www.cornell.edu/cuinfo/specialconditions/#2050

Michael led our initial effort to evaluate and restore systems, with Oliver adding to documentation and to-dos. Lulu completed the cluster restoration efforts.
Lost 1-2 hours each for Michael, Lulu, and Oliver.
Cornell called it a "power blip". In Oliver's book, any outage longer than a few seconds is not a "blip".
Q: Broke one GPU workstation?

1/27/2014
Monday

17-19 minutes
CU's record: Power outage at 2:22p. Restored around 2:41p.
Oliver's record: Power outage at 2:22p. Restored at 2:39p.

?

http://www.it.cornell.edu/services/alert.cfm?id=3040

Lulu, Michael, and Oliver shut down head nodes and other systems that were on UPS. (Systems not on UPS shut down hard, as usual.)
Lost 3 hours, for Lulu, Michael, and Oliver.
Roger away on vacation (out of the U.S.)

12/23/2013
Monday

2 minutes
CU's report: 08:36 AM - 8:38 AM
(ChemIT staff not in yet.)

Procedural error?

http://www.it.cornell.edu/services/alert.cfm?id=2982

Terrible timing, right before the longest staff holiday of the year.
ChemIT staff not present during failure.
Lost most of the day, for Roger and Oliver.
Michael Hint and Lulu away on vacation (out of the U.S.)

7/17/13

Half a morning (~2 hours)
CU's report: 8:45 AM - 10:45 AM

 

http://www.it.cornell.edu/services/alert.cfm?id=2711

 

...

Assuming protection for 1-3 minutes MAXIMUM:

Do all head nodes and stand-alone computers in 248 Baker Lab

  • In progress. About $170 per head node every ~4 years.
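As a rough check on the 1-3 minute assumption, the sketch below compares the outage durations recorded in the outage table with a UPS that holds for at most 3 minutes. The durations come from that table, using the longer figure where CU's and Oliver's records disagree. The names `OUTAGES_MIN`, `UPS_MAX_MIN`, and `rides_through` are illustrative only and not part of any ChemIT tooling.

```python
# Outage durations in minutes, taken from the outage table
# (longer figure used where CU's and Oliver's records differ).
OUTAGES_MIN = {
    "2/27/2014": 2,
    "1/27/2014": 19,
    "12/23/2013": 2,
    "7/17/13": 120,
}

UPS_MAX_MIN = 3  # assumed upper bound of UPS protection (1-3 minutes)


def rides_through(duration_min, ups_min=UPS_MAX_MIN):
    """Return True if a UPS lasting ups_min minutes covers the outage."""
    return duration_min <= ups_min


covered = [d for d, m in OUTAGES_MIN.items() if rides_through(m)]
not_covered = [d for d, m in OUTAGES_MIN.items() if not rides_through(m)]

print("Covered by UPS:", covered)
print("Needs shutdown or longer runtime:", not_covered)
```

Under these assumptions, only the two 2-minute outages fall inside the protection window, and only if the UPS actually delivers the upper end of the 1-3 minute range. The 17-19 minute and ~2 hour outages would still require a shutdown.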

CCB head nodes' UPS status:

| Cluster | Done | Not done | Notes |
|---|---|---|---|
| Collum | X, Spring'14 | | |
| Lancaster, with Crane (new) | X, Spring'14 | | Funded by Crane. |
| Hoffmann | X, Spring'14 | | |
| Scheraga | X, Fall'14 | | See below chart for 4 stand-alone computational computers |
| Loring | | X | Unique: Need to do ASAP |
| Abruna | | X | Unique: Need to do ASAP |
| C4 head node: pilot (Widom's 2 nodes there.) | X, Old UPS | | Provisioned on the margin, since still a pilot. (Not funded by Widom.) |
| Widom | | X | See "C4", above |

...

| Cluster | Compute node count | Power strip equivalents (~8/strip MAX) | Cost estimate, every 4 years | Notes |
|---|---|---|---|---|
| Collum | 8? | 1 | $900 | |
| Lancaster, with Crane (new) | 10 12? | 2 | $1.8K | |
| Hoffmann | 19 14? | 2 | $1.8K | |
| Scheraga | 91 92? | 13 | $11.7K | |
| Loring | 4 6? | 1 | $900 | |
| Abruna | 6? 9 | 1 | $900 | |
| C4 head node: pilot (Widom's 2 nodes there.) | N/A | N/A | N/A | This CCB Community head node pilot has no compute nodes of its own. |
| Widom | 2 | 1 | ? | Compute nodes are hanging off of "C4" head node, above. |
| TOTALS | ~140? | 21 | $18K + Widom | |
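The totals row can be reproduced with a few lines of arithmetic. This is a sketch only: the $900 per power strip equivalent is inferred from the rows above ($900 for 1, $1.8K for 2, $11.7K for 13) and is not stated anywhere as a price. The strip counts are copied from the table as-is, not derived from the uncertain node counts. The names `COST_PER_STRIP`, `STRIPS`, and `UNPRICED` are illustrative.

```python
COST_PER_STRIP = 900  # dollars per strip every 4 years, inferred from the table

# Power strip equivalents per cluster, copied from the table.
STRIPS = {
    "Collum": 1,
    "Lancaster, with Crane": 2,
    "Hoffmann": 2,
    "Scheraga": 13,
    "Loring": 1,
    "Abruna": 1,
    "Widom": 1,  # cost is "?" in the table, so excluded from the dollar total
}

UNPRICED = {"Widom"}

total_strips = sum(STRIPS.values())
priced_cost = sum(
    n * COST_PER_STRIP for cluster, n in STRIPS.items() if cluster not in UNPRICED
)
per_year = priced_cost / 4  # spread evenly over the 4-year replacement cycle

print(f"Total strip equivalents: {total_strips}")
print(f"4-year cost excluding Widom: ${priced_cost:,}")
print(f"Approximate cost per year: ${per_year:,.0f}")
```

With these inputs the script gives 21 strip equivalents and $18,000 over four years before Widom's share, matching the TOTALS row, or about $4,500 per year.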

 

...

  • 8 switches support the clusters (out of 12 in the room).

Start head nodes, if not already on

  • Only a few are on UPS. Those can obviously be left on.
  • None should be set to auto-start on power-off.

Confirm head nodes are accessible via SSH

PuTTY on Windows
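Before opening PuTTY sessions one by one, a quick scripted pass can show which head nodes are answering on the SSH port at all. The sketch below is one possible approach, not an existing ChemIT script. The host names in `HEAD_NODES` are hypothetical placeholders, and port 22 is assumed to be the SSH port.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical head node names; replace with the real ones.
HEAD_NODES = ["headnode1.example.edu", "headnode2.example.edu"]

if __name__ == "__main__":
    for host in HEAD_NODES:
        status = "reachable" if ssh_port_open(host) else "NOT reachable"
        print(f"{host}: SSH port {status}")
```

This only confirms that the port accepts TCP connections. It does not verify that logins work or that cluster services are up, so an actual SSH login is still the real test.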

...