
Too many power outages in Baker Lab, and other Chem buildings!

ChemIT's record of recent power outages

2/27/2014 (Thursday)

  • Outage duration: 2 minutes. CU's record: power outage at 9:50a. Oliver's record: power outage at 9:53a, restored at 9:55a.
  • Cause: ?
  • Official links:
    Some delayed time-stamps, from the IT folks: http://www.it.cornell.edu/services/alert.cfm?id=3072
    Initial timing info, from the power folks: http://www.cornell.edu/cuinfo/specialconditions/#2050
  • ChemIT notes: Michael led our effort to initially evaluate and restore systems, with Oliver adding to documentation and to-dos. Lulu completed the cluster restoration efforts. Lost 1-2 hours each for Michael, Lulu, and Oliver. Cornell called it a "power blip"; in Oliver's book, any outage longer than seconds is not a "blip". Q: Did it break one GPU workstation?

1/27/2014 (Monday)

  • Outage duration: 17-19 minutes. CU's record: power outage at 2:22p, restored around 2:41p. Oliver's record: power outage at 2:22p, restored at 2:39p.
  • Cause: ?
  • Official link: http://www.it.cornell.edu/services/alert.cfm?id=3040
  • ChemIT notes: Lulu, Michael, and Oliver shut down headnodes and other systems which were on UPS. (Systems not on UPS shut down hard, as usual.) Lost 3 hours for Lulu, Michael, and Oliver. Roger was away on vacation (out of the U.S.).

12/23/2013 (Monday)

  • Outage duration: 2 minutes. CU's report: 8:36 AM - 8:38 AM. (ChemIT staff not in yet.)
  • Cause: Human error?
  • Official link: http://www.it.cornell.edu/services/alert.cfm?id=2982
  • ChemIT notes: Terrible timing, right before the longest staff holiday of the year. ChemIT staff were not present during the failure. Lost most of the day for Roger and Oliver. Michael Hint and Lulu were away on vacation (out of the U.S.).

7/17/13

  • Outage duration: Half a morning (~2 hours). CU's report: 8:45 AM - 10:45 AM.
  • Official link: http://www.it.cornell.edu/services/alert.cfm?id=2711

...

Cluster                                     | Done          | Not done | Notes
Collum                                      | X (Spring'14) |          |
Lancaster, with Crane (new)                 | X (Spring'14) |          | Funded by Crane.
Hoffmann                                    | X (Spring'14) |          |
Scheraga                                    | X (Fall'14)   |          | See the chart below for the 4 stand-alone computational computers.
Loring                                      |               | X        | Unique: need to do ASAP.
Abruna                                      |               | X        | Unique: need to do ASAP.
C4 Headnode: pilot (Widom's 2 nodes there)  | X (old UPS)   |          | Provisioned on the margin, since still a pilot. (Not funded by Widom.)
Widom                                       |               | X        | See "C4", above.

...

Computer                                              | Done          | Not done | Notes
Scheraga's 4 GPU rack-mounted computational computers |               | X        | Need to protect? Data point: the Feb'14 outage resulted in one of these not booting up correctly.
NMR web-based scheduler                               | X (Spring'14) |          |
Coates: MS SQL Server                                 |               | X        | Unique: need to do ASAP.

Do all switches: Maybe ~$340 ($170*2), every ~4 years.

  • Recommend: Do ASAP.
  • Q: Funding?
  • Other issues and concerns with actually implementing this approach: Rack space. Maybe power. Maybe cord lengths. What other issues?

Do all compute nodes: ~$18K every ~4 years (see the cost sketch after this list).

  • ~20 20-amp UPSs ($900 each) required.
  • These are from back-of-the-envelope calculations.
  • If we were to actually implement this, there may be smarter ways to do it, but the cost will likely not be lower. (In fact, costs may be higher if a different approach offers a sufficiently greater benefit, for example.)
  • Issues and concerns with actually implementing this approach: Costs. Rack space. Maybe power. Maybe cord lengths. What other issues?
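
The figures above are only back-of-the-envelope arithmetic; here is a minimal Python sketch of the same estimate. The unit prices, unit counts, and the ~4-year replacement cycle are taken from the bullets above; the annualized total is simply that arithmetic restated, not a quote or a measured cost.

# Back-of-the-envelope UPS cost arithmetic, using only the figures on this page.
switch_ups_price = 170        # $ per small UPS for a network switch
switches_to_protect = 2       # per the "~$340 ($170*2)" estimate above
compute_ups_price = 900       # $ per 20-amp UPS
compute_ups_count = 20        # "~20 20-amp UPSs" from the bullet above
replacement_cycle_years = 4   # units/batteries assumed replaced every ~4 years

switch_total = switch_ups_price * switches_to_protect    # ~$340
compute_total = compute_ups_price * compute_ups_count    # ~$18,000
annualized = (switch_total + compute_total) / replacement_cycle_years

print(f"Switch UPSs:       ${switch_total:,} every ~{replacement_cycle_years} years")
print(f"Compute-node UPSs: ${compute_total:,} every ~{replacement_cycle_years} years")
print(f"Rough annualized:  ${annualized:,.0f} per year")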

Compute node counts, for UPS pricing estimates. Does not include head nodes:

Cluster                                     | Done          | Not done | Notes
Collum                                      | X (Spring'14) |          |
Lancaster, with Crane (new)                 | X (Spring'14) |          | Funded by Crane.
Hoffmann                                    | X (Spring'14) |          |
Scheraga                                    | X (Fall'14)   |          | See the chart of the 4 stand-alone computational computers.
Loring                                      |               | X        | Unique: need to do ASAP.
Abruna                                      |               | X        | Unique: need to do ASAP.
C4 Headnode: pilot (Widom's 2 nodes there)  | X (old UPS)   |          | Provisioned on the margin, since still a pilot. (Not funded by Widom.)
Widom                                       |               | X        | See "C4", above.

Procedures and reminders

Reboot switches?

  • 8 switches support the clusters (out of 12 in the room).

Start headnodes, if not on already

  • Only a few are on UPS. Those can obviously be left on.
  • None should be set to auto-start when power comes back on.

Confirm headnodes are accessible via SSH (a quick port-check sketch follows this list)

PuTTY on Windows

Use FMPro to get connection info?! (The right info isn't in there yet, though...)

Menu => List Machine / User Selections => List Group or selected criteria

  • Machine function => Server HN
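
A minimal Python sketch of this reachability check, assuming placeholder hostnames (substitute the real headnode names, e.g. from the FMPro "Server HN" list above). It only confirms that TCP port 22 answers; it does not attempt a login the way PuTTY would.

# Quick post-outage check that each cluster headnode answers on the SSH port.
# The hostnames below are placeholders, not the real ChemIT headnode names.
import socket

HEADNODES = [
    "collum-headnode.chem.cornell.example",    # placeholder
    "hoffmann-headnode.chem.cornell.example",  # placeholder
]

def ssh_port_open(host, port=22, timeout=5):
    """Return True if a TCP connection to the SSH port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in HEADNODES:
    state = "reachable" if ssh_port_open(host) else "NOT reachable"
    print(f"{host}: SSH port 22 {state}")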

Start compute nodes

If nodes don't show up, consider:

...