Excerpt |
---|
In summer 2013 and winter 2014, there were an inordinate number of power outages in Baker Lab and other Chem buildings! |
Date | Outage duration | Cause | Official link | ChemIT notes |
---|---|---|---|---|
2/27/2014 (Thursday) | Question | Procedural error? Per 3/3/14 email: "The outage was caused as a result of routine maintenance activities which were being conducted at the Campus' main substation which takes power from the NYSEG transmission system and provides it to campus. This work has been conducted many times before without incident but in this case caused a major disruption of electricity supply. Staff from Utilities and external technical resources are investigating the root cause of this unexpected event." | No link to info in 3/3/14 email? Some delayed time-stamps, from the IT folks: http://www.it.cornell.edu/services/alert.cfm?id=3072 Initial timing info, from the power folks: http://www.cornell.edu/cuinfo/specialconditions/#2050 | Michael led our effort to initially evaluate and restore systems, with Oliver adding to documentation and to-dos. Lulu completed the cluster restoration efforts. Lost 1-2 hours each for Michael, Lulu, and Oliver. Cornell called it a "power blip"; in Oliver's book, any outage longer than seconds is not a "blip". Q: Broke one GPU workstation? |
1/27/2014 (Monday) | 17-19 minutes. CU's record: power out at 2:22p, restored around 2:41p. Oliver's record: power out at 2:22p, restored at 2:39p. | ? | http://www.it.cornell.edu/services/alert.cfm?id=3040 | Lulu, Michael, and Oliver shut down headnodes and other systems which were on UPS. (Those systems not on UPS shut down hard, per usual.) Lost 3 hours for Lulu, Michael, and Oliver. Roger away on vacation (out of the U.S.). |
12/23/2013 (Monday) | 2 minutes. CU's report: 8:36 AM - 8:38 AM. (ChemIT staff not in yet.) | Procedural error? | http://www.it.cornell.edu/services/alert.cfm?id=2982 | Terrible timing, right before the longest staff holiday of the year. ChemIT staff not present during the failure. Lost most of the day for Roger and Oliver. Michael and Lulu away on vacation (out of the U.S.). |
7/17/13 | Half a morning (~2 hours). CU's report: 8:45 AM - 10:45 AM. | | http://www.it.cornell.edu/services/alert.cfm?id=2711 | |
Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?
- Because we don't know the answer to this question after any specific power outage, we are reluctant to turn servers back on right away. Instead, we prefer to wait ~10-25 minutes.
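As a small illustration of that habit, here is a minimal sketch (Python) of a hold-down timer to run before powering systems back on; the 15-minute default and the message text are placeholders of ours, chosen from the ~10-25 minute window above.

```python
import time

# Hold-down timer: after power is reported restored, wait before trusting it.
# 15 minutes is a placeholder within the ~10-25 minute window mentioned above.
HOLD_MINUTES = 15

print(f"Power reported restored; holding {HOLD_MINUTES} minutes in case it drops again...")
for remaining in range(HOLD_MINUTES, 0, -1):
    print(f"  {remaining} minute(s) left before powering systems back on")
    time.sleep(60)
print("Hold complete. If power stayed up, proceed with the restart procedure below.")
```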
What would it cost to UPS our research systems?
Assuming protection for 1-3 minutes MAXIMUM:
Do all headnodes and stand-alone computers in 248 Baker Lab:
- This has started getting done. About $170/headnode every ~4 years.
Cluster headnodes' UPS status:
Cluster | Done | Not done | Notes |
---|---|---|---|
Collum | X (Spring '14) | | |
Lancaster, with Crane (new) | X (Spring '14) | | Funded by Crane. |
Hoffmann | X (Spring '14) | | |
Scheraga | X (Fall '14) | | See chart below for the 4 stand-alone computational computers. |
Loring | | X | Unique: need to do ASAP. |
Abruna | | X | Unique: need to do ASAP. |
C4 Headnode: pilot (Widom's 2 nodes there) | X (old UPS) | | Provisioned on the margin, since still a pilot. (Not funded by Widom.) |
Widom | | X | See "C4", above. |
Stand-alone computers' UPS status:
Computer | Done | Not done | Notes |
---|---|---|---|
Scheraga's 4 GPU rack-mounted computational computers | | X | Need to protect? |
NMR web-based scheduler | X | | |
Coates: MS SQL Server | | X | Unique: need to do ASAP. |
Do all switches: Maybe ~$340 ($170 × 2), every ~4 years.
- Recommend: Do ASAP.
- Q: Funding?
- Other issues and concerns with actually implementing this approach:
  - Rack space. Maybe power. Maybe cord lengths. What other issues?
Do all compute nodes: ~$18K every ~4 years.
- ~20 20-amp UPSs ($900 each) required.
- Estimates are simply back-of-the-envelope calculations.
- If we were to actually implement this, there may be smarter ways to do it, but the total cost will likely not be lower.
  - In fact, costs may be higher if, for example, a different approach offers a sufficiently greater benefit.
- Issues and concerns with actually implementing this approach:
  - Costs. Rack space. Maybe power. Maybe cord lengths. What other issues?
Compute node counts, for UPS pricing estimates (does not include head nodes). A quick arithmetic check follows the table:
Cluster | Compute node count | Power strip equivalents (~8/strip MAX) | Cost estimate, every 4 years | Notes |
---|---|---|---|---|
Collum | 8? | 1 | $900 | |
Lancaster, with Crane (new) | 12? | 2 | $1.8K | |
Hoffmann | 14? | 2 | $1.8K | Also consider: when to communicate with others, what gets communicated, and for what purpose? |
Scheraga | 92? | 13 | $11.7K | |
Loring | 6? | 1 | $900 | |
Abruna | 6? | 1 | $900 | |
C4 Headnode: pilot (Widom's 2 nodes there) | N/A | N/A | N/A | This CCB Community headnode pilot has no compute nodes of its own. |
Widom | 2 | 1 | ? | Compute nodes are hanging off of "C4", above. |
TOTALS | ~140? | 21 | $18K + Widom | |
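As a sanity check on those totals, a minimal sketch (Python) that re-derives the ~$18K figure; the per-cluster node and strip counts are copied from the table, and the $900-per-UPS figure comes from the "~20 20-amp UPSs ($900 each)" estimate above.

```python
# Minimal sketch re-deriving the totals in the table above.
UPS_COST = 900  # dollars per 20 A UPS, replaced every ~4 years

# (cluster, estimated compute node count, power-strip equivalents) -- copied from the table
clusters = [
    ("Collum",                 8,  1),
    ("Lancaster, with Crane", 12,  2),
    ("Hoffmann",              14,  2),
    ("Scheraga",              92, 13),
    ("Loring",                 6,  1),
    ("Abruna",                 6,  1),
    ("Widom",                  2,  1),  # cost carried as "?" in the table (nodes hang off C4)
]

total_nodes = sum(nodes for _, nodes, _ in clusters)
total_strips = sum(strips for _, _, strips in clusters)
# Widom's UPS cost is left out of the dollar total ("$18K + Widom" in the table).
total_cost = sum(strips * UPS_COST for name, _, strips in clusters if name != "Widom")

print(f"~{total_nodes} nodes, {total_strips} strips, ~${total_cost:,} + Widom every ~4 years")
# -> ~140 nodes, 21 strips, ~$18,000 + Widom every ~4 years
```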
Procedures and reminders
Reboot switches?
- 8 switches support the clusters (out of 12 in the room).
Start head nodes, if not on already
- Only a few are on UPS; those can obviously be left on.
- None should be set to auto-start when power returns after an outage.
Confirm head nodes accessible via SSH (see the reachability sketch below)
- PuTTY on Windows.
- Use FMPro to get connection info?! (Not the right info there, though...)
  - Menu => List Machine / User Selections => List Group or selected criteria
  - Machine function => Server HN
Start compute nodes
If nodes don't show up, consider:
...
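A minimal sketch (Python) of the "confirm head nodes accessible via SSH" step above: it only checks that TCP port 22 answers. The head-node hostnames are hypothetical placeholders; the real list would come from FMPro (Machine function => Server HN).

```python
import socket

# Hypothetical head-node hostnames; substitute the real list from FMPro.
HEAD_NODES = ["collum-hn", "hoffmann-hn", "scheraga-hn"]

def ssh_port_open(host, port=22, timeout=5):
    """Return True if the host accepts a TCP connection on the SSH port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for hn in HEAD_NODES:
    print(f"{hn}: {'OK' if ssh_port_open(hn) else 'NOT reachable'}")
```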