Too many power outages in Baker Lab, and other Chem buildings!

ChemIT's record of recent power outages

Date	Outage duration	Cause	Official link	ChemIT notes
2/27/2014 Thursday	2 minutes CU's record: Per 3/3/14 email: On Thursday 2/27/14 at 9:50am the campus experienced a disruption of electricity. The whole campus experienced an approximately 30 second outage. Many buildings on the central, west and north campus areas remained without power for 30 minutes to an hour. Oliver's record: Power outage at 9:53a. Restored at 9:55a (more than one minute).	Procedural error? Per 3/3/14 email: The outage was caused as a result of routine maintenance activities which were being conducted at the Campus' main substation which takes power from the NYSEG transmission system and provides it to campus. This work has been conducted many times before without incident but in this case caused a major disruption of electricity supply. Staff from Utilities and external technical resources are investigating the root cause of this unexpected event.	No link to info in 3/3/14 email? Some delayed time-stamps, from the IT folks: http://www.it.cornell.edu/services/alert.cfm?id=3072 Initial timing info, from the power folks: http://www.cornell.edu/cuinfo/specialconditions/#2050	Michael led our effort to initially evaluate and restore systems, with Oliver adding to documentation and to-do's. Lulu completed the cluster restoration efforts. Lost 1-2 hours each for Michael, Lulu, and Oliver. Cornell called it a "power blip". In Oliver's books, any outage longer than seconds is not a "blip". Q: Broke one GPU workstation?
1/27/2014 Monday	17-19 minutes CU's record: Power outage at 2:22p. Restored around 2:41p. Oliver's record: Power outage at 2:22p. Restored at 2:39p.	?	http://www.it.cornell.edu/services/alert.cfm?id=3040	Lulu, Michael, and Oliver shut down head nodes and other systems which were on UPS. (Systems not on UPS shut down hard, per usual.) Lost 3 hours, for Lulu, Michael, and Oliver. Roger away on vacation (out of the U.S.)
12/23/2013 Monday	2 minutes CU's report: 08:36 AM - 8:38 AM (ChemIT staff not in yet.)	Procedural error?	http://www.it.cornell.edu/services/alert.cfm?id=2982	Terrible timing, right before the longest staff holiday of the year. ChemIT staff not present during failure. Lost most of the day, for Roger and Oliver. Michael Hint and Lulu away on vacation (out of the U.S.)
7/17/13	Half a morning (~2 hours) CU's report: 8:45 AM - 10:45 AM		http://www.it.cornell.edu/services/alert.cfm?id=2711

Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?

Because we don't know the answer to this question following any specific power outage, we are reluctant to turn servers back on right away. Instead, we like to wait ~10-25 minutes after power is restored.
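As a rough illustration of that waiting rule, the sketch below computes the restart window from the time power came back. The function name and the default wait values are assumptions taken from the ~10-25 minute range above, and the example uses the 9:55a restoration time from Oliver's record of the 2/27/2014 outage.

```python
from datetime import datetime, timedelta

def restart_window(restored_at, min_wait_min=10, max_wait_min=25):
    """Return (start, end) of the usual waiting range after power is restored.

    restart_window is a hypothetical helper; the 10 and 25 minute defaults
    come from the ~10-25 minute wait described in the text.
    """
    window_start = restored_at + timedelta(minutes=min_wait_min)
    window_end = restored_at + timedelta(minutes=max_wait_min)
    return window_start, window_end

# Power restored at 9:55a on 2/27/2014 (Oliver's record).
restored = datetime(2014, 2, 27, 9, 55)
window_start, window_end = restart_window(restored)
print(window_start.strftime("%H:%M"), window_end.strftime("%H:%M"))  # 10:05 10:20
```

If power drops again inside the window, the clock would restart from the new restoration time.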

What would it cost to UPS our research systems?

Assuming protection for 1-3 minutes MAXIMUM:

Do all head nodes and stand-alone computers in 248 Baker Lab

This work has started. About $180 (APC brand) per head node or server every ~3-4 years (3-year warranty, ~4 years of actual battery life). And ~$900 per set of 4 GPU systems every ~3-4 years (approach and estimates still to be confirmed).

CCB head nodes' UPS status:

Remaining UPS's to invest in

Clusters

Most were done in Spring 2014, after the spate of power failures. See CCB's HPC page (first chart, in "UPS for headnode" column) for details.

Cluster	Done	Not done	Notes
Loring		X	Unique: Need to do ASAP
Abruna		X	Unique: Need to do ASAP

Non-clusters

See CCB's HPC page (second chart, in "UPS" column) and CCB's non-HPC page (in "UPS" column) for details of the few that are already done.

Stand-alone computers' UPS status:

Computer	Not done	Notes
Coates: MS SQL Server	X	Unique: Need to do ASAP
Freed: Eldor	X	Unique: Need to do ASAP? (Q: Is OS backed up?)
Baird: 1 rack-mounted computational computer	X	Need?

Once the above are done, review the two pages cited above for other computers which might need a UPS.

Switches

Do all switches: Maybe ~$340 ($170*2), every ~4 years.

Recommend: Do ASAP.
Q: Funding?
Other issues and concerns with actually implementing this approach:
- Rack space. Maybe power. Maybe cord lengths. What other issues?

Do all compute nodes: ~$18K initially, and perhaps ~$4.5K every ~4 years to replace batteries and deal with UPS hardware failures.

~20 20amp UPS's ($900 each) required.
- Replacement batteries ~$200 each, or ~1/4 replacement cost.
Estimates are simply back-of-the-envelope calculations.
If we were to actually implement this, there may be smarter ways to do it, but the total cost will likely not be lower.
- In fact, costs may be higher if, for example, a different approach offers sufficiently higher benefit.
Issues and concerns with actually implementing this approach:
- Costs. Rack space. Maybe power. Maybe cord lengths. What other issues?

Compute node counts, for UPS pricing estimates. Does not include head node:

Count source: ChemIT's Computer counts with CCB clusters

Cluster	Compute node count	Power strip equivalents (~8/strip MAX)	Cost estimate, every 4 years	Notes
Collum	8	1	$900
Lancaster, with Crane (new)	10	2	$1.8K
Hoffmann	19	2	$1.8K
Scheraga	91	13	$11.7K
Loring	4	1	$900
Abruna	9	1	$900
C4 head node: pilot Widom's 2 nodes there.	N/A	N/A	N/A	This CCB Community head node pilot has no compute nodes of its own. It hosts compute nodes from CCB researchers.
Widom	2	1	?	Compute nodes are hanging off of "C4" head node, above.
TOTALS	~140?	21	$18K + Widom
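The totals in the table can be re-derived with a short calculation. This is only a sketch of the back-of-the-envelope arithmetic: the UPS counts are copied from the "Power strip equivalents" column as listed (not recomputed from node counts), the $900 and $200 figures come from the estimates above, and Widom is left out because its cost is listed as "?". The names `CLUSTERS` and `summarize` are illustrative.

```python
UPS_UNIT_COST = 900   # dollars per 20-amp UPS, from the estimate above
BATTERY_COST = 200    # dollars per replacement battery, from the estimate above

# (cluster, compute node count, UPS units) copied from the table.
# Widom is omitted because its cost is listed as "?".
CLUSTERS = [
    ("Collum", 8, 1),
    ("Lancaster, with Crane", 10, 2),
    ("Hoffmann", 19, 2),
    ("Scheraga", 91, 13),
    ("Loring", 4, 1),
    ("Abruna", 9, 1),
]

def summarize(clusters, unit_cost=UPS_UNIT_COST, battery_cost=BATTERY_COST):
    """Sum node counts and UPS units, then price the initial buy and a battery refresh."""
    nodes = sum(node_count for _, node_count, _ in clusters)
    units = sum(ups_units for _, _, ups_units in clusters)
    return {
        "nodes": nodes,
        "ups_units": units,
        "initial_cost": units * unit_cost,
        "battery_refresh_cost": units * battery_cost,
    }

totals = summarize(CLUSTERS)
print(totals)
# {'nodes': 141, 'ups_units': 20, 'initial_cost': 18000, 'battery_refresh_cost': 4000}
```

The battery-only refresh comes to $4,000; the ~$4.5K figure quoted above also allows for UPS hardware failures.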

Procedures and reminders

Reboot switches?

8 switches support the clusters (out of 12 in the room).

Start head nodes, if not on already

Only a few are on UPS. Those can obviously be left on.
None should be set to auto-start when power returns after an outage.

Confirm head nodes accessible via SSH

PuTTY on Windows

Use FMPro to get connection info?! (not the right info there, though...)

Menu => List Machine / User Selections => List Group or selected criteria

Machine function => Server HN
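Before opening PuTTY sessions one by one, a quick check of whether each head node is accepting connections on the SSH port can save time. The sketch below is one possible way to do that with Python's standard library; it only tests that TCP port 22 accepts a connection and does not log in. The function names are illustrative, and any host names passed in would have to come from ChemIT's own records.

```python
import socket

def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

def check_head_nodes(hosts, port=22, timeout=3.0):
    """Map each host name to True (reachable) or False (not reachable)."""
    return {host: ssh_port_open(host, port, timeout) for host in hosts}

def report(results):
    """Format the results as one line per host, for printing."""
    return [f"{host}: {'reachable' if ok else 'NOT reachable'}"
            for host, ok in sorted(results.items())]
```

For example, `print("\n".join(report(check_head_nodes(hosts))))`, where `hosts` is a list of head node names, prints one status line per head node. A node reported as reachable still needs a real login to confirm that it is usable.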

Start compute nodes

If nodes don't show up, consider:

Restart Torque scheduler on problematic nodes.
Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
Hook up a monitor while one of the problematic nodes boots.
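To judge whether the problematic nodes are grouped on a single switch, it can help to tally them per switch. The sketch below assumes a node-to-switch mapping is available; the node names, switch names, and the 50% threshold are all hypothetical, and the list of missing nodes would be entered by hand or taken from the scheduler's node listing.

```python
from collections import Counter

def suspect_switches(down_nodes, node_to_switch, threshold=0.5):
    """Return switches where at least `threshold` of the attached nodes are down."""
    attached = Counter(node_to_switch.values())
    down = Counter(node_to_switch[n] for n in down_nodes if n in node_to_switch)
    return sorted(sw for sw, count in down.items()
                  if count / attached[sw] >= threshold)

# Hypothetical example data: six nodes spread over two switches.
NODE_TO_SWITCH = {
    "node01": "switch-A", "node02": "switch-A", "node03": "switch-A",
    "node04": "switch-B", "node05": "switch-B", "node06": "switch-B",
}
DOWN = ["node01", "node04", "node05", "node06"]

print(suspect_switches(DOWN, NODE_TO_SWITCH))  # ['switch-B']
```

Here every node on switch-B is missing while only one of three on switch-A is, so switch-B would be the first candidate for a reboot.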
