Excerpt
A snapshot of the UPS's used to support servers, switches, and other equipment under ChemIT's management. Mostly within ChemIT's Baker 248 server room.

UPS inventory for CCB Clusters and non-cluster HPCs

Note 1: Currently none of the clusters ChemIT manages have UPSs for their compute nodes. Thus, this is our standard community standard of practice. (Is this what CAC does, too?).

ChemIT staff must be called in to restart the compute nodes after //any// power failure.
After even the briefest of power outages, all compute nodes will be off and thus clusters will be unusable. This is true even if the headnode and network switch have UPS backup.

Note 2: Having a UPS is expected to provide power backup for perhaps less than 10 minutes. (Depends on size of UPS, condition/ age of that UPS's battery, and the demands placed on that UPS.) This protects against most power outages.

Adding a USB connection from the UPS to a system allows that system to execute a shutdown command, properly shutting the system down in a timely manner during a prolonged power outage. Otherwise the system will simply lose power, and that kind of forced shutdown can often cause software and hardware failures.
Ideally, ChemIT will have the resources to expand our capabilities to enable shutting down UPS-protected systems during prolonged power outages running beyond the capacities of the UPS, beyond the systems with a direct USB connection to the UPS.

Upgrade status14: Within year, upgrade OS?
No h/w upgrades plannedSpring'14: Upgraded OS and added 2 nodes

Cluster name	UPS for main system or headnode	Maintenance window	UPS shutdown algorithm, if any	Tools used	Other notes
Abruna	NONE Unique: Need to do ASAP (no backup of OS!)	n/a4	n/	a
Ananth	(unknownUnknown)	n/a	n/a	Cluster managed by CAC, not ChemIT
Collum	Done Spring'14	Fall'13: Upgraded OS and added 2 nodes
Hoffmann	Done Spring'14	Winter'13/14: Upgraded OS and added 2 nodes
Lancaster (w/ Crane)	Done Spring'14 (Funded by Crane)
Loring	NONE Unique: Need to do ASAP (no backup of OS!)		4/14: When do OS upgrades, and why? No hardware upgrades planned	Merged with Widom cluster	n/a	n/a
Scheraga: Current, production Matrix	Scheraga	Done Fall'14 (See chart below for stand-alone computational (GPU) computers)		Summer'14: $50K hardware upgrades; to include OS upgrade.
Widom	Yes Waiting for head node to deploy (on C4 at the moment)		Spring'14: Upgraded OS and added 2 nodes
ChemIT (C4)	Done Old UPS; on the margin		4/14: When turn into production, and for whom?
Totals:

CCB non-cluster HPCs, summary information

Inventory and summary notes regarding non-cluster HPC systems in 248, including computational stand-alone systems.

Columns needed - software installed / managed (do per system; dedicated page?), Cores, Age, storage and related (RAID: h/w or s/w?)

as-chm-bair-08.ad.cornell.edu ChemIT 10.253.229.196/192.

Name of system, and purpose	ChemIT / other support	DNS name (may have CNAME)	IP Gateway	ChemIT Network	Headnode IPMI Network	OS	OS Version	UPS	Maintenance window	Upgrade status

Scheraga: Forthcoming Matrix	Done Fall'14 (See below for stand-alone computational (GPU) computers, if any are to remain as such)	UPS supporting both Synology storage system and headnode. UPS USB-connected to Synology storage system. Synology thus sends a signal to headnode. Algorithms are: Synology: Headnode:	Synology's own s/w. On Linux systems, running "nut".
Widom (w/ Loring)	Done April 2016 head node to deploy finished			Moved Widom HeadNode to Loring UPS
ChemIT (C4)	Done			Moved C4 to Loring UPS
Baird: 1 rack-mounted computational computer	ChemIT	compute.baird.chem.cornell.edu	NONE ChemIT recommended making this investment (standard of practice), but group decided explicitly not to make the investment.	n/a	n/a	10.253.229.178	192.168.255.120	192.168.255.121	Windows Server	2012R2	NONE ( Suggested but not done )
Freed: Eldor		eldor.acert.chem.cornell.edu	10.253.229.96	192.168.255.87		CentOS	6.4	NONE		NONE	n/a	n/a
Petersen: 2 rack-mounted computational computers	ChemIT	calc01.petersen.chem.cornell.edu calc02.petersen.chem.cornell.edu				Windows Server	2012R2	Yes, but needs to be deployed in true production; using Widom's UPS for now. ChemIT using UPS for testing UPS-related control software			.	UPS supporting both system #50 and system #51. UPS is USB-connected to system #50, which itself does not send signal to system #51. Algorithms for System #50 is: Shutdown if only 10% battery power is left. (System #51 currently does not have a way to be shutdown properly if there is a prolonged power outage.)	Windows OS	ChemIT would like to: Establish sending a signal from system #50 to system #51 and have system #51 properly shut down in the event of a prolonged outage.
Scheraga: 4 GPU rack-mounted computational computers	ChemIT	gpu.scheraga.chem.cornell.edu	10.253.229.70	192.168.255.139	192.168.255.138	CentOS	6.4	NONE ($900, estimate) Need to protect? Data point: Feb'14 outage resulted in one of these not booting up correctly.	n/a	n/a

Power outage impact on systems with and without UPS

~5-10 minute outage from Sunday, 4/23/207, per Michael Hint's investigations

Group or server	UPS info (details in above table)	Impact of outage: Headnode or main server	Impact of outage: Storage	Impact of outage: Other
Chemistry IT: SERV-05: HyperV production hosts: Stockroom QB, Stockroom WebApp, ChemIT file share, test WSUS. (Dell, rack)	Worthless: Died within 2 minutes. (Was a hand-me-down)	FAILED		Plan: All but ChemIT file share going to AWS.
Chemistry IT: SERV-05: HyperV backup. (RedBarn, rack)	Worthless: Died within 2 minutes. (Was a hand-me-down)	FAILED
RESE-01: HyperV hosts to CRANE-19 (NFS) Crane Synology	Survived	Fine	Fine
Scheraga Matrix headnode

...

Scheraga Matrix Synology	Survived	Fine	Fine	(down)
Hoffmann	Survived	Fine	n/a	(down)	Router config reset, so failed
Lancaster- Crane	Survived	Fine	n/a	(down)
Widom-Loring-Abruna	Survived	Fine	n/a	bw001 up, since part of twin head node (all the rest were down)
Baird compute server	No UPS	Down (MH restarted remotely via IPMI)	n/a
Petersen	Survived
Freed's Eldor	?

What does it cost to UPS

...

a research

...

system?

Assuming protection for 1-3 minutes MAXIMUM:

Current goal: Do all head nodes and stand-alone computers in 248 Baker Lab. Started getting done, fall 2014.

Assuming protection for 1-3 minutes MAXIMUM:

About $180 (APC brand) per head node or server every ~3-4 years (3 yr for warranty and ~4 years actual battery life).

Unusual case: And ~$900/ set of 4 GPU systems (Scheraga) every ~3-4 years (to for this, must confirm approach, appropriateness, and estimates).

...

Remaining UPS's to invest in

...

Most we been done Spring 14, after the spate of power failures. See CCB's HPC page (first chart, in "UPS for headnode" column) for details

Cluster	Done	Not done	Notes	Loring
X	Unique: Need to do ASAP	Abruna		X	Unique: Need to do ASAP

...

See CCB's HPC page (second chart, in "UPS" column) and CCB's non-HPC page (in "UPS" column) for details of the few that are already done.

...

Computer	Note done	Notes
Coates: MS SQL Server	X	Unique: Need to do ASAP
Freed: Eldor	X	Unique: Need to do ASAP? (Q: Is OS backed up?)
Baird: 1 rack-mounted computational computer	X	Need?

Review Review others at above two cited pages which might need a UPS, after above ones done.

...

Space shortcuts

Child pages

Versions Compared

Old Version 2

New Version Current

Key

See also

UPS inventory for CCB Clusters and non-cluster HPCs

CCB non-cluster HPCs, summary information

Power outage impact on systems with and without UPS

What does it cost to UPS

a research

system?

Remaining UPS's to invest in

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 2

New Version Current

Key

See also

UPS inventory for CCB Clusters and non-cluster HPCs

CCB non-cluster HPCs, summary information

Power outage impact on systems with and without UPS

What does it cost to UPS

a research

system?

Remaining UPS's to invest in