A snapshot of the UPSs used to support servers, switches, and other equipment under ChemIT's management, mostly within ChemIT's Baker 248 server room.
Note 1: Currently, none of the clusters ChemIT manages have UPSs for their compute nodes, so this is our community's standard practice. (Is this what CAC does, too?)
Note 2: A UPS is expected to provide backup power for perhaps less than 10 minutes, depending on the size of the UPS, the condition and age of its battery, and the load placed on it. This protects against most power outages.
Cluster name | UPS for | UPS shutdown algorithm, if any | Tools used | Other notes |
---|---|---|---|---|
NONE | n/a | n/a | ||
(Unknown) | n/a | n/a | Cluster managed by CAC, not ChemIT | |
Done Spring'14 |
| |||
Done Spring'14 |
| |||
Done Spring'14 |
| |||
Merged with Widom cluster | n/a | n/a | ||
Scheraga: Current, production Matrix | Done Fall'14 | | ||
Scheraga: Forthcoming Matrix | Done Fall'14 | UPS supporting both Synology storage system and headnode. UPS USB-connected to Synology storage system. Synology thus sends a signal to headnode. Algorithms are: | Synology's own s/w. On Linux systems, running "nut". | |
Widom (w/ Loring) | Done April 2016 | | Moved Widom HeadNode to Loring UPS | |
ChemIT (C4) | Done | | Moved C4 to Loring UPS | |
Baird: 1 rack-mounted computational computer | NONE | n/a | n/a | |
Freed: Eldor | NONE | n/a | n/a | |
Petersen: 2 rack-mounted computational computers | Yes, but needs to be deployed in true production; using Widom's UPS for now. | UPS supporting both system #50 and system #51. UPS is USB-connected to system #50, which itself does not send a signal to system #51. Algorithm for system #50: shut down when only 10% battery power is left. (System #51 currently has no way to shut down properly during a prolonged power outage.) | Windows OS | ChemIT would like to have system #50 send a signal to system #51 so that system #51 shuts down properly in the event of a prolonged outage. |
Scheraga: 4 GPU rack-mounted computational computers | NONE | n/a | n/a |
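The shutdown rule described in the Petersen row (shut down when only 10% battery is left, and ideally pass that decision on to a second system) can be sketched in a few lines. This is only an illustration, not ChemIT's deployed configuration. It assumes UPS status is available as `key: value` text in the style printed by NUT's `upsc` command (`battery.charge`, `ups.status`). The host names, function names, and the idea of returning a list of hosts to notify are hypothetical.

```python
from dataclasses import dataclass

# Threshold taken from the Petersen row: shut down at 10% battery.
LOW_BATTERY_PERCENT = 10.0


@dataclass
class UpsReading:
    on_battery: bool        # True when the UPS reports it is running on battery
    charge_percent: float   # remaining battery charge, 0-100


def parse_ups_status(text: str) -> UpsReading:
    """Parse 'key: value' lines (assumed upsc-style output) into a reading."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    # ups.status is a space-separated list of flags; "OB" means on battery.
    flags = values.get("ups.status", "").split()
    return UpsReading(
        on_battery="OB" in flags,
        charge_percent=float(values["battery.charge"]),
    )


def hosts_to_shut_down(reading, hosts, threshold=LOW_BATTERY_PERCENT):
    """Return every host that should shut down now, or an empty list.

    All hosts on the same UPS are returned together, so a system that is
    not USB-connected to the UPS still gets a shutdown request.
    """
    if reading.on_battery and reading.charge_percent <= threshold:
        return list(hosts)
    return []


# Hypothetical host names standing in for system #50 and system #51.
sample = "battery.charge: 9\nups.status: OB LB\n"
print(hosts_to_shut_down(parse_ups_status(sample), ["system-50", "system-51"]))
```

Keeping the decision separate from the parsing means the same rule could be reused on a Windows host, where the status would come from a different tool.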
~5-10 minute outage on Sunday, 4/23/2017, per Michael Hint's investigations
Group or server | UPS info (details in above table) | Impact of outage: Headnode or main server | Impact of outage: Storage | Impact of outage: Compute nodes (expect "down") | Impact of outage: Other |
---|---|---|---|---|---|
Chemistry IT: SERV-05: HyperV production hosts: Stockroom QB, Stockroom WebApp, ChemIT file share, test WSUS. (Dell, rack) | Worthless: Died within 2 minutes. (Was a hand-me-down) | FAILED | Plan: All but ChemIT file share going to AWS. | ||
Chemistry IT: SERV-05: HyperV backup. (RedBarn, rack) | Worthless: Died within 2 minutes. (Was a hand-me-down) | FAILED | |||
RESE-01: HyperV hosts to CRANE-19 (NFS) Crane Synology | Survived | Fine | Fine | ||
Scheraga Matrix headnode Scheraga Matrix Synology | Survived | Fine | Fine | (down) | |
Hoffmann | Survived | Fine | n/a | (down) | Router config reset, so failed |
Lancaster- Crane | Survived | Fine | n/a | (down) | |
Widom-Loring-Abruna | Survived | Fine | n/a | bw001 up, since part of twin head node (all the rest were down) | |
Baird compute server | No UPS | Down (MH restarted remotely via IPMI) | n/a | ||
Petersen | Survived | ||||
Freed's Eldor | ? |
Current goal: Cover all head nodes and stand-alone computers in 248 Baker Lab. Work started in fall 2014.
Assuming protection for 1-3 minutes MAXIMUM:
About $180 (APC brand) per head node or server every ~3-4 years (3-year warranty; ~4 years of actual battery life).
Most were done in spring 2014, after the spate of power failures. See CCB's HPC page (first chart, in "UPS for headnode" column) for details.
Cluster | Done | Not done | Notes |
---|---|---|---|
Abruna | | X | Unique: Need to do ASAP |
See CCB's HPC page (second chart, in "UPS" column) and CCB's non-HPC page (in "UPS" column) for details of the few that are already done.
Stand-alone computers' UPS status:
Computer | Done | Not done | Notes |
---|---|---|---|
Coates: MS SQL Server | | X | Unique: Need to do ASAP |
Freed: Eldor | | X | Unique: Need to do ASAP? (Q: Is OS backed up?) |
After the ones above are done, review the two pages cited above for other systems that might need a UPS.
Do all switches: Maybe ~$340 ($170*2), every ~4 years.
Do all compute nodes: ~$18K initially, and perhaps ~$4.5K every ~4 years to replace batteries and deal with UPS hardware failures.
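To compare these recurring figures on a common basis, they can be converted to a rough yearly cost. The sketch below assumes a flat 4-year replacement cycle for every item (the text gives ~3-4 years for head nodes and ~4 years for the others), so the results are ballpark numbers only. The function name and the item labels are illustrative.

```python
def annual_cost(cost_per_cycle: float, cycle_years: float) -> float:
    """Spread a recurring replacement cost evenly over its cycle."""
    return cost_per_cycle / cycle_years


# (cost in dollars, assumed cycle in years); dollar figures are from the text.
recurring_items = {
    "one head node or server": (180, 4),
    "all switches": (340, 4),
    "compute node UPS upkeep": (4500, 4),
}

for name, (cost, years) in recurring_items.items():
    print(f"{name}: ~${annual_cost(cost, years):,.0f} per year")
```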
Compute node counts, for UPS pricing estimates. Does not include head node:
Cluster | Compute node count | Power strip equivalents | Cost estimate | Notes |
---|---|---|---|---|
Collum | 8 | 1 | $900 | |
Lancaster, with Crane (new) | 10 | 2 | $1.8K | |
Hoffmann | 19 | 2 | $1.8K | |
Scheraga | 91 | 13 | $11.7K | |
Loring | 4 | 1 | $900 | |
Abruna | 9 | 1 | $900 | |
C4 head node: pilot | N/A | N/A | N/A | This CCB Community head node pilot has no compute nodes of its own. |
Widom | 2 | 1 | ? | Compute nodes are hanging off of "C4" head node, above. |
TOTALS | ~140? | 21 | $18K + Widom | |
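The cost column in the table above is consistent with roughly $900 per power strip equivalent (for example, 13 × $900 = $11.7K for Scheraga). That per-strip figure is inferred from the table rather than stated in it. Under that assumption, the totals can be re-derived with a short script, which also makes it easy to update the estimate when node counts change. Widom is left out of the priced total because its cost is listed as "?".

```python
# Assumption: about $900 per power strip equivalent, inferred from the table.
COST_PER_STRIP = 900

# cluster -> (compute node count, power strip equivalents), from the table.
clusters = {
    "Collum": (8, 1),
    "Lancaster, with Crane": (10, 2),
    "Hoffmann": (19, 2),
    "Scheraga": (91, 13),
    "Loring": (4, 1),
    "Abruna": (9, 1),
    "Widom": (2, 1),
}

UNPRICED = {"Widom"}  # cost shown as "?" in the table


def estimate(strips: int) -> int:
    """Estimated UPS cost in dollars for a number of power strip equivalents."""
    return strips * COST_PER_STRIP


total_nodes = sum(nodes for nodes, _ in clusters.values())
total_strips = sum(strips for _, strips in clusters.values())
priced_total = sum(
    estimate(strips)
    for name, (_, strips) in clusters.items()
    if name not in UNPRICED
)

print(f"compute nodes: {total_nodes}")
print(f"power strip equivalents: {total_strips}")
print(f"priced total: ${priced_total:,} + Widom")
```

The node count sums to 143, which matches the table's "~140?", and the priced total matches the $18K figure.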