Excerpt |
---|
A snapshot of the UPSs that support servers, switches, and other equipment under ChemIT's management, mostly within ChemIT's Baker 248 server room. |
See also
UPS inventory for CCB Clusters and non-cluster HPCs
Note 1: Currently none of the clusters ChemIT manages have UPSs for their compute nodes. Thus, this is our community's standard of practice. (Is this what CAC does, too?)
- ChemIT staff must be called in to restart the compute nodes after *any* power failure.
- After even the briefest of power outages, all compute nodes will be off and thus clusters will be unusable. This is true even if the headnode and network switch have UPS backup.
Note 2: A UPS is expected to provide perhaps less than 10 minutes of backup power. (This depends on the size of the UPS, the condition and age of its battery, and the load placed on it.) This protects against most power outages.
- Adding a USB connection from the UPS to a system allows that system to execute a shutdown command, properly shutting the system down in a timely manner during a prolonged power outage. Otherwise the system will simply lose power, and that kind of forced shutdown can often cause software and hardware failures.
- Ideally, ChemIT will have the resources to extend orderly shutdown during prolonged power outages (those that outlast the UPS) to all UPS-protected systems, not just the systems with a direct USB connection to the UPS.
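The shutdown decision described in Note 2 can be sketched in a few lines. This is a minimal illustration, not ChemIT's actual configuration: it assumes status text in the `key: value` form that NUT's `upsc` prints, uses the standard NUT variable names `ups.status` and `battery.charge`, and borrows the 10% threshold from the Petersen row in the table below. The function names and the sample text are hypothetical.

```python
def parse_ups_status(text):
    """Parse 'key: value' lines (the form NUT's upsc prints) into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def should_shut_down(values, min_charge_percent=10):
    """Return True when the UPS is on battery and nearly drained.

    'OB' (on battery) and 'LB' (low battery) are NUT status flags.
    The 10% default is an assumption taken from the Petersen row.
    """
    flags = values.get("ups.status", "").split()
    on_battery = "OB" in flags
    low_battery = "LB" in flags
    charge = float(values.get("battery.charge", "100"))
    return on_battery and (low_battery or charge <= min_charge_percent)


# Invented sample text for illustration only.
sample = "battery.charge: 9\nups.status: OB\n"
print(should_shut_down(parse_ups_status(sample)))
```

A brief outage never trips this check, because the charge stays above the threshold until the UPS is nearly exhausted; only a prolonged outage leads to a shutdown.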
Cluster name | UPS for | UPS shutdown algorithm, if any | Tools used | Other notes |
---|---|---|---|---|
NONE | n/a | n/a | ||
(Unknown) | n/a | n/a | Cluster managed by CAC, not ChemIT | |
Done Spring'14 |
| |||
Done Spring'14 |
| |||
Done Spring'14 |
| |||
Merged with Widom cluster | n/a | n/a | ||
Scheraga: Current, production Matrix | Done Fall'14 |
| ||
Scheraga: Forthcoming Matrix | Done Fall'14 | UPS supporting both Synology storage system and headnode. UPS USB-connected to Synology storage system. Synology thus sends a signal to headnode. Algorithms are:
| Synology's own s/w. On Linux systems, running "nut". | |
Widom (w/ Loring) | Done April 2016 |
| Moved Widom HeadNode to Loring UPS
| |
ChemIT (C4) | Done |
| Moved C4 to Loring UPS | |
Baird: 1 rack-mounted computational computer | NONE | n/a | n/a | |
Freed: Eldor | NONE | n/a | n/a | |
Petersen: 2 rack-mounted computational computers | Yes, but needs to be deployed in true production; using Widom's UPS for now. | UPS supports both system #50 and system #51. The UPS is USB-connected to system #50, which does not itself send a signal to system #51. Algorithm for system #50: shut down when only 10% battery power is left. (System #51 currently has no way to be shut down properly during a prolonged power outage.) | Windows OS | ChemIT would like to have system #50 send a signal to system #51 so that system #51 shuts down properly in the event of a prolonged outage. |
Scheraga: 4 GPU rack-mounted computational computers | NONE | n/a | n/a |
Power outage impact on systems with and without UPS
~5-10 minute outage on Sunday, 4/23/2017, per Michael Hint's investigations
Group or server | UPS info (details in above table) | Impact of outage: Headnode or main server | Impact of outage: Storage | Impact of outage: Compute nodes (expect "down") | Impact of outage: Other |
---|---|---|---|---|---|
Chemistry IT: SERV-05: HyperV production hosts: Stockroom QB, Stockroom WebApp, ChemIT file share, test WSUS. (Dell, rack) | Worthless: Died within 2 minutes. (Was a hand-me-down) | FAILED | Plan: All but ChemIT file share going to AWS. | ||
Chemistry IT: SERV-05: HyperV backup. (RedBarn, rack) | Worthless: Died within 2 minutes. (Was a hand-me-down) | FAILED | |||
RESE-01: HyperV hosts to CRANE-19 (NFS) Crane Synology | Survived | Fine | Fine | ||
Scheraga Matrix headnode Scheraga Matrix Synology | Survived | Fine | Fine | (down) | |
Hoffmann | Survived | Fine | n/a | (down) | Router config reset, so failed |
Lancaster- Crane | Survived | Fine | n/a | (down) | |
Widom-Loring-Abruna | Survived | Fine | n/a | bw001 up, since part of twin head node (all the rest were down) | |
Baird compute server | No UPS | Down (MH restarted remotely via IPMI) | n/a | ||
Petersen | Survived | ||||
Freed's Eldor | ? |
What does it cost to UPS a research system?
Current goal: Put a UPS on every head node and stand-alone computer in 248 Baker Lab. Work started in fall 2014.
Assuming protection for 1-3 minutes MAXIMUM:
About $180 (APC brand) per head node or server every ~3-4 years (3-year warranty, ~4 years of actual battery life).
- Unusual case: ~$900 per set of 4 GPU systems (Scheraga) every ~3-4 years. (For this case, we must confirm the approach, its appropriateness, and the estimates.)
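To compare these figures with an annual budget, the per-unit costs can be spread over the replacement interval. The sketch below does only that arithmetic with the numbers above; the function name is hypothetical, and the result ignores installation labor and price changes.

```python
def annual_cost_range(unit_cost, min_years, max_years):
    """Return (low, high) yearly cost for a unit replaced every min..max years."""
    return unit_cost / max_years, unit_cost / min_years


# Figures from the estimates above: $180 per head node, $900 per GPU set.
head_node = annual_cost_range(180, 3, 4)
gpu_set = annual_cost_range(900, 3, 4)

print(head_node)  # (45.0, 60.0)
print(gpu_set)    # (225.0, 300.0)
```

In other words, roughly $45-60 per year per head node or server, and roughly $225-300 per year for the set of 4 GPU systems.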
Remaining UPSs to invest in
Clusters
Most were done in Spring '14, after the spate of power failures. See CCB's HPC page (first chart, in "UPS for headnode" column) for details.
Cluster | Done | Not done | Notes |
---|---|---|---|
Abruna | | X | Unique: Need to do ASAP |
Non-clusters
See CCB's HPC page (second chart, in "UPS" column) and CCB's non-HPC page (in "UPS" column) for details of the few that are already done.
Stand-alone computers' UPS status:
Computer | Done | Not done | Notes |
---|---|---|---|
Coates: MS SQL Server | | X | Unique: Need to do ASAP |
Freed: Eldor | | X | Unique: Need to do ASAP? (Q: Is OS backed up?) |
After the above are done, review the two pages cited above for other systems that might need a UPS.
Switches
Do all switches: Maybe ~$340 ($170*2), every ~4 years.
- Recommend: Do ASAP.
- Q: Funding?
- Other issues and concerns with actually implementing this approach:
- Rack space. Maybe power. Maybe cord lengths. What other issues?
Do all compute nodes: ~$18K initially, and perhaps ~$4.5K every ~4 years to replace batteries and deal with UPS hardware failures.
- ~20 20-amp UPSs ($900 each) required.
- Replacement batteries ~$200 each, or roughly 1/4 of the cost of a new UPS.
- Estimates are simply back-of-the-envelope calculations.
- If we were to actually implement this, there may be smarter ways to do it, but the total cost will likely not be lower.
- In fact, costs may be higher if, for example, a different approach offers sufficiently higher benefit.
- Issues and concerns with actually implementing this approach:
- Costs. Rack space. Maybe power. Maybe cord lengths. What other issues?
Compute node counts, for UPS pricing estimates. Does not include head node:
- Count source: ChemIT's Computer counts with CCB clusters
Cluster | Compute node count | Power strip equivalents | Cost estimate | Notes |
---|---|---|---|---|
Collum | 8 | 1 | $900 | |
Lancaster, with Crane (new) | 10 | 2 | $1.8K | |
Hoffmann | 19 | 2 | $1.8K | |
Scheraga | 91 | 13 | $11.7K | |
Loring | 4 | 1 | $900 | |
Abruna | 9 | 1 | $900 | |
C4 head node: pilot | N/A | N/A | N/A | This CCB Community head node pilot has no compute nodes of its own. |
Widom | 2 | 1 | ? | Compute nodes are hanging off of "C4" head node, above. |
TOTALS | ~140? | 21 | $18K + Widom | |
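The totals can be cross-checked with a short script. This sketch assumes one $900 UPS per power strip equivalent, which is what the per-cluster cost estimates imply, and one ~$200 battery per UPS. Widom is counted in the node and strip totals but left out of the cost, since its estimate is still open. The variable names are illustrative.

```python
UPS_UNIT_COST = 900   # dollars per 20-amp UPS
BATTERY_COST = 200    # dollars per replacement battery

# cluster: (compute node count, power strip equivalents), from the table above
clusters = {
    "Collum": (8, 1),
    "Lancaster, with Crane": (10, 2),
    "Hoffmann": (19, 2),
    "Scheraga": (91, 13),
    "Loring": (4, 1),
    "Abruna": (9, 1),
    "Widom": (2, 1),
}

total_nodes = sum(nodes for nodes, _ in clusters.values())
total_strips = sum(strips for _, strips in clusters.values())

# Widom has no cost estimate yet, so it is excluded from the priced strips.
priced_strips = total_strips - clusters["Widom"][1]
initial_cost = UPS_UNIT_COST * priced_strips
battery_refresh = BATTERY_COST * priced_strips

print(total_nodes, total_strips)      # 143 21
print(initial_cost, battery_refresh)  # 18000 4000
```

The battery-only refresh comes to $4,000; the ~$4.5K figure quoted earlier also allows for UPS hardware failures.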