Impact on Chemistry's IT of an unexpected power outage lasting about 40 minutes
Summary
- Some parts of this power outage went fairly well; others less so.
- Cluster head nodes, compute nodes, and drives seem to be in good shape.
Things we know failed:
248 Baker
Topic or event | Action taken | Action required | Notes |
---|---|---|---|
ChemIT UPS in cluster rack died. Would not turn on. | Moved Widom HeadNode to Loring UPS. Moved C4 to Loring UPS. Moved switch (Collum+Widom) to Collum UPS (Lulu add). Removed ChemIT UPS and plugged it in to charge, just in case. | | |
ChemIT UPS for Windows servers had limited battery power; Hyper-V machines were not able to shut down gracefully. | Even after 24 hours of charging, Synology shows it with only 672 seconds of battery life, and probably not even that. | Needs battery or replacement. | |
Schernology (Scheraga's Synology) could not be accessed for a clean shutdown because the network was not working. | Lulu pushed the power button to shut down the system. It should go into safe mode when the battery runs low. Could have let the UPS shut it down cleanly if we had wanted to wait (confirm true?). | Lulu is investigating options. | |
Mathematica license server did not start after restart. | Restarted manually (common issue). | | |
NMR web server did not start up correctly. | Needed to be kicked a bit to start. | This Gateway needs to go. | |
NMR router lost its configuration. | Reprogrammed passwords and forwarding for SSH & RDP. | Needs external administration access configured. | |
Other
Topic or event | Action taken | Action required | Notes |
---|---|---|---|
Lee - Steven Lee's SGI had several problems (boot, date, etc.). | Lulu wrestled with resetting the (hardware) time. | Advise getting a UPS? If so, how to automate shutdown if no monitor is powered? | INC000001652417 |
Marohn - B19 PSB's AS-CHM-Maro-03 RAID-1 had a drive fail. | Oliver tested the drive OK, wiped it, and re-added it to the RAID. | Should group invest in more risk management, including a UPS? | INC000001652216 |
Fors - Fors instrument came up and started working. Were both the computer and the instrument previously rebooted? | Roger asked group about reboot history of instrument. | Awaiting group's response (Dillon) on reboot history of instrument | INC000001645245 |
Petersen - His group's UPS died. | Roger got him a quote. | | INC000001653506 |
VoIP phones did not boot up correctly and hung. | Oliver wrote to CIT. | | CIT: INC000001653941 |