Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises for both scheduled maintenance and emergency work.
Table of contents
Scheduled maintenance and upgrades procedures
Emergency work procedures
See also
Communication timeline
1) Something bad happens, which was not scheduled.
- Power outages have recently been the most frequent reason for downtime.
- If you haven't already done so, please talk to us about the costs and benefits of options for reducing the impact of power outages. Thank you!
2) ChemIT learns of the emergency situation.
- This often happens when researchers confirm the cluster is down and then email ChemIT with evidence.
- ChemIT staff are available M-F, 9am-5pm most days. Weekends, evenings, and holidays
3) ChemIT characterizes the problem and develops an initial prognosis.
4) ChemIT notifies the group rep (users) of the status and prognosis as soon as practicable.
A record (and notes) of Emergency Actions
Cluster or HPC name | Event date | |||||
---|---|---|---|---|---|---|
Eldor |