Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises during both scheduled maintenance and emergency work.
Scheduled maintenance and upgrade procedures
ChemIT notifies the cluster lead that maintenance will occur on a specific upcoming date.
- What is a long enough lead time for the group?
- What is a short enough lead time for ChemIT?
Message will state:
- Date and time of shut-down. Expected duration of shut-down.
- Most events will occur Mon-Thu, 9am-5pm EST, when staff and backup personnel are available.
- Purpose summary.
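The preferred window above (Mon-Thu, 9am-5pm) can be checked mechanically before a date is announced. The following is only a minimal sketch, not an existing ChemIT tool: the function name `fits_preferred_window` is hypothetical, and it assumes the start time is a naive datetime already expressed in local Eastern time.

```python
from datetime import datetime, timedelta

# Assumption: datetimes are naive and already expressed in local Eastern time.
PREFERRED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
WINDOW_START_HOUR = 9   # 9am
WINDOW_END_HOUR = 17    # 5pm


def fits_preferred_window(start: datetime, duration: timedelta) -> bool:
    """Return True if the whole shut-down falls within Mon-Thu, 9am-5pm on one day."""
    end = start + duration
    day_open = start.replace(hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0)
    day_close = start.replace(hour=WINDOW_END_HOUR, minute=0, second=0, microsecond=0)
    return (
        start.weekday() in PREFERRED_WEEKDAYS
        and day_open <= start
        and end <= day_close
    )
```

Since "most" events fall in this window rather than all, a failed check is best treated as a prompt to confirm staffing, not as a hard rule.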
Message will be sent to:
- Whom?
Sample message:
To: ?
Subject: PI's ClusterName: Date/time planned down-time
To all users of the PI's ClusterName,
On Date/time, the cluster will be down for planned maintenance for 3 hours.
During this down-time, we intend to:
- Update the OS of the storage system.
- Update the BIOS of the 4 GPU clusters.
- Test new GPU software capabilities.
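To keep notices consistent from one event to the next, the sample message above could be generated from a small template. This is a sketch under assumptions: `build_downtime_notice` and its parameters are hypothetical names, the date format is arbitrary, and the recipient list is left out because it is still an open question on this page.

```python
from datetime import datetime, timedelta


def build_downtime_notice(cluster_name, start, duration, tasks):
    """Fill in the sample planned down-time message (hypothetical helper)."""
    when = start.strftime("%Y-%m-%d %H:%M")
    hours = duration.total_seconds() / 3600
    lines = [
        f"Subject: {cluster_name}: {when} planned down-time",
        "",
        f"To all users of {cluster_name},",
        "",
        f"On {when}, the cluster will be down for planned maintenance for {hours:g} hour(s).",
        "During this down-time, we intend to:",
    ]
    # One bullet per planned task, mirroring the sample message.
    lines += [f"- {task}" for task in tasks]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_downtime_notice(
        "ExampleCluster",  # placeholder cluster name
        datetime(2024, 3, 5, 10, 0),
        timedelta(hours=3),
        ["Update the OS of the storage system.", "Test new GPU software capabilities."],
    ))
```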
Ideas
Establish a schedule for at least 6 months out. Why? Who would use it? What happens when things change?
Emergency work procedures
ChemIT learns of the emergency situation.
- The most common emergency is a power outage.
Typical work done during maintenance
Update BIOS (and why it's done...)
Update OS
Update...
Test UPS
Test backups
Test...
Emergency work procedures
See also
Communication timeline
1) Something bad and unscheduled happens.
- Power outages have recently been the most frequent cause of downtime.
- If you haven't already done so, please talk to us about the costs and benefits of options for reducing the impact of power outages. Thank you!
2) ChemIT learns of the emergency situation.
- This often happens when researchers confirm the cluster is down and then email ChemIT with evidence.
- ChemIT staff are available M-F, 9am-5pm most days. Weekends, evenings, and holidays
3) ChemIT characterizes the problem and develops an initial prognosis.
4) ChemIT notifies group rep (users) of status and prognosis as soon as practicable.
A record (and notes) of Emergency Actions
Cluster or HPC Cluster name | Event date | |||||
---|---|---|---|---|---|---|
ChemIT (C4)Eldor |