Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises during both scheduled maintenance and emergency work.
Scheduled maintenance and upgrade procedures
Summary
Details
ChemIT notifies cluster lead that maintenance will occur on a specific upcoming date.
- What is a long enough lead time for the group?
- What is a short enough lead time for ChemIT?
Message will state:
- Date and time of shut-down. Expected duration of shut-down.
- Most events will occur Mon-Thur, 9am-5pm EST, when staffing and backup folks are available.
- Purpose summary.
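The preferred window mentioned above (Mon-Thur, 9am-5pm) could be checked before a date is proposed. The following is a minimal sketch, not an existing ChemIT tool: the function name `in_preferred_window` is hypothetical, and it assumes the start time is already expressed in local Eastern time and that the whole outage should fit within a single day.

```python
from datetime import datetime, timedelta, time

# Assumption: datetimes are naive and already in local Eastern time.
WINDOW_OPEN = time(9, 0)      # 9am
WINDOW_CLOSE = time(17, 0)    # 5pm
LAST_PREFERRED_WEEKDAY = 3    # Monday is 0, so 3 is Thursday


def in_preferred_window(start: datetime, duration: timedelta) -> bool:
    """Return True if the whole outage fits within Mon-Thur, 9am-5pm."""
    end = start + duration
    return (
        start.date() == end.date()                      # no overnight work
        and start.weekday() <= LAST_PREFERRED_WEEKDAY   # Monday to Thursday
        and start.time() >= WINDOW_OPEN
        and end.time() <= WINDOW_CLOSE
    )
```

For example, a 3-hour shut-down starting at 9am on a Tuesday passes the check, while the same shut-down on a Friday, or one starting at 3pm, does not.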
Message will be sent to:
- Whom?
Typical work done during maintenance
(Point to Lulu's current, active checklist! Nov 2015)
Update BIOS (and why it's done...)
Update OS
Update...
Test UPS
Test backups
Test...
Sample message:
To: ?
Subject: PI's ClusterName: Date/ time planned down-time.
-----------------------------------------------
To all users of the PI's ClusterName,
On Date/ time, the cluster will be down for planned maintenance for 3 hours.
During this down-time, we intend to:
- Test new GPU software capabilities
- Update the OS of the storage system.
- Update the BIOS of the 4 GPU clusters.
- Update the UPS software to address current software's limitations.
- Confirm backups and review other system software configurations.
-----------------------------------------------
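If the sample message is reused often, it could be generated from a template so the wording stays consistent. This is only a sketch under stated assumptions: the field names (`cluster`, `when`, `hours`, `tasks`) and the function `build_notice` are hypothetical, and the recipient list is left out because it is still undecided above.

```python
from string import Template

# Hypothetical template mirroring the sample message; field names are illustrative.
MESSAGE = Template(
    "Subject: $cluster: $when planned down-time.\n"
    "\n"
    "To all users of the $cluster,\n"
    "On $when, the cluster will be down for planned maintenance for $hours hours.\n"
    "During this down-time, we intend to:\n"
    "$tasks\n"
)


def build_notice(cluster, when, hours, tasks):
    """Fill the template; each task becomes one '- ' bullet line."""
    task_lines = "\n".join(f"- {task}" for task in tasks)
    return MESSAGE.substitute(
        cluster=cluster, when=when, hours=hours, tasks=task_lines
    )
```

Calling `build_notice("ExampleCluster", "Tue Nov 10, 9am", 3, ["Update the OS of the storage system."])` returns the subject line and body as a single string, ready to paste into an email.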
Communication timeline
1) ChemIT notifies group rep. of planned date.
2) Group rep. confirms there is no better date (or negotiates a better date, with ChemIT staff).
3) Group rep. notifies all users of cluster, using message crafted by ChemIT.
- Or, group rep. requests ChemIT send the email to all cluster users, on their behalf. Message sent to users' <NetID@cornell.edu> address.
4) The work day before the shutdown, ChemIT sends a reminder.
5) When the cluster is shut down, ChemIT sends a statement to that effect.
6) ChemIT sends a status report if the cluster is not up when expected, providing a new time estimate.
7) ChemIT sends a report when the server is again available.
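Step 4 of the timeline above depends on finding "the work day before" the shutdown. A minimal sketch of that date calculation follows; the function name `previous_work_day` is hypothetical, and it assumes work days are Monday to Friday and ignores holidays.

```python
from datetime import date, timedelta


def previous_work_day(shutdown: date) -> date:
    """Return the closest Monday-Friday date before the shutdown date.

    Assumption: holidays are not considered.
    """
    day = shutdown - timedelta(days=1)
    while day.weekday() >= 5:  # 5 is Saturday, 6 is Sunday
        day -= timedelta(days=1)
    return day
```

For a Monday shutdown, this returns the preceding Friday, so the reminder does not go out over the weekend.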
Ideas
- Establish a schedule for at least 6 months out. Why? Used by whom? What if things change?
Emergency work procedures
See also
Communication timeline
1) An unscheduled problem occurs.
...
4) ChemIT notifies group rep (users) of status and prognosis as soon as practicable.
A record (and notes) of Emergency Actions
Cluster or HPC name | Event date |
---|---|
ChemIT (C4)Eldor | |