Scheduled maintenance and upgrade procedures
Summary
Details
ChemIT notifies the cluster lead that maintenance will occur on a specific upcoming date.
- What is a long enough lead time for the group?
- What is a short enough lead time for ChemIT?
Message will state:
- Date and time of shut-down. Expected duration of shut-down.
- Most events will occur Mon-Thur, 9am-5pm EST, when staffing and backup folks are available.
- Purpose summary.
Message will be sent to:
- Whom?
Typical work done during maintenance
(Point to Lulu's current, active checklist! Nov 2015)
Update BIOS (and why it's done...)
Update OS
Update...
Test UPS
Test backups
Test...
Sample message:
To: ?
Subject: PI's ClusterName: Date/time planned down-time.
-----------------------------------------------
To all users of the PI's ClusterName,
On Date/time, the cluster will be down for planned maintenance for 3 hours.
During this down-time, we intend to:
- Test new GPU software capabilities.
- Update the OS of the storage system.
- Update the BIOS of the 4 GPU clusters.
- Update the UPS software to address current software's limitations.
- Confirm backups and review other system software configurations.
-----------------------------------------------
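The sample message above could be filled in from a small template. The following is a minimal sketch only: the function name, its parameters, and the date format are hypothetical illustrations and not part of any existing ChemIT tooling.

```python
from datetime import datetime


def build_downtime_notice(cluster_name, start, duration_hours, tasks):
    """Return (subject, body) for a planned down-time notice.

    Hypothetical helper: the wording mirrors the sample message,
    but the name and arguments are assumptions made for illustration.
    """
    when = start.strftime("%Y-%m-%d %H:%M")
    subject = f"{cluster_name}: {when} planned down-time."
    lines = [
        f"To all users of the {cluster_name},",
        f"On {when}, the cluster will be down for planned maintenance "
        f"for {duration_hours} hours.",
        "During this down-time, we intend to:",
    ]
    # One bullet per planned task, matching the list style of the sample.
    lines += [f"- {task}" for task in tasks]
    return subject, "\n".join(lines)


subject, body = build_downtime_notice(
    "ExampleCluster",                    # placeholder cluster name
    datetime(2015, 11, 17, 9, 0),        # placeholder date and time
    3,
    ["Update the OS of the storage system.", "Confirm backups."],
)
```

Keeping the subject and body in one function means the reminder and the original notice can reuse the same wording.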
Communication timeline
1) ChemIT notifies group rep. of planned date.
2) Group rep. confirms there is no better date (or negotiates a better date with ChemIT staff).
3) Group rep. notifies all users of cluster, using message crafted by ChemIT.
- Or, group rep. requests ChemIT send the email to all cluster users, on their behalf. Message sent to users' <NetID@cornell.edu> address.
4) The work day before the shut-down, ChemIT sends a reminder.
5) When the cluster is shut down, ChemIT sends a statement to that effect.
6) ChemIT sends a status report if the cluster is not up when expected, providing a new time estimate.
7) ChemIT sends a report when the server is again available.
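Step 4 depends on finding "the work day before" the shut-down date. A minimal sketch of that calculation follows, assuming a Monday to Friday work week and ignoring holidays; the function name is hypothetical.

```python
from datetime import date, timedelta


def previous_work_day(shutdown_day):
    """Return the latest Mon-Fri date strictly before shutdown_day.

    Assumption: holidays are not considered, only weekends are skipped.
    """
    day = shutdown_day - timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


# A Monday shut-down gets its reminder on the preceding Friday.
reminder_day = previous_work_day(date(2015, 11, 16))
```

A real schedule would also need to skip university holidays, which this sketch leaves out.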
Ideas
- Establish a schedule for at least 6 months out. Why? Used by whom? What if things change?
Emergency work procedures
Communication timeline
1) Something bad happens, which was not scheduled.
- Power outages have been the most frequent recent reason for downtime.
- Please talk to us about cost/benefit options for reducing the impact of power outages, if you haven't already done so. Thank you!
2) ChemIT learns of the emergency situation.
- Often, researchers confirm the cluster is down and then email ChemIT with evidence.
- ChemIT staff are available M-F, 9am-5pm most days. Weekends, evenings, and holidays
3) ChemIT characterizes the problem and develops an initial prognosis.
4) ChemIT notifies group rep (users) of status and prognosis as soon as practicable.