Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises for both scheduled maintenance and emergency work.

Scheduled maintenance and upgrade procedures

Cluster and HPC maintenance schedules

Regular maintenance of clusters requires downtime. A maintenance schedule reduces surprises and keeps required maintenance from being delayed unnecessarily.

Cluster Maintenance SOP: This page includes a checklist for preparing any maintenance work and lists the sequence of steps to take.
Templates of notification emails:
- Communication timeline: Sequence of communications for cluster maintenance.
- Downtime warning email template: The page contains the process and the template text. The text is sent two weeks before the downtime and again 24 hours before.
- Maintenance complete/delayed email template: One email is sent to all group users announcing either the completion of the work or a delay of the downtime.
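
The two warning dates follow directly from the planned downtime start. Below is a minimal sketch in Python of that arithmetic. The function name `warning_send_times` and the example date are hypothetical, chosen only for illustration, and are not part of ChemIT's documented process.

```python
from datetime import datetime, timedelta


def warning_send_times(downtime_start: datetime) -> list[datetime]:
    """Return the two times the downtime warning email is due:
    two weeks before the downtime, and again 24 hours before."""
    return [
        downtime_start - timedelta(weeks=2),
        downtime_start - timedelta(hours=24),
    ]


# Hypothetical downtime start, for illustration only.
start = datetime(2024, 6, 17, 8, 0)
for when in warning_send_times(start):
    print(when.isoformat(sep=" ", timespec="minutes"))
```

Computing both dates from a single start time keeps the two warnings consistent if the downtime is rescheduled.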

Emergency work procedures

Communication timeline

1) Something unscheduled goes wrong.

Power outages have recently been the most frequent cause of downtime.
- Please talk to us about cost/benefit options for reducing the impact of power outages, if you haven't already done so. Thank you!

2) ChemIT learns of the emergency situation.

This often happens when researchers confirm that the cluster is down and then email ChemIT with evidence.
- ChemIT staff are available M-F, 9am-5pm most days. Weekends, evenings, and holidays

3) ChemIT characterizes the problem and develops an initial prognosis.

4) ChemIT notifies group rep (users) of status and prognosis as soon as practicable.

A record (and notes) of Emergency Actions

Cluster or HPC name	Event date and action
Abruna
Ananth
Collum
Hoffmann
Lancaster (w/ Crane)
Scheraga
Widom-Loring
Eldor
