This page provides a checklist for preparing maintenance work and lists the sequence of steps to take.

Purpose

This checklist can be used as:

1) A list of tasks that require the cluster to be turned off, or that are more safely done when the cluster is turned off, even if that is not strictly required.

2) A list of things to consider doing when a cluster is being turned off / rebooted for other reasons.

Maintenance Checklist

Activity category	Item / topic	Who knows if action is possible or needed? (Presumed Lulu, unless marked otherwise)	Notes or comments
P/T	Network / switches / router	Project lead or ticket owner
P/T	Disk changes / swaps / repairs	Project lead or ticket owner	Can use SMART for software RAID while the system is running. All but 3: Freed ACERT Eldor, Hoffmann, and Scheraga Matrix
P/T	Add / remove nodes	Project lead or ticket owner
D	Motherboard BIOS updates (headnode and / or compute nodes), and why each is done.	Michael (email reminder two days before maintenance). Ask for review of all clusters in a maintenance cycle? Or: Ask for each cluster within a cycle?	Risk-management consideration vs. expected benefit. Our RITMG peers only do this when trying to debug a problem.
D	Other firmware updates, including drives, cards, etc.	Michael. Same as for motherboard BIOS update.
D	Scheraga only: Synology updates	Roger (email reminder two days before maintenance). Backup is Michael.	Any other "storage system"?
D	Kernel updates: only security updates. Account / password sync
D	OS updates: only security updates (via YUM)
D	UPS: review and confirm settings. Any updates?		Confirm thresholds make sense, especially as the battery gets older.

Key:

P/T: Default answer is "no action". Is there a Project or Ticket awaiting implementation during the maintenance work?

D: Discretion or decision required. The answer may very well be "nothing this time around". Just because something can be done doesn't necessarily mean it should be done: weigh risk / benefit.
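
The "email reminder two days before maintenance" entries in the checklist can be sketched in code. This is a minimal Python sketch, not existing tooling: the names REMINDERS, SYNOLOGY_CLUSTERS, reminder_date, and reminders_for are hypothetical, and the cluster name "Scheraga Matrix" is assumed to be the cluster the checklist calls "Scheraga". Owners, backup, and the two-day lead come from the table.

```python
from datetime import date, timedelta
from typing import List, Optional, Set, Tuple

# Assumption: the "Scheraga only" rows refer to the cluster named "Scheraga Matrix".
SYNOLOGY_CLUSTERS: Set[str] = {"Scheraga Matrix"}

# Hypothetical table built from the D rows that name a reminder:
# item -> (owner, backup, clusters it applies to; None means all clusters).
REMINDERS = {
    "Motherboard BIOS updates": ("Michael", None, None),
    "Other firmware updates": ("Michael", None, None),
    "Synology updates": ("Roger", "Michael", SYNOLOGY_CLUSTERS),
}

REMINDER_LEAD_DAYS = 2  # "email reminder two days before maintenance"


def reminder_date(maintenance_day: date) -> date:
    """Return the day on which reminder emails should go out."""
    return maintenance_day - timedelta(days=REMINDER_LEAD_DAYS)


def reminders_for(cluster: str, maintenance_day: date) -> List[Tuple[date, str, Optional[str], str]]:
    """List (send_on, owner, backup, item) for every reminder that applies to the cluster."""
    send_on = reminder_date(maintenance_day)
    return [
        (send_on, owner, backup, item)
        for item, (owner, backup, clusters) in REMINDERS.items()
        if clusters is None or cluster in clusters
    ]
```

Calling reminders_for with a cluster name and the planned maintenance day returns one entry per person to contact. Sending the email itself is left out on purpose.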

Sequence of actions taken during maintenance

Action	Notes
Schedule / notify representative / users	Communication timeline
Evaluate hard drives (system left in operation)	Review all: SMART, dmesg, /var/log/messages
Verify backups and related (system left in operation)	Back-in-time (local versioning): Do what to verify? EZ-Backup (remote backup): Review log for recent uploads. Q: Occasionally spot-check restoring a random, recent file?
Disable outside access
Delete jobs
Shut down headnode & nodes
Scheraga only: Synology update	Requires reboot. Safer with headnode off.
Reboot switches	Usually only for Scheraga Matrix. All others have shared switches! Q: Do the Hoffmann compute nodes have isolated switches?
Boot Synology	Scheraga Matrix only
Boot headnode
Verify drives with fsck	Only do every ~6 months. Time consuming. Ex: Abruna's cluster (small): 1-2 hours
Test UPS and its notifications	Does it work as expected? How to test reasonably? Pull power? Rely on self-test?
Reboot anything that needs it
Boot nodes
Enable access
Send email
Add a "maintenance record" description to HPC's wiki page
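
The sequence above has a few steps that do not apply on every run: the Synology steps and the switch reboot are tied to Scheraga Matrix, and fsck is only done every ~6 months. This is a minimal Python sketch of how a per-cluster run list could be derived. The names STEPS, FSCK_INTERVAL, fsck_due, and plan are hypothetical, and "~6 months" is assumed to mean 182 days.

```python
from datetime import date, timedelta
from typing import List, Optional

# Assumption: "~6 months" is approximated as 182 days.
FSCK_INTERVAL = timedelta(days=182)

# Steps in the order given in the table. The second field tags steps that
# do not apply on every run; None means the step always applies.
STEPS = [
    ("Schedule / notify representative / users", None),
    ("Evaluate hard drives", None),
    ("Verify backups and related", None),
    ("Disable outside access", None),
    ("Delete jobs", None),
    ("Shut down headnode & nodes", None),
    ("Synology update", "synology"),
    ("Reboot switches", "switches"),
    ("Boot Synology", "synology"),
    ("Boot headnode", None),
    ("Verify drives with fsck", "fsck"),
    ("Test UPS and its notifications", None),
    ("Reboot anything that needs it", None),
    ("Boot nodes", None),
    ("Enable access", None),
    ("Send email", None),
    ("Add maintenance record to HPC's wiki page", None),
]


def fsck_due(last_fsck: Optional[date], today: date) -> bool:
    """True when no fsck is on record or the last one is at least one interval old."""
    return last_fsck is None or today - last_fsck >= FSCK_INTERVAL


def plan(cluster: str, last_fsck: Optional[date], today: date) -> List[str]:
    """Return, in order, the steps that apply to this cluster on this day."""
    is_scheraga = cluster == "Scheraga Matrix"
    active = {
        "synology": is_scheraga,
        "switches": is_scheraga,  # the table says "usually", so treat this as a default
        "fsck": fsck_due(last_fsck, today),
    }
    return [name for name, tag in STEPS if tag is None or active[tag]]
```

For example, plan("Hoffmann", last_fsck, today) with a recent fsck on record drops the Synology, switch, and fsck steps, while a Scheraga Matrix run with no fsck on record returns the full list.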
