This page lists to-dos to consider during maintenance periods (when the cluster is turned off).
Purpose
This checklist can be used for:
1) A list of things that require the cluster to be turned off, or that are more safely done when the cluster is turned off, even if not strictly required.
2) A list of things to consider doing when a cluster is being turned off / rebooted for other reasons.
Maintenance Checklist
Activity | Item / topic | Who knows if action is possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments |
---|---|---|---|
P/T | Network / switches | Project lead or ticket owner | |
P/T | Disk changes / swaps / repairs | Project lead or ticket owner | |
P/T | Add / remove nodes | Project lead or ticket owner | |
D | Motherboard: BIOS / firmware updates (and why it's done...) | Michael (how to remind / ask?) | Risk-management consideration vs. expected benefit. Q: Just head node? Q: CAC: What do they do? (e.g., Ananth) |
D | Firmware updates, including drives, cards, etc. | Michael (how to remind / ask?) | |
D | Scheraga only: Synology updates | Michael (how to remind / ask?) | Any other "storage system"? |
D | Kernel updates | | |
D | OS updates | | |
D | UPS: Review and confirm settings | | Confirm thresholds make sense, especially as the battery gets older. |
Key:
P/T: Default answer is "no action". Is there a Project or Ticket awaiting implementation during the maintenance work?
D: Discretion or decision required. The answer may very well be "nothing this time around". Just because something can be done doesn't necessarily mean it should be done: weigh risk against benefit.
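The key above can be sketched as a small data model, which may help if the checklist is ever tracked in a script instead of a table. This is a minimal sketch only: the `ChecklistItem` class, its `ticket` and `decision` fields, and the `status` function are hypothetical names introduced for illustration and are not part of any existing tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    activity: str                    # "P/T" or "D", as defined in the key
    topic: str                       # the "Item / topic" column
    ticket: Optional[str] = None     # hypothetical: project or ticket reference
    decision: Optional[str] = None   # hypothetical: recorded decision for "D" items


def status(item: ChecklistItem) -> str:
    """Return what the key implies for this item in the current maintenance window."""
    if item.activity == "P/T":
        # Default answer is "no action" unless a project or ticket is waiting.
        return f"act on {item.ticket}" if item.ticket else "no action"
    if item.activity == "D":
        # A decision is required; "nothing this time around" is a valid decision.
        return item.decision if item.decision else "decision needed"
    raise ValueError(f"unknown activity code: {item.activity!r}")
```

For example, `status(ChecklistItem("P/T", "Add / remove nodes"))` returns `"no action"`, while a "D" item with no recorded decision returns `"decision needed"`, so nothing discretionary is skipped silently.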
Sequence of actions taken during maintenance
Action | Notes |
---|---|
Schedule / notify representative / users | |
Evaluate hard drives | |
Verify backups | |
Disable outside access | |
Delete jobs | |
Shut down headnode & nodes | |
Synology update | Requires reboot. Safer with headnode off. |
Reboot switches | |
Boot Synology | |
Boot headnode | |
Verify drives with fsck | Only do every ~6 months. Time-consuming. E.g., Abruna's cluster (small): 1-2 hours |
Test UPS and its notifications | Does it work as expected? How can it reasonably be tested? |
Reboot anything that needs it | |
Boot nodes | |
Enable access | |
Send email | |
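The sequence above can also be written down as an ordered list, so that a script can print the plan for a given maintenance window and leave out the fsck step when it is not yet due. This is a minimal sketch under stated assumptions: the step names follow the table, "~6 months" is approximated as 182 days, and `STEPS`, `FSCK_INTERVAL`, `fsck_due`, and `plan` are hypothetical names introduced here. It only builds the list of steps; it does not run any of them.

```python
from datetime import date, timedelta

# Steps in the order given in the table above.
STEPS = [
    "Schedule / notify representative / users",
    "Evaluate hard drives",
    "Verify backups",
    "Disable outside access",
    "Delete jobs",
    "Shut down headnode & nodes",
    "Synology update",
    "Reboot switches",
    "Boot Synology",
    "Boot headnode",
    "Verify drives with fsck",
    "Test UPS and its notifications",
    "Reboot anything that needs it",
    "Boot nodes",
    "Enable access",
    "Send email",
]

FSCK_STEP = "Verify drives with fsck"
FSCK_INTERVAL = timedelta(days=182)  # assumption: "~6 months" taken as 182 days


def fsck_due(last_fsck: date, today: date) -> bool:
    """True if at least FSCK_INTERVAL has passed since the last fsck run."""
    return today - last_fsck >= FSCK_INTERVAL


def plan(last_fsck: date, today: date) -> list[str]:
    """Return the steps for this window, dropping fsck when it is not yet due."""
    include_fsck = fsck_due(last_fsck, today)
    return [step for step in STEPS if step != FSCK_STEP or include_fsck]


if __name__ == "__main__":
    # Example dates are illustrative only.
    for number, step in enumerate(plan(date(2024, 1, 1), date(2024, 8, 1)), start=1):
        print(f"{number:2d}. {step}")
```

Keeping the order in one list makes the dependency noted in the table easy to check: "Synology update" sits after the headnode shutdown and before "Boot headnode".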