This page provides a checklist for preparing cluster maintenance work and the sequence of steps to take during the maintenance itself.
Purpose
This checklist can be used for:
1) A list of tasks that require the cluster to be turned off, or that are more safely done while it is off, even if not strictly required.
2) A list of tasks to consider when a cluster is being turned off or rebooted for other reasons.
Maintenance Checklist
Activity | Item / topic | Who knows if action is possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments |
---|---|---|---|
P/T | Network / switches / router | Project lead or ticket owner | |
P/T | Disk changes / swaps / repairs | Project lead or ticket owner | Can use SMART for s/w RAID while the system is running. All but 3: Freed ACERT Eldor, Hoffmann, and Scheraga Matrix |
P/T | Add / remove nodes | Project lead or ticket owner | |
D | Motherboard BIOS updates (headnode and/or compute nodes), and why they are done | Michael (email reminder two days before maintenance) | Risk-management consideration vs. expected benefit. Our RITMG peers only do this when trying to debug a problem. |
D | Other firmware updates, including drives, cards, etc. | Michael. Same as for motherboard BIOS updates. | |
D | Scheraga only: Synology updates | Roger (email reminder two days before maintenance). Backup is Michael. | Any other "storage system"? |
D | Kernel updates | | yum --security update; if the atrpms repo has problems, try: yum --disablerepo=atrpms --security update (see the sketch below this table) |
D | OS updates | | |
D | UPS: review and confirm settings | | Confirm thresholds make sense, especially as the battery gets older. |
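A minimal sketch of the kernel security-update step from the table above, assuming a CentOS system with the yum security plugin installed and the atrpms repository configured:

```bash
# Apply security-only updates (requires the yum-plugin-security package
# on older CentOS releases).
yum --security update

# If the atrpms repository is broken or unreachable, exclude it for this run:
yum --disablerepo=atrpms --security update
```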
Key:
P/T: Default answer is "no action". Is there a Project or Ticket awaiting implementation during the maintenance work?
D: Discretion or decision required. The answer may well be "nothing this time around". Just because something can be done doesn't necessarily mean it should be done: weigh risk against benefit.
Sequence of actions taken during maintenance
Action | Notes |
---|---|
Schedule / notify representatives / users | Communication timeline |
Evaluate hard drives (system left in operation) | Review all: |
Verify backups and related (system left in operation) | Back-in-time (local versioning); EZ-Backup (remote backup) |
Crane WS only: disable logins; kill all user logins; unmount NFS from all WS | touch /etc/nologin; pkill -KILL -u *** (kill all users one by one); umount /notbackedup; umount /home/local/CORNELL (see the lockdown sketch below this table) |
Disable outside access | touch /etc/nologin, or vi /etc/nologin to add a message. This file is removed automatically after a system reboot. |
Delete jobs | qdel all |
Shut down headnode & compute nodes | cd /root; ./shutdown_nodes.sh (the script that shuts down all compute nodes). Try the pestat command before running it; you may need to modify the script to shut down all nodes except "down" nodes. (See the shutdown sketch below this table.) |
Crane WS only: shut down NFS server; Synology update; Windows update | |
Scheraga only: Synology update | Requires a reboot. Safer with the headnode off. |
Reboot switches | Usually only for Scheraga Matrix. All others have shared switches! Q: are the Hoffmann compute-node switches isolated? |
Boot Synology | Scheraga Matrix only |
dd image of the OS root partition | Boot from a CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 of=root.img (use /dev/md0 on software RAID). (See the imaging sketch below this table.) |
Crane WS only: try one yum update on as-chm-cran-12 | Wait until the NFS server is up; reboot as-chm-cran-12; if the reboot is OK, yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15, then reboot. (See the staged-update sketch below this table.) |
Boot headnode | |
Verify drives with fsck | touch /forcefsck to force fsck on the next boot. Only do this every ~6 months; it is time consuming. |
Test UPS and its notifications | Does it work as expected? How to test reasonably: pull power, or rely on the self-test? |
Reboot anything else that needs it | |
Boot compute nodes | |
Enable access | rm /etc/nologin |
Send email | |
Add a "maintenance record" entry to the HPC wiki page | |