...
2) A list of things to consider doing when a cluster is being turned off or rebooted for other reasons.
Maintenance Checklist
Activity | Item / topic | Who knows whether action is possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments |
---|---|---|---|
P/T | Network / switches / router | Project lead or ticket owner | |
P/T | Disk changes / swaps / repairs | Project lead or ticket owner | Can use SMART for s/w RAID while system is running. All but 3: Freed ACERT Eldor, Hoffmann, and Scheraga Matrix |
P/T | Add / remove nodes | Project lead or ticket owner | |
D | Motherboard BIOS updates (headnode and/or compute nodes), and why it's done. | Michael (email reminder two days before maintenance) | Risk-management consideration vs. expected benefit. Our RITMG peers only do this when trying to debug a problem. |
D | Other firmware updates, including drives, cards, etc. | Michael. Same as for motherboard BIOS update. | |
D | Scheraga only: Synology updates | Roger (email reminder two days before maintenance). Backup is Michael. | Any other "storage system"? |
D | Kernel updates | | yum --security update. If the atrpms repo has problems, try this: yum --disablerepo=atrpms --security update |
D | OS Updates | | |
D | UPS: Review and confirm settings | | Confirm thresholds make sense, especially as battery gets older. |
Key:
P/T: Default answer is "no action". Is there a Project or Ticket awaiting implementation during the maintenance work?
...
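The kernel-update fallback noted in the checklist above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used on these clusters: the function name security_update and the YUM override variable are hypothetical, and it assumes the yum security plugin is installed so that --security is accepted. Note that any failure of the first attempt triggers the retry, not only an atrpms problem.

```shell
#!/bin/sh
# Sketch: run a security-only yum update, retrying without the atrpms repo
# if the first attempt fails for any reason.
# YUM is a hypothetical override so the logic can be exercised in a dry run.
security_update() {
    yum_cmd="${YUM:-yum}"
    if "$yum_cmd" --security update; then
        echo "security update succeeded with all repos"
        return 0
    fi
    echo "first attempt failed; retrying with atrpms disabled" >&2
    "$yum_cmd" --disablerepo=atrpms --security update
}
```

Calling security_update as root runs the real yum; setting YUM to a stub command lets the fallback path be checked without touching the system.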
Sequence of actions taken during maintenance
Action | Notes |
---|---|
Schedule / notify representative / users | Communication timeline |
Evaluate hard drives (system left in operation) | Review all: |
Verify backups and related (system left in operation) | Back-in-time (local versioning); EZ-Backup (remote backup) |
Crane WS only: Disable login; kill all user logins; umount NFS from all WS | /etc/nologin; killall -u *** or pkill -KILL -u *** (kill all users one by one); umount /notbackedup; umount /home/local/CORNELL |
Disable outside access | touch /etc/nologin, or vi /etc/nologin to add message text. This file will be removed automatically after system reboot. |
Delete jobs | qdel all |
Shutdown headnode & nodes | cd /root; ./shutdown_nodes.sh (the script that shuts down all compute nodes). Run pestat before running the script; you may need to modify the script so that it shuts down all nodes except "down" nodes. |
Crane WS only: Shut down NFS server, Synology update, Windows update | |
Scheraga only: Synology update | Requires reboot. Safer with headnode off. |
Reboot switches | Usually only for Scheraga Matrix. All others have shared switches! Q: Do Hoffmann compute nodes have isolated switches? |
Boot Synology | Scheraga Matrix only |
dd image of OS root partition | Boot from CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 (or md0) of=root.img |
Crane WS only: Try one yum update on as-chm-cran-12 | Wait until the NFS server is up; reboot as-chm-cran-12; if the reboot is OK, run yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15, then reboot them |
Boot headnode | |
Verify drives with fsck | touch /forcefsck to force fsck. Only do this every ~6 months; it is time-consuming. |
Test UPS and its notifications | Does it work as expected? How can it reasonably be tested? Pull power? Rely on self test? |
Reboot anything that needs it | |
Boot nodes | |
Enable access | rm /etc/nologin |
Send email | |
Add "maintenance record" description to the HPC wiki page | |
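
The note on the shutdown step suggests skipping nodes that pestat already reports as down. A minimal sketch of that filter follows. It is not the contents of shutdown_nodes.sh, which are not shown here; the function names are hypothetical. It assumes pestat prints one node per line with the node name in the first column and the state in the second, and that down nodes have a state beginning with "down". Check the real pestat output on each cluster before relying on this.

```shell
#!/bin/sh
# Sketch: list nodes worth shutting down, given pestat-style lines on stdin.
# Assumed input format (not verified for every pestat version):
#   <node> <state> <other columns...>
# A header line whose first column is "node" is skipped.
live_nodes() {
    awk 'NF >= 2 && $1 != "node" && $2 !~ /^down/ { print $1 }'
}

# Dry run by default: print the command instead of running it.
# Set DRY_RUN=0 to actually ssh to each node (assumes root ssh keys are set up).
shutdown_live_nodes() {
    while read -r node; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: ssh $node shutdown -h now"
        else
            # -n keeps ssh from consuming the node list on stdin
            ssh -n "$node" shutdown -h now
        fi
    done
}
```

A possible use is pestat | live_nodes | shutdown_live_nodes, reviewing the dry-run output before repeating the command with DRY_RUN=0.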