...
Action | Notes |
---|---|
Schedule / notify representative / users | Communication timeline |
Evaluate hard drives (system left in operation) | Review all: |
Verify backups and related (system left in operation) | Back-in-time (local versioning); EZ-Backup (remote backup) |
Crane WS only: disable login; kill all user logins; unmount NFS from all WS | Create /etc/nologin; pkill -KILL -u *** (kill all users one by one); umount /notbackedup; umount /home/local/CORNELL |
Disable outside access | touch /etc/nologin, or vi /etc/nologin to add a message for users. This file is removed automatically after a system reboot. |
Delete jobs | qdel all |
Shut down headnode & nodes | cd /root; ./shutdown_nodes.sh (the script that shuts down all compute nodes. Run the pestat command before you run this script; you may need to modify the script so that it shuts down all nodes except "down" nodes) |
Crane WS only: shut down NFS server, Synology update, Windows update | |
Scheraga only: Synology update | Requires reboot. Safer with headnode off. |
Reboot switches | Usually only for Scheraga Matrix. All others have shared switches! Q: Are the Hoffmann compute nodes on isolated switches? |
Boot Synology | Scheraga Matrix only |
dd image of OS root partition | Boot from CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 (or md0) of=root.img |
Crane WS only: try yum update on as-chm-cran-12 first | Wait until the NFS server is up; reboot as-chm-cran-12; if the reboot is OK, run yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15, then reboot them |
Boot headnode | |
Verify drives with fsck | touch /forcefsck if we want to force fsck. Only do this every ~6 months; it is time-consuming. |
Test UPS and its notifications | Does it work as expected? How can we reasonably test it? Pull power? Rely on the self-test? |
Reboot anything that needs it | |
Boot nodes | |
Enable access | rm /etc/nologin |
Send email | |
Add "maintenance record" description to the HPC wiki page | |
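
The login lockout steps in the table (create /etc/nologin, kill user sessions one by one, remove /etc/nologin when done) could be wrapped in a small helper script. What follows is only a sketch under stated assumptions, not the script used on these clusters: the function names and the DRY_RUN and NOLOGIN_FILE variables are hypothetical. By default it runs in dry-run mode and only prints what it would do.

```shell
#!/bin/sh
# Sketch of the login lockout steps from the checklist.
# DRY_RUN and NOLOGIN_FILE are hypothetical knobs, not existing site settings.
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print commands only
NOLOGIN_FILE="${NOLOGIN_FILE:-/etc/nologin}"   # file that blocks non-root logins

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'DRY-RUN: %s\n' "$*"
    else
        "$@"
    fi
}

# Create the nologin file with a message shown to users who try to log in.
disable_logins() {
    msg="$1"
    if [ "$DRY_RUN" = "1" ]; then
        printf 'DRY-RUN: write "%s" to %s\n' "$msg" "$NOLOGIN_FILE"
    else
        printf '%s\n' "$msg" > "$NOLOGIN_FILE"
    fi
}

# Kill all processes of each named user, one user at a time.
kill_user_sessions() {
    for user in "$@"; do
        run pkill -KILL -u "$user"
    done
}

# Remove the nologin file to re-enable access.
enable_logins() {
    run rm -f "$NOLOGIN_FILE"
}
```

After sourcing the file, a call such as `kill_user_sessions alice bob` (hypothetical usernames) prints one pkill command per user; setting DRY_RUN=0 as root makes the functions act for real. Keeping dry-run as the default is a deliberate choice so the script can be reviewed before the maintenance window.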