Action | Notes
Schedule / notify representative / users | Communication timeline
Evaluate hard drives (system left in operation)

Review all:

  • SMART
  • dmesg
  • /var/log/messages
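
The three checks above could be gathered in one pass. This is a minimal sketch, assuming smartmontools is installed and the drives appear as /dev/sd?. The function names and the error patterns are illustrative only and are not a complete list.

```shell
#!/bin/sh
# Count lines in a log file that look like disk trouble.
# The pattern list is an assumption; extend it for your controllers.
count_disk_errors() {
    grep -Eic 'i/o error|medium error|ata[0-9]+.*(failed|error)' "$1"
}

# Print the SMART overall health verdict for each drive
# (needs root and smartmontools).
check_smart() {
    for dev in /dev/sd?; do
        [ -b "$dev" ] || continue
        echo "== $dev"
        smartctl -H "$dev"
    done
}

# Usage on a live system (read-only, so the system can stay in operation):
#   check_smart
#   dmesg > /tmp/dmesg.txt && count_disk_errors /tmp/dmesg.txt
#   count_disk_errors /var/log/messages
```

A non-zero count is only a prompt to read the matching lines by hand; it is not a verdict on the drive.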

Verify backups and related (system left in operation)

Back-in-time (local versioning)

  • Do what to verify?

EZ-Backup (remote backup)

  • Review log for recent uploads.
  • Q: Occasionally spot-check by restoring a random, recent file?
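
If a spot-check restore is adopted, the comparison step might look like the sketch below. It assumes the file has already been restored to a scratch location by whatever means the backup tool provides. The function names are hypothetical, and shuf comes from GNU coreutils.

```shell
#!/bin/sh
# Succeeds only if the restored copy is byte-identical to the live file.
verify_restore() {
    original=$1
    restored=$2
    if cmp -s "$original" "$restored"; then
        echo "OK: $restored matches $original"
    else
        echo "MISMATCH: $restored differs from $original" >&2
        return 1
    fi
}

# Pick one random file modified within the last 7 days under a directory.
pick_recent_file() {
    find "$1" -type f -mtime -7 2>/dev/null | shuf -n 1
}

# Usage:
#   f=$(pick_recent_file /home)
#   ... restore "$f" to /tmp/restored with the backup tool ...
#   verify_restore "$f" /tmp/restored
```

A mismatch does not always mean a bad backup: a file edited after the last upload will differ legitimately.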
Crane WS only: disable logins, kill all user sessions, and unmount NFS from all workstations:

/etc/nologin

pkill -KILL -u *** (kill each user's processes, one user at a time)

umount /notbackedup

umount /home/local/CORNELL
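
Killing users one by one can be scripted. The sketch below only prints the pkill commands so they can be reviewed before anything runs. The function name is hypothetical, and skipping root is an assumption about what is wanted here.

```shell
#!/bin/sh
# Read usernames on stdin, drop duplicates and root,
# and print one pkill command per remaining user.
build_kill_commands() {
    sort -u | while read -r user; do
        [ -n "$user" ] || continue
        [ "$user" = "root" ] && continue
        echo "pkill -KILL -u $user"
    done
}

# Usage: review the list first, then pipe it to sh to act on it.
#   who | awk '{print $1}' | build_kill_commands
#   who | awk '{print $1}' | build_kill_commands | sh
```

Create /etc/nologin before running this, so that users cannot log back in between the kill and the umount.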

Disable outside access

touch /etc/nologin, or vi /etc/nologin to add a message for users.

This file is removed automatically when the system reboots.
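
A small wrapper makes the disable and enable steps symmetric. This is a sketch: the function names, the NOLOGIN_FILE variable, and the default message are hypothetical. While the file exists, pam_nologin refuses non-root logins and shows them the file's contents.

```shell
#!/bin/sh
# NOLOGIN_FILE can be overridden to try the functions on a scratch path.
NOLOGIN_FILE=${NOLOGIN_FILE:-/etc/nologin}

# Create the file with a message for users who try to log in.
disable_logins() {
    printf '%s\n' "${1:-System down for maintenance.}" > "$NOLOGIN_FILE"
}

# Remove the file to allow logins again.
enable_logins() {
    rm -f "$NOLOGIN_FILE"
}

# Usage (as root):
#   disable_logins "Cluster maintenance in progress"
#   ... maintenance ...
#   enable_logins
```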

Delete jobs | qdel all
Shut down headnode & nodes

cd /root; ./shutdown_nodes.sh (this script shuts down all compute nodes. Run pestat before you run it; you may need to modify the script so that it shuts down all nodes except "down" nodes.)
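
One way to avoid editing the script by hand is to derive the node list from pestat. The filter below is a sketch and not part of shutdown_nodes.sh. It assumes the node name is in the first column of pestat output and the state in the second; check this against the local pestat version.

```shell
#!/bin/sh
# Print the names of nodes that are not reported as down.
# Assumes column 1 is the node name and column 2 the state;
# a header row whose first field is "node" is skipped.
nodes_to_shutdown() {
    awk 'NF >= 2 && tolower($1) != "node" && $2 !~ /^down/ { print $1 }'
}

# Usage:
#   pestat | nodes_to_shutdown
```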

Crane WS only: shut down NFS server, Synology update, Windows update
Scheraga only: Synology update | Requires reboot. Safer with headnode off.
Reboot switches

Only for Scheraga Matrix usually. All others have shared switches!

Q: Are the Hoffmann compute nodes on isolated switches?

Boot Synology | Scheraga Matrix only
dd image of OS root partition | Boot from CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 of=root.img (use /dev/md0 in place of /dev/sda1 where root is on md0)
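
The imaging step can be followed by a check that the image matches its source. This is a sketch with a hypothetical function name. It assumes the system is booted from the live CD, so the root partition is not mounted and does not change between the copy and the comparison.

```shell
#!/bin/sh
# Copy a partition (or any file) to an image file,
# then confirm that the copy is byte-identical to the source.
image_partition() {
    src=$1
    img=$2
    dd if="$src" of="$img" bs=1048576 2>/dev/null || return 1
    cmp -s "$src" "$img"
}

# Usage from the live CD, writing to a disk other than the one being imaged:
#   sfdisk -d /dev/sda > sda.partition
#   image_partition /dev/sda1 root.img
```

The comparison reads the partition a second time, so it roughly doubles the time this step takes.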
Crane WS only:

Try one yum update on as-chm-cran-12 first.

Wait until the NFS server is up.

Reboot as-chm-cran-12. If the reboot is OK, run yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15,

and reboot them.

Boot headnode 
Verify drives with fsck

touch /forcefsck if we want to force fsck on the next boot.

Only do every ~6 months. Time consuming.
Ex: Abruna's cluster (small): 1-2 hours

Test UPS and its notifications

Does it work as expected?

How can we reasonably test this? Pull power? Rely on the self-test?

Reboot anything that needs it
Boot nodes
Enable access | rm /etc/nologin
Send email
Add a "maintenance record" description to the HPC wiki page