
Action | Notes
Schedule / notify representative / users

Communication timeline
Evaluate hard drives (system left in operation)

Review all:

  • SMART
  • dmesg
  • /var/log/messages
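
The three reviews above could be scripted roughly as follows. This is a minimal sketch, assuming smartmontools (smartctl) is installed; the function names, the device names in the example, and the error pattern are illustrative assumptions, not part of the original procedure.

```sh
#!/bin/sh
# Sketch only: function names and the error pattern are assumptions.

# Count lines in a log (file argument or stdin) that look like disk or I/O trouble.
count_disk_errors() {
    grep -Eic 'i/o error|medium error|ata[0-9]+.*(error|failed)|smart.*fail' "$@"
}

# Print the overall SMART health verdict for each given device.
smart_health() {
    for dev in "$@"; do
        echo "== $dev"
        smartctl -H "$dev"
    done
}

# Example (run as root on the headnode; device names are placeholders):
#   smart_health /dev/sda /dev/sdb
#   dmesg | count_disk_errors
#   count_disk_errors /var/log/messages
```

A non-zero count is only a prompt to read the matching lines by hand; the pattern is deliberately broad and will produce false positives.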

Verify backups and related (system left in operation)

Back-in-time (local versioning)

  • Q: What should we do to verify?

EZ-Backup (remote backup)

  • Review log for recent uploads.
  • Q: Occasionally spot-check restoring a random, recent file?
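
If a spot-check restore is adopted, the comparison step might look like the sketch below. It assumes the file has already been restored to a scratch location with the backup client; the restore step itself is not shown, and restore_matches and the example paths are hypothetical.

```sh
# Sketch only: restore_matches is a hypothetical helper; the restore itself
# is done with the backup client and is not shown here.
# Usage: restore_matches ORIGINAL RESTORED
restore_matches() {
    # cmp -s is silent and exits 0 only when both files are byte-identical.
    if cmp -s "$1" "$2"; then
        echo "MATCH: $1"
    else
        echo "DIFFERS: $1"
        return 1
    fi
}

# Example (paths are placeholders):
#   restore_matches /home/user/data.txt /tmp/restore/data.txt
```

A mismatch does not necessarily mean the backup failed: a recent file may simply have changed since it was last uploaded.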
Disable outside access

Run touch /etc/nologin, or use vi /etc/nologin to create the file with a message for users.

This file is removed automatically when the system reboots.
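
Creating the file with a message can also be done in one step. A minimal sketch, where block_logins is a hypothetical helper name and the message text is only an example; while /etc/nologin exists, pam_nologin refuses non-root logins and shows them the file's contents.

```sh
# Sketch only: block_logins is a hypothetical helper name.
# Usage: block_logins "message" [file]   (file defaults to /etc/nologin)
block_logins() {
    msg=$1
    file=${2:-/etc/nologin}
    # Write the message that users will see when their login is refused.
    printf '%s\n' "$msg" > "$file"
}

# Example (as root):
#   block_logins "Cluster maintenance in progress. Logins are disabled."
```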

Delete jobs

qdel all
Shutdown headnode & nodes

cd /root; ./shutdown_nodes.sh (this script shuts down all compute nodes. Run the pestat command before running it; you may need to modify the script so that it shuts down all nodes except those marked "down".)
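
One way to build the list of nodes that are not "down" is to filter the pestat output. This is a sketch under assumptions: it assumes pestat prints one header line followed by one line per node, with the node name in column 1 and the state in column 2. nodes_to_shutdown is a hypothetical helper, and how shutdown_nodes.sh would consume the list is not shown because the script's contents are not given here.

```sh
# Sketch only: assumes one header line, then "node state ..." per line.
nodes_to_shutdown() {
    # Skip the header and any node whose state begins with "down".
    awk 'NR > 1 && $2 !~ /^down/ { print $1 }'
}

# Example:
#   pestat | nodes_to_shutdown
```

Check the output against a plain pestat run before feeding it to anything that powers nodes off.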

Scheraga only: Synology update

Requires reboot. Safer with headnode off.
Reboot switches

Only for Scheraga Matrix usually. All others have shared switches!

Q: Are the Hoffmann compute nodes on isolated switches?

Boot Synology

Scheraga Matrix only

dd image of OS root partition

Boot from the CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 (or /dev/md0) of=root.img
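
The imaging step can be followed by a byte-for-byte check that the image matches the source. A minimal sketch, where image_partition is a hypothetical helper and the 1 MiB block size is an arbitrary choice; the device and file names in the example are the ones used above.

```sh
# Sketch only: image_partition is a hypothetical helper.
# Usage: image_partition SOURCE_DEVICE IMAGE_FILE
image_partition() {
    # Copy the source to the image file in 1 MiB blocks.
    dd if="$1" of="$2" bs=1048576 || return 1
    # Exit 0 only if the image is byte-identical to the source.
    cmp -s "$1" "$2"
}

# Example (from the live CD, with the output directory on separate storage):
#   sfdisk -d /dev/sda > sda.partition
#   image_partition /dev/sda1 root.img && echo "image verified"
```

The comparison reads the whole partition a second time, so it roughly doubles the time this step takes.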
Boot headnode 
Verify drives with fsck

touch /forcefsck if we want to force fsck.

Only do every ~6 months. Time consuming.
Ex: Abruna's cluster (small): 1-2 hours

Test UPS and its notifications

Does it work as expected?

How can we reasonably test it? Pull power? Rely on the self-test?

Reboot anything that needs it
Boot nodes 
Enable access

rm /etc/nologin
Send email 
Add a "maintenance record" description to the HPC wiki page