This page includes a checklist for preparing any maintenance work and a listing of the sequence of steps to take.

Purpose

This checklist can be used as:

1) A list of things that require the cluster to be turned off, or that are more safely done when the cluster is turned off, even if not strictly required.

2) A list of things to consider doing when a cluster is being turned off / rebooted for other reasons.

 

Examples of activities taken in past maintenance windows:

During this down-time, we intend to:

  • Test new GPU software capabilities
  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Update the UPS software to address current software's limitations.
  • Confirm backups and review other system software configurations.

Maintenance Checklist

Activity category | Item / topic | Who decides or knows if action is possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments

P/T | Network / switches / router | Project lead or ticket owner

P/T | Disk changes / swaps / repairs | Project lead or ticket owner | Can use SMART for s/w RAID while system is running. All but 3: Freed ACERT Eldor, Hoffmann, and Scheraga Matrix

P/T | Add / remove nodes | Project lead or ticket owner
D | Motherboard BIOS / firmware updates (headnode and/or compute nodes), and why it's done | Michael (email reminder two days before maintenance)

  • Ask for review of all clusters in a maintenance cycle
    • Or: Ask for each cluster within a cycle?

Risk-management consideration vs. expected benefit.

Our RITMG peers only do this when trying to debug a problem.

D | Synology / storage system updates | Michael

D | Other firmware updates, including drives, cards, etc. | Michael | Same as for motherboard BIOS update.

D | Scheraga only: Synology updates | Roger (email reminder two days before maintenance). Backup is Michael. | Any other "storage system"?

 

D | Kernel updates

  • Only security updates.
  • Account / password sync
 
 

yum --security update

If the atrpms repo has problems, try this:

yum --disablerepo=atrpms  --security update
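The two commands above can be chained so the fallback only runs when the first attempt fails. This is a minimal sketch, not a tested procedure for these clusters: `security_update` and the `YUM_CMD` override are hypothetical names introduced here so the logic can be dry-run on a machine without yum. Note that the first attempt can fail for reasons unrelated to atrpms, so read the error before trusting the retry.

```shell
#!/bin/sh
# Sketch: security-only update with a fallback that skips the atrpms repo.
# YUM_CMD is a hypothetical override (defaults to the real yum) so the
# control flow can be exercised without touching the package manager.
security_update() {
    yum_cmd="${YUM_CMD:-yum}"

    # First attempt: security updates from all enabled repos.
    if "$yum_cmd" --security update; then
        return 0
    fi

    # Fallback: same update, with the atrpms repo disabled for this run only.
    echo "security update failed; retrying with atrpms disabled" >&2
    "$yum_cmd" --disablerepo=atrpms --security update
}
```

Run `security_update` as root on the headnode; yum will still prompt for confirmation because `-y` is deliberately not passed.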

D | OS updates

  • Only security updates (via YUM)
  
D | UPS check / updates / settings: review and confirm settings

  • Any updates?
 
 
Confirm thresholds make sense, especially as battery gets older.

Key:

P/T: Default answer is "no action". This is here in case a Project or Ticket has an impact on this topic and is awaiting implementation during the maintenance work.

D: Discretion or decision required. The answer may very well be "nothing this time around". Just because something can be done doesn't necessarily mean it should be done: weigh risk against benefit.

Sequence of actions taken during maintenance

Action

...

notes
Schedule / notify representative / users | Communication timeline
Evaluate hard drives (system left in operation)

Review all:

  • SMART
  • dmesg
  • /var/log/messages
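For the dmesg and /var/log/messages part of this review, a small filter can make disk-related lines easier to spot. This is only a sketch: `count_disk_errors` is a hypothetical helper, and its pattern list is an assumption that will miss some messages and may match harmless ones, so it does not replace reading the logs or checking SMART directly (for example with `smartctl -H /dev/sda`).

```shell
#!/bin/sh
# count_disk_errors: read log text on stdin and print how many lines look
# like disk trouble. The patterns below are assumptions, not a complete list.
count_disk_errors() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status clean in that case.
    grep -E -i -c 'I/O error|medium error|failed command|bad sector|ata[0-9]+.*(error|exception)' || true
}

# Example use on a running system (left commented out):
#   dmesg | count_disk_errors
#   count_disk_errors < /var/log/messages
```

A non-zero count is a prompt to look at the matching lines, not a verdict on the drive.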

...

· Disable outside access

· Delete jobs

· Chkdsk

· Shutdown headnode & nodes

...

Verify backups and related (system left in operation)

Back-in-time (local versioning)

  • Do what to verify?

EZ-Backup (remote backup)

  • Review log for recent uploads.
  • Q: Occasionally spot-check restoring a random, recent file?
Crane WS only: Disable login; kill all user logins; unmount NFS from all WS

touch /etc/nologin

killall -u *** Or: pkill -KILL -u *** (kill all users one by one)

umount /notbackedup

umount /home/local/CORNELL

Disable outside access

touch /etc/nologin, or vi /etc/nologin to add a message.

This file will be removed automatically after system reboot.
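Writing a short message into the file is usually more helpful than an empty one, because non-root users who are refused a login are shown the file's contents. A minimal sketch, assuming hypothetical helper names `disable_logins` and `enable_logins`; the optional file argument exists only so the functions can be tried against a scratch file before being used on /etc/nologin.

```shell
#!/bin/sh
# disable_logins MESSAGE [FILE]: write MESSAGE to the nologin file.
# FILE defaults to /etc/nologin; pass another path to rehearse safely.
disable_logins() {
    msg="$1"
    file="${2:-/etc/nologin}"
    printf '%s\n' "$msg" > "$file"
}

# enable_logins [FILE]: remove the nologin file if it exists.
enable_logins() {
    rm -f "${1:-/etc/nologin}"
}

# Example (as root, left commented out):
#   disable_logins "Cluster down for scheduled maintenance."
#   ... maintenance work ...
#   enable_logins
```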

Delete jobs | qdel all
Shutdown headnode & nodes

cd /root; ./shutdown_nodes.sh (the script that shuts down all compute nodes. Run the pestat command before this script; you may need to modify the script so that it shuts down all nodes except "down" nodes.)
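One way to get the list of nodes that are not "down" is to filter the pestat output rather than edit the node list by hand. This is a sketch under stated assumptions: `up_nodes` is a hypothetical helper, and the column layout it expects (a header line starting with "node", then one line per node with the name first and the state second) has not been checked against the pestat version installed here. The contents of shutdown_nodes.sh are not shown on this page, so how the resulting list is fed into it is left open.

```shell
#!/bin/sh
# up_nodes: read pestat-style output on stdin and print the names of nodes
# whose state column does not start with "down".
# Assumed layout: header line beginning with "node", then
#   <node-name> <state> <other columns...>
up_nodes() {
    awk 'NF >= 2 && $1 != "node" && $2 !~ /^down/ { print $1 }'
}

# Example (left commented out): list the nodes that would be shut down.
#   pestat | up_nodes
```

Comparing this list with the full pestat output before shutting anything down is a cheap sanity check.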

Crane WS only: shut down NFS server, Synology update, Windows update
Scheraga only: Synology update | Requires reboot. Safer with headnode off.

...

· Boot synology

· Boot headnode

    o iSCSI config (on startup): 5 sec ping for timeout, will change to 10 & 15

· Test UPS shutdown error

· Reboot anything that needs it

· Boot nodes

· Enable access

· Send email

    • UPS config: 10 seconds poll

Reboot switches | Only for Scheraga Matrix usually. All others have shared switches!

Q: Hoffmann compute nodes isolated switches?

Boot Synology | Scheraga Matrix only
Image OS root partition | Boot from CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 (or md0) of=root.img
Crane WS only:

Try one yum update on as-chm-cran-12 first.

Wait until the NFS server is up.

Reboot as-chm-cran-12; if the reboot is OK, run yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15, and reboot them.
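The "one workstation first, then the rest" pattern can be written as a loop that stops at the first failure, so a bad update is not rolled out to the remaining machines. A minimal sketch: `rolling_update` and its runner argument are hypothetical. The runner stands in for whatever actually performs the yum update, the reboot, and the check that the host came back, none of which is specified on this page.

```shell
#!/bin/sh
# rolling_update RUNNER HOST...: process hosts one at a time, in order,
# and stop at the first host for which RUNNER fails.
# RUNNER is a hypothetical command or function, invoked as: RUNNER <host>
rolling_update() {
    runner="$1"
    shift
    for host in "$@"; do
        echo "updating $host"
        if ! "$runner" "$host"; then
            echo "stopping: $host failed" >&2
            return 1
        fi
    done
}

# Example (left commented out; update_ws is a placeholder you would write):
#   rolling_update update_ws as-chm-cran-12 as-chm-cran-13 \
#       as-chm-cran-14 as-chm-cran-15
```

Listing as-chm-cran-12 first keeps the canary behaviour described above: the other three are only touched if it succeeds.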

Boot headnode 
Verify drives with fsck

touch /forcefsck if we want to force fsck.

Only do every ~6 months. Time consuming.
Ex: Abruna's cluster (small): 1-2 hours

Test UPS and its notifications

Does it work as expected?

How can this reasonably be tested? Pull power? Rely on self-test?

Reboot anything that needs it
Boot nodes
Enable access | rm /etc/nologin
Send email
Add a "maintenance record" description to HPC's wiki page

...