Excerpt |
---|
This page includes a checklist of to-do's to consider during maintenance periods (when the cluster is turned off), a checklist for preparing the maintenance work, and the sequence of steps to take. |
Table of Contents |
---|
Purpose
...
2) A list of things to consider doing when a cluster is being turned off / rebooted for other reasons.
Maintenance Checklist
Activity | Item/ topic | Who knows if action possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments |
---|---|---|---|
P/T | Network / switches / router | Project lead or ticket owner | |
P/T | Disk changes / swaps / repairs | Project lead or ticket owner | Can use SMART for s/w RAID while system is running. All but 3: Freed ACERT Eldor, Hoffmann, and Scheraga Matrix |
P/T | Add / remove nodes | Project lead or ticket owner | |
D | Motherboard BIOS updates (headnode and/or compute nodes), and why it's done | Michael (email reminder two days before maintenance) | Risk-management consideration vs. expected benefit. Q: Just head node? Q: CAC: what do they do? (ex. Ananth) Our RITMG peers only do this when trying to debug a problem. |
D | Other firmware updates, including drives, cards, etc. | Michael | Same as for motherboard BIOS update. |
D | Scheraga only: Synology updates | Roger (email reminder two days before maintenance). Backup is Michael. | Any other "storage system"? |
D | Kernel updates | | yum --security update; if the atrpms repo has problems, try: yum --disablerepo=atrpms --security update |
D | OS updates | | |
D | UPS: Review and confirm settings | | Confirm thresholds make sense, especially as the battery gets older. |
Key:
P/T: Default answer is "no action". This entry is here in case a Project or Ticket has an impact on this topic, with work awaiting implementation during the maintenance window.
D: Discretion or decision required. May very well be "nothing this time around". And just because something can be done doesn't necessarily mean it should be done: weigh risk vs. benefit.
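The kernel-update entry in the checklist above can be sketched as a small shell function. The yum flags come straight from the checklist; the function name and the dry-run pattern (passing "echo yum" instead of "yum") are illustrative additions, not part of the documented procedure.

```shell
# Sketch of the kernel security-update step, with the atrpms fallback
# from the checklist. Pass the yum command to use: "yum" for real,
# "echo yum" for a dry run that just prints what would be executed.
security_update() {
    yum_cmd="$1"
    if ! $yum_cmd --security update; then
        # atrpms repo metadata sometimes breaks; retry without it
        $yum_cmd --disablerepo=atrpms --security update
    fi
}

# dry run: security_update "echo yum"
```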
Sequence of actions taken during maintenance
Action | Notes |
---|---|
...
Schedule / notify representative / users | Communication timeline |
Evaluate hard drives (system left in operation) | Review all drives |
Verify backups and related (system left in operation) | Back-in-time (local versioning); EZ-Backup (remote backup) |
...
· Disable outside access
· Delete jobs
· Chkdsk
· Shutdown headnode & nodes
...
Crane WS only: Disable login; kill all user logins; umount NFS from all WS | /etc/nologin; killall -u *** or pkill -KILL -u *** (kill each user's processes one by one); umount /notbackedup; umount /home/local/CORNELL |
Disable outside access | touch /etc/nologin, or vi /etc/nologin to add explanatory text. This file is removed automatically after system reboot. |
Delete jobs | qdel all |
Shutdown headnode & nodes | cd /root; ./shutdown_nodes.sh (the script to shut down all compute nodes). Run the pestat command before running this script; you may need to modify the script to shut down all nodes except those already "down". |
Crane WS only: shutdown nfs server, Synology update, Windows update | |
Scheraga only: Synology update | Requires reboot. Safer with headnode off. |
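The shutdown-phase rows above can be sketched as one dry-run script. The individual commands (touch /etc/nologin, qdel all, /root/shutdown_nodes.sh) are taken from the table; the function name, the prefix-command dry-run pattern, and the final "shutdown -h now" for the headnode are assumptions for illustration.

```shell
# Dry-run sketch of the shutdown phase. "$run" is a prefix command:
# pass "echo" to preview the steps, or "" (empty) to execute them.
shutdown_phase() {
    run="$1"
    $run touch /etc/nologin          # disable outside access (removed on reboot)
    $run qdel all                    # delete all queued/running jobs (Torque)
    $run sh /root/shutdown_nodes.sh  # power off compute nodes (check pestat first)
    $run shutdown -h now             # headnode last (assumed command)
}

# preview the sequence without touching anything:
# shutdown_phase echo
```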
· Reboot switches
· Boot synology
· Boot headnode
o iSCSI config (on startup): 5 sec ping timeout; will change to 10 & 15
· Test UPS shutdown error
· Reboot anything that needs it
· Boot nodes
· Enable access
· Send email
· UPS config: 10 second poll
...
Reboot switches | Only for Scheraga Matrix usually. All others have shared switches! Q: Hoffmann compute nodes isolated switches? |
Boot synology | Scheraga Matrix only |
ddimage OS root partition | Boot from a CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 of=root.img (use /dev/md0 instead of /dev/sda1 on software RAID) |
Crane WS only: Try one yum update on as-chm-cran-12 | Wait until the NFS server is up; reboot as-chm-cran-12; if the reboot is OK, yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15, then reboot each |
Boot headnode | |
Verify drives with fsck | touch /forcefsck if we want to force fsck. Only do every ~6 months. Time consuming. |
Test UPS and its notifications | Does it work as expected? How reasonably test? Pull power? Rely on self test? |
Reboot anything that needs | |
Boot nodes | |
Enable access | rm /etc/nologin |
Send email | |
Add "maintenance record" description at HPC's wiki page |
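The "ddimage OS root partition" row can be sketched as a function. The sfdisk and dd invocations mirror the table; the function name, the dry-run prefix, and the bs=4M block size are illustrative assumptions (the live-CD boot and the /dev/md0 alternative for software RAID come from the table itself).

```shell
# Sketch of imaging the OS root partition from a CentOS live CD.
# Arguments: prefix command ("echo" = dry run), whole disk, root
# partition (/dev/md0 on software-RAID systems), output image file.
ddimage_root() {
    run="$1"; disk="$2"; part="$3"; img="$4"
    $run sfdisk -d "$disk"              # dump partition table (redirect to sda.partition for real)
    $run dd if="$part" of="$img" bs=4M  # copy the root filesystem to an image
}

# dry run: ddimage_root echo /dev/sda /dev/sda1 root.img
```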