Excerpt |
---|
This page includes a checklist of to-do's to consider during maintenance periods (when the cluster is turned off), a checklist for preparing the maintenance work, and the sequence of steps to take. |
Table of Contents |
---|
Purpose
...
2) A list of things to consider doing when a cluster is being turned off / rebooted for other reasons.
Maintenance Checklist
Activity | Item/ topic | Who knows if action possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments |
---|---|---|---|
P/T | Network / switches / router | Project lead or ticket owner | |
P/T | Disk changes / swaps / repairs | Project lead or ticket owner | Can use SMART for s/w RAID while system is running. All but 3: Freed ACERT Eldor, Hoffmann, and Scheraga Matrix |
P/T | Add / remove nodes | Project lead or ticket owner | |
D | Motherboard BIOS updates (headnode and/or compute nodes), and why it's done | Michael (email reminder two days before maintenance) | Risk-management consideration vs. expected benefit. Q: Just head node? Q: CAC: what do they do? (ex. Ananth) Our RITMG peers only do this when trying to debug a problem. |
D | Other firmware updates, including drives, cards, etc. | Michael | Same as for motherboard BIOS update. |
D | Scheraga only: Synology updates | Roger (email reminder two days before maintenance). Backup is Michael. | Any other "storage system"? |
D | Kernel updates | | yum --security update; if the atrpms repo has problems, try: yum --disablerepo=atrpms --security update |
D | OS updates | | |
D | UPS: Review and confirm settings | | Confirm thresholds make sense, especially as the battery gets older. |
Key:
P/T: Default answer is "no action". This entry is here in case a Project or Ticket has an impact on this topic, with work awaiting implementation during the maintenance window.
D: Discretion or decision required. May very well be "nothing this time around". And just because something can be done doesn't necessarily mean it should be done: weigh risk vs. benefit.
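The kernel-update entry in the checklist above can be sketched as a small shell function. The yum flags come straight from the checklist; the function name and the dry-run pattern (passing "echo yum" instead of "yum") are illustrative additions, not part of the documented procedure.

```shell
# Sketch of the kernel security-update step, with the atrpms fallback
# from the checklist. Pass the yum command to use: "yum" for real,
# "echo yum" for a dry run that just prints what would be executed.
security_update() {
    yum_cmd="$1"
    if ! $yum_cmd --security update; then
        # atrpms repo metadata sometimes breaks; retry without it
        $yum_cmd --disablerepo=atrpms --security update
    fi
}

# dry run: security_update "echo yum"
```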
Sequence of actions taken during maintenance
Action | Notes |
---|---|
...
Schedule / notify representative / users | Communication timeline |
Evaluate hard drives (system left in operation) | Review all drives |
Verify backups and related (system left in operation) | Back-in-time (local versioning); EZ-Backup (remote backup) |
...
· Disable outside access
· Delete jobs
· Chkdsk
· Shutdown headnode & nodes
...
Crane WS only: Disable login; kill all user logins; umount NFS from all WS | /etc/nologin; killall -u *** or pkill -KILL -u *** (kill each user's processes one by one); umount /notbackedup; umount /home/local/CORNELL |
Disable outside access | touch /etc/nologin, or vi /etc/nologin to add explanatory text. This file is removed automatically after system reboot. |
Delete jobs | qdel all |
Shutdown headnode & nodes | cd /root; ./shutdown_nodes.sh (the script to shut down all compute nodes). Run the pestat command before running this script; you may need to modify the script to shut down all nodes except those already "down". |
Crane WS only: shutdown nfs server, Synology update, Windows update | |
Scheraga only: Synology update | Requires reboot. Safer with headnode off. |
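The shutdown-phase rows above can be sketched as one dry-run script. The individual commands (touch /etc/nologin, qdel all, /root/shutdown_nodes.sh) are taken from the table; the function name, the prefix-command dry-run pattern, and the final "shutdown -h now" for the headnode are assumptions for illustration.

```shell
# Dry-run sketch of the shutdown phase. "$run" is a prefix command:
# pass "echo" to preview the steps, or "" (empty) to execute them.
shutdown_phase() {
    run="$1"
    $run touch /etc/nologin          # disable outside access (removed on reboot)
    $run qdel all                    # delete all queued/running jobs (Torque)
    $run sh /root/shutdown_nodes.sh  # power off compute nodes (check pestat first)
    $run shutdown -h now             # headnode last (assumed command)
}

# preview the sequence without touching anything:
# shutdown_phase echo
```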
· Reboot switches
· Boot synology
· Boot headnode
o iSCSI config (on startup): 5 sec ping timeout; will change to 10 & 15
· Test UPS shutdown error
· Reboot anything that needs it
· Boot nodes
· Enable access
· Send email
· UPS config: 10 second poll
...
Reboot switches | Only for Scheraga Matrix usually. All others have shared switches! Q: Hoffmann compute nodes isolated switches? |
Boot synology | Scheraga Matrix only |
ddimage OS root partition | Boot from a CentOS live CD; sfdisk -d /dev/sda > sda.partition; dd if=/dev/sda1 of=root.img (use /dev/md0 instead of /dev/sda1 on software RAID) |
Crane WS only: Try one yum update on as-chm-cran-12 | Wait until the NFS server is up; reboot as-chm-cran-12; if the reboot is OK, yum update on as-chm-cran-13, as-chm-cran-14, and as-chm-cran-15, then reboot each |
Boot headnode | |
Verify drives with fsck | touch /forcefsck if we want to force fsck. Only do every ~6 months. Time consuming. |
Test UPS and its notifications | Does it work as expected? How reasonably test? Pull power? Rely on self test? |
Reboot anything that needs | |
Boot nodes | |
Enable access | rm /etc/nologin |
Send email | |
Add "maintenance record" description at HPC's wiki page |
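The "ddimage OS root partition" row can be sketched as a function. The sfdisk and dd invocations mirror the table; the function name, the dry-run prefix, and the bs=4M block size are illustrative assumptions (the live-CD boot and the /dev/md0 alternative for software RAID come from the table itself).

```shell
# Sketch of imaging the OS root partition from a CentOS live CD.
# Arguments: prefix command ("echo" = dry run), whole disk, root
# partition (/dev/md0 on software-RAID systems), output image file.
ddimage_root() {
    run="$1"; disk="$2"; part="$3"; img="$4"
    $run sfdisk -d "$disk"              # dump partition table (redirect to sda.partition for real)
    $run dd if="$part" of="$img" bs=4M  # copy the root filesystem to an image
}

# dry run: ddimage_root echo /dev/sda /dev/sda1 root.img
```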