
This page lists to-dos to consider during maintenance periods (when the cluster is turned off).

Purpose

This checklist can be used for:

1) Tasks that require the cluster to be turned off, or that are more safely done when it is turned off even if that is not strictly required.

2) Tasks to consider doing when a cluster is being turned off or rebooted for other reasons.

Maintenance Checklist

| Activity category | Item / topic | Who knows if action possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments |
|---|---|---|---|
| P/T | Network / switches | Project lead or ticket owner | |
| P/T | Disk changes / swaps / repairs | Project lead or ticket owner | |
| P/T | Add / remove nodes | Project lead or ticket owner | |
| D | Motherboard: BIOS / firmware updates (and why it's done...) | Michael (how get reminded / asked?) Ask for review of all clusters in a maintenance cycle. Or: ask for each cluster within a cycle? | Risk-management consideration vs. expected benefit. Q: Just head node? Q: CAC: What do they do? (ex. Ananth) |
| D | Firmware updates, including drives, cards, etc. | Michael (how get reminded / asked?) | |
| D | Scheraga only: Synology updates | Michael (how get reminded / asked?) | Any other "storage system"? |
| D | Kernel updates: only security updates; account / password sync | | |
| D | OS updates: only security updates (via YUM) | | |
| D | UPS: Review and confirm settings. Any updates? | | Confirm thresholds make sense, especially as battery gets older. |

Key:

P/T: Default answer is "no action". Is there a Project or Ticket awaiting implementation during the maintenance work?

D: Discretion or decision required. The answer may well be "nothing this time around". Just because something can be done does not mean it should be done: weigh risk against benefit.
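
As an illustration only, the key above could be captured in a small data structure so that the items needing an explicit decision are easy to list before each maintenance cycle. This is a minimal sketch: the names `ChecklistItem`, `DEFAULTS`, `CHECKLIST` and `items_needing_decision` are hypothetical, and only a few checklist rows are included.

```python
from dataclasses import dataclass

# Category codes and their default answers, paraphrased from the key above.
DEFAULTS = {
    "P/T": "no action unless a project or ticket is waiting",
    "D": "decision required (risk vs. benefit)",
}


@dataclass
class ChecklistItem:
    category: str          # "P/T" or "D"
    topic: str
    contact: str = "Lulu"  # presumed contact unless marked otherwise

    def default_answer(self) -> str:
        return DEFAULTS[self.category]


def items_needing_decision(items):
    """Return the topics of all "D" items, in checklist order."""
    return [item.topic for item in items if item.category == "D"]


# A few rows from the checklist, for illustration only.
CHECKLIST = [
    ChecklistItem("P/T", "Network / switches", "Project lead or ticket owner"),
    ChecklistItem("D", "Motherboard: BIOS / firmware updates", "Michael"),
    ChecklistItem("D", "UPS: Review and confirm settings"),
]

if __name__ == "__main__":
    for topic in items_needing_decision(CHECKLIST):
        print(topic)
```

Keeping the default answer next to the category code means a P/T row stays "no action" unless someone attaches a project or ticket to it.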

Sequence of actions taken during maintenance

| Action | Notes |
|---|---|
| Schedule / notify representative / users | |
| Evaluate hard drives | |
| Verify backups | |
| Disable outside access | |
| Delete jobs | |
| Shutdown headnode & nodes | |
| Synology update | Requires reboot. Safer with headnode off. |
| Reboot switches | |
| Boot synology | |
| Boot headnode | |
| Verify drives with fsck | Only do every ~6 months. Time consuming. Ex: Abruna's cluster (small): 1-2 hours |
| Test UPS and its notifications | Does it work as expected? How can it reasonably be tested? |
| Reboot anything that needs it | |
| Boot nodes | |
| Enable access | |
| Send email | |
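
The sequence above could be generated per maintenance window so that the conditional steps are not forgotten or repeated needlessly. The following is a rough sketch, not an existing tool: `fsck_due`, `maintenance_steps` and `FSCK_INTERVAL` are hypothetical names, "every ~6 months" is assumed to mean 182 days, and treating the Synology update as optional per window is also an assumption.

```python
from datetime import date, timedelta

# Assumption: "every ~6 months" is approximated as 182 days.
FSCK_INTERVAL = timedelta(days=182)


def fsck_due(last_fsck, today):
    """Return True if no fsck is on record or the interval has elapsed."""
    return last_fsck is None or today - last_fsck >= FSCK_INTERVAL


def maintenance_steps(last_fsck, today, synology_update=False):
    """Build the ordered list of steps for one maintenance window."""
    steps = [
        "Schedule / notify representative / users",
        "Evaluate hard drives",
        "Verify backups",
        "Disable outside access",
        "Delete jobs",
        "Shutdown headnode & nodes",
    ]
    if synology_update:
        # Requires a reboot; safer while the headnode is off.
        steps.append("Synology update")
    steps += ["Reboot switches", "Boot synology", "Boot headnode"]
    if fsck_due(last_fsck, today):
        # Time consuming, so only included when the interval has elapsed.
        steps.append("Verify drives with fsck")
    steps += [
        "Test UPS and its notifications",
        "Reboot anything that needs it",
        "Boot nodes",
        "Enable access",
        "Send email",
    ]
    return steps


if __name__ == "__main__":
    plan = maintenance_steps(last_fsck=None, today=date.today())
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

Passing `last_fsck=None` forces the fsck step, which seems a reasonable default when no previous run has been recorded.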