Excerpt

This page includes a list of to-dos to consider during maintenance periods (when the cluster is turned off).

Purpose

This checklist can be used as:

1) A list of things that require the cluster to be turned off, or that are more safely done when the cluster is turned off, even if not strictly required.

2) A list of things to consider doing when a cluster is being turned off / rebooted for other reasons.

 

Examples of activities taken in past maintenance windows:

During this down-time, we intend to:

  • Test new GPU software capabilities.
  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Update the UPS software to address current software's limitations.
  • Confirm backups and review other system software configurations.

Maintenance Checklist

Activity category | Item / topic | Who decides or knows if action is possible or needed? (Presumed Lulu, unless marked otherwise) | Notes or comments
P/T | Network / switches | Project lead or ticket owner |
P/T | Disk changes / swaps / repairs | Project lead or ticket owner |
P/T | Add / remove nodes | Project lead or ticket owner |
D | Motherboard: BIOS / firmware updates (and why it's done...) | Michael (how get reminded / asked? Ask for review of all clusters in a maintenance cycle; or: ask for each cluster within a cycle?) | Risk-management consideration vs. expected benefit. Q: Just head node? Q: CAC: What do they do? (ex. Ananth)
D | Firmware updates, including drives, cards, etc. | Michael (how get reminded / asked?) |
D | Scheraga only: Synology updates | Michael (how get reminded / asked?) | Any other "storage system"?
D | Kernel updates: only security updates; account / password sync | |
D | OS updates: only security updates (via YUM) | |
D | UPS: review and confirm settings. Any updates? Confirm thresholds make sense, especially as battery gets older. | |

Key:

P/T: Default answer is "no action". This category is listed in case a Project or Ticket affects this topic and the related work is to be done during the maintenance window.

...