This list includes things that require the cluster to be turned off.

Purpose

This checklist serves two purposes:

1) A list of tasks that require the cluster to be turned off, or that are more safely done with the cluster off even if a shutdown is not strictly required.

2) A list of tasks to consider whenever a cluster is being turned off or rebooted for other reasons.

 

Examples of activities taken in past maintenance windows:

During this down-time, we intend to:

  • Test new GPU software capabilities.
  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Update the UPS software to address the current software's limitations.
  • Confirm backups and review other system software configurations.

Maintenance Checklist

· BIOS / firmware updates (and why it's done...)

· Network / switches

· Synology / storage system Updates

· Firmware updates, including drives, cards, etc.

· Disk changes / swaps / repairs

· Add / remove nodes

· Kernel updates: Only security updates

· Account / password sync

· OS Updates: Only security updates (via YUM)

· UPS / check / updates / settings
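The "only security updates (via YUM)" items above can be sketched as a small Python wrapper. This is a minimal sketch, not the procedure actually used on the cluster: it assumes the `--security` option is available to `yum` (on older releases it is provided by the yum security plugin), and the function names `security_update_command` and `run` are hypothetical.

```python
import subprocess


def security_update_command(check_only=True):
    """Build the yum command line for security-only updates."""
    if check_only:
        # Lists pending security updates without installing anything.
        return ["yum", "check-update", "--security"]
    # Installs security updates only; -y answers prompts automatically.
    return ["yum", "-y", "update", "--security"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return None
    # check=False because `yum check-update` exits with 100 when
    # updates are available, which is not an error.
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Dry run: only prints what would be executed.
    run(security_update_command(check_only=True))
    run(security_update_command(check_only=False))
```

Keeping the default as a dry run lets the commands be reviewed before the maintenance window and executed only once outside access has been disabled.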

 

 

 

Sequence taken during maintenance

· Schedule / notify representative / users

· Evaluate hard drives

· Verify backups

· Disable outside access

· Delete jobs

· Chkdsk

· Shutdown headnode & nodes

· Synology update - requires reboot. Safer with headnode off

· Reboot switches

· Boot Synology

· Boot headnode

    o iSCSI config (on startup): ping timeout is 5 sec; will change to 10 & 15

· Test UPS shutdown error

· Reboot anything that needs it

· Boot nodes

· Enable access

· Send email
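The iSCSI startup note in the sequence above mentions a ping timeout. One way to picture that check is a script that waits for the storage system to answer a ping before the headnode continues. This is a minimal sketch under stated assumptions: it uses the Linux `ping` utility (`-c` for count, `-W` for the reply timeout in seconds), the host name `storage.example` is a placeholder, and the functions `ping_once` and `wait_for_host` and their default values are hypothetical, not the cluster's actual startup configuration.

```python
import subprocess
import time


def ping_once(host, timeout_s):
    """Send one ICMP echo with Linux `ping`; True if the host replied."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_host(host, timeout_s=10, attempts=3, delay_s=1.0, probe=ping_once):
    """Probe the host up to `attempts` times; True as soon as one probe succeeds."""
    for attempt in range(1, attempts + 1):
        if probe(host, timeout_s):
            return True
        if attempt < attempts:
            # Pause between failed probes, but not after the last one.
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    # Placeholder host name; replace with the storage system's address.
    if wait_for_host("storage.example", timeout_s=10):
        print("Storage reachable; safe to continue with iSCSI startup.")
    else:
        print("Storage not reachable; do not boot the nodes yet.")
```

The `probe` parameter exists so the retry logic can be exercised without sending real pings, for example by passing a function that always returns False.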

 

    • UPS config - 10 seconds poll,  

 

 
