Typical work done during maintenance

(Point to Lulu's current, active checklist! Nov 2015)

Update BIOS (and why it's done...)

Update OS

Update...

Test UPS

Test backups

Test...

Proactive maintenance should be scheduled approximately quarterly, with no more than 6 months between sessions.
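The interval rule above could be checked per cluster with something like the following. This is only a sketch: the 91-day and 182-day thresholds are assumed stand-ins for "approximately quarterly" and "6 months", and the function name and status labels are hypothetical, not part of any existing ChemIT tooling.

```python
from datetime import date, timedelta

# Assumptions: "approximately quarterly" is taken as 91 days,
# and the "6 months" upper limit as 182 days.
TARGET_INTERVAL = timedelta(days=91)
MAX_INTERVAL = timedelta(days=182)


def maintenance_status(last_done: date, today: date) -> str:
    """Classify a cluster as 'ok', 'due', or 'overdue' from its last maintenance date."""
    elapsed = today - last_done
    if elapsed > MAX_INTERVAL:
        return "overdue"  # past the 6-month limit
    if elapsed >= TARGET_INTERVAL:
        return "due"      # at or past the quarterly target
    return "ok"


if __name__ == "__main__":
    # Example dates are illustrative only.
    print(maintenance_status(date(2015, 1, 1), date(2015, 5, 1)))
```

Running a check like this against each cluster's last maintenance date would show which groups need a date proposed soon.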

Sample message:

To: ?
Subject: PI's ClusterName: Date/time of planned downtime.

...

-----------------------------------------------

Communication timeline

1) ChemIT notifies group rep. of planned date.

2) Group rep. confirms there is no better date (or negotiates a better date, with ChemIT staff).

3) Group rep. notifies all users of the cluster, using a message crafted by ChemIT.

Or, group rep. requests that ChemIT send the email to all cluster users on their behalf. The message is sent to each user's <NetID@cornell.edu> address.

4) The work day before the shutdown, ChemIT sends a reminder.

5) When the cluster is shut down, ChemIT sends a statement to that effect.

6) ChemIT sends a status report if the cluster is not up when expected, providing a new time estimate.

7) ChemIT sends a report when the server is again available.
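The date for the reminder in step 4 could be worked out as below. This is a minimal sketch: it treats Monday through Friday as work days and ignores university holidays, which is an assumption, and the function name is hypothetical.

```python
from datetime import date, timedelta


def previous_work_day(shutdown: date) -> date:
    """Return the last Monday-Friday date strictly before the shutdown date.

    Assumption: holidays are not considered, only weekends are skipped.
    """
    day = shutdown - timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


if __name__ == "__main__":
    # Example: a shutdown planned for Monday, 16 Nov 2015.
    print(previous_work_day(date(2015, 11, 16)))
```

For a Monday shutdown the reminder falls on the preceding Friday, so users still see it before the weekend.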

Ideas

Establish a schedule for at least 6 months out. Why? Used by whom? What of things changing?

Emergency work procedures

Communication timeline

1) Something bad happens, which was not scheduled.

...

4) ChemIT notifies group rep (users) of status and prognosis as soon as practicable.

...

A record (and notes) of Emergency Actions

Cluster or HPC name	Event date and action
Abruna
Ananth
Collum
Hoffmann
Lancaster (w/ Crane)
Scheraga
Widom-Loring
Eldor
