A list of considerations when buying a new cluster, or adding to an existing cluster. Also applies to other high performance computing (HPC) systems.

General

Deployed, or simply ideas or options which would require further study

Notes

Scheraga's Matrix upgrade, 2014
$50K (1st of 3 yrs; ~$150K total)

Software

ChemIT: OS, the cluster's software "stack", and core applications.
Researchers: Their applications, configurations, scripts, sharing, etc.
Maintenance and updates, including scheduling. Roll-back process, if any. Process for major upgrades.

See Roles and responsibilities for clusters managed by ChemIT.

Confirm this is Czerek.
Confirm applications and their locations (some shared apps are in user directories, which is not a best practice).

Backup

EZ-Backup service
Local, on head node: Copies and/or synced
Local, off head node: Copies and/or synced
Using SFS, which itself has versioning. The SFS price also includes backups with EZ-Backup.

See Cluster backups and related considerations.
Not all options are mutually exclusive.
Options vary in what they protect against and in their start-up and ongoing costs.
Options vary in restore times and in whether restores are done by the end user or mediated by staff.
Rule of thumb: The faster you pull unique data off, the less you have to invest in backups.

Get input from Czerek on our current practices, costs, and value, as well as on the other ideas listed.

Head nodes and compute nodes

Ensure the head node is contemporary, taking into account its age, warranty, and ease of replacement with a compute node (unique attributes, including hard drive bays).
Node form factors: Single, twins, quads. Also 1U, 2U, etc.
Compute node technologies: Anything special required? Examples are GPUs and InfiniBand.
Upgrade, removal, and expansion process.

See ChemIT's inventory snapshots of CCB's clusters.

Q:

Data storage

Storage required for head node and computational use (short term), including job store and user accounts.
Longer-term storage needs, for which a file server may be a better fit. Examples include the SFS service (NFS is an option there).

Storing large amounts of data makes restores harder, riskier, and more time-consuming. Backing up large amounts of data also costs more than backing up smaller amounts.
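To make the restore-time point concrete, here is a rough back-of-the-envelope sketch. It assumes a single sustained transfer rate and decimal units (1 TB = 1,000,000 MB); the function name and the example figures are illustrative only, not measurements from any CCB cluster or from EZ-Backup.

```python
def restore_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to restore data_tb terabytes.

    Assumes a constant sustained throughput, which real restores
    rarely achieve (many small files, network contention, etc.).
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = data_mb / throughput_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical sizes and rate, for comparison only.
    for size_tb in (1, 10):
        print(f"{size_tb} TB at 100 MB/s: {restore_hours(size_tb, 100):.1f} hours")
```

Under these assumptions, 1 TB at a sustained 100 MB/s takes roughly 2.8 hours, and 10 TB takes roughly 28 hours, so restore time grows in direct proportion to the amount of data kept on the cluster.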

 

Networking

Ensure an adequate number of network switches is provisioned. Cabling. Physical arrangement and proximity.

 

 

Power

Power strips required (limits!).
Power interruptions: Require an Uninterruptible Power Supply (UPS). Costs usually limit protection to the head nodes only. Duration of protection? Auto-shutdown of head node on protracted outage? Form factor options.
Procedure for staff when power goes out, and when it is restored (recovery, restart). Both during office hours and outside of normal office hours, or if key staff are unavailable.
Heat dissipation (HVAC), including emergencies.
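The UPS duration and auto-shutdown questions above can be roughed out as follows. This is a simplified sketch, not a vendor sizing method: it assumes runtime scales linearly with usable battery energy, and the efficiency figure, margin, and function names are all illustrative assumptions. Real runtimes should come from the UPS vendor's runtime charts.

```python
def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Rough UPS runtime estimate in minutes.

    Linear approximation: usable energy divided by load.
    The 0.9 efficiency default is an assumed placeholder.
    """
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_w * 60


def should_shut_down(remaining_minutes: float, shutdown_minutes: float,
                     margin_minutes: float = 2.0) -> bool:
    """Decide whether to start a clean head node shutdown.

    Trigger once the remaining runtime no longer covers the time a
    clean shutdown takes plus a safety margin.
    """
    return remaining_minutes <= shutdown_minutes + margin_minutes


if __name__ == "__main__":
    # Hypothetical numbers: 1000 Wh battery, 500 W head node load.
    print(f"Estimated runtime: {ups_runtime_minutes(1000, 500):.0f} minutes")
    print("Shut down now?", should_shut_down(remaining_minutes=5,
                                             shutdown_minutes=4))
```

Keeping the shutdown decision separate from the runtime estimate means the threshold can be tuned to how long the head node actually takes to shut down cleanly.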

 

 

Cornell Active Directory

When is it of value to the research group or to ChemIT?

 

 

Rack space

Physical arrangement. Form factors (see nodes, above).

 

 

Upgrade process contacts and roles

Funder(s). Technical lead (in research group). Testers. Users.

 

 
