Technical considerations when buying a new cluster, or adding to an existing cluster. Also applies to other high performance computing (HPC) systems.

General

Deployed, or simply ideas or options which would require further study

Notes

Scheraga's Matrix upgrade, 2014
$50K (1st of 3 yrs; ~$150K total)

Links

 

 

P62 - Scheraga cluster Matrix upgrade, ticket INC000001055288

Software

ChemIT: OS, the cluster's software "stack", core applications, and user applications.
Operating system: Current, compatible, supportable?
Researchers: Their applications, configurations, scripts, sharing, etc.
Maintenance and updates, including scheduling.

  • How will OS and apps be kept current?
  • Any need to "freeze" versions?
  • Roll-back process, if any. Process for major upgrades.

See Roles and responsibilities for clusters managed by ChemIT.

Confirm this is Czerek.
Confirm applications and their locations (some shared apps are in user directories, which is not a best practice.)

Cornell Active Directory

Provides common logon, accounts & resource management.

-Currently used for most Windows & Macintosh systems
-Where / when is Linux AD integration practical?
When is it of value to the research group or ChemIT?

ChemIT is testing this; not ready to offer it as of 3/2014. (Revisit whether we are ready to offer it when it is time to deploy.)

Head nodes and compute nodes

Node form factors: Single, twin, quad; 1U, 2U, etc.
Head node - what needs to be on-board, vs separate?
Compute node technologies: Anything special required? Examples are GPUs
Plan for upgrades, removal, and expansion process, obsolescence.

See ChemIT's inventory snapshots of CCB's clusters.
Ensure a contemporary head node, taking into account its age, warranty, and ease of replacement with a compute node (unique attributes, including hard drive bays).

ChemIT: Will require a new, dedicated head node.
Consider buying compute nodes only after the head node is set up, with the required software running on a few old compute nodes. Then buy the new compute nodes (cheaper or better if we wait some months?).
Compute nodes: Get quads (2U's)?
Q: Are any GPUs required this first year? If so, leave them unconnected to the cluster? (That is how the four other GPU-based compute nodes are set up: they run completely independently from the cluster and from each other.)

Data storage

1) Storage required for headnode and computational use (short term), including job store and user accounts.
2) Longer term storage needs
  a) On-board storage - limited
  b) Separate file server - may meet needs better. Examples include: CIT SFS service (NFS is an option there), dedicated file server, storage box (Synology).
Old head node's hard drive: ChemIT's practice is to retain the hard drive in the working old head node until the new cluster is confirmed good. The hard drive contains system and application configurations, which can be consulted on the (unconnected, but otherwise working) old head node if needed for comparison with the new system's software. (The hard drive also contains copies of user data, but the newest versions of users' data will be on the new head node.) After the old head node is added as a compute node, the group must decide whether to keep the hard drive and, if so, where it will be kept and for how long.

-Head Node on-board storage is space limited
-Storing large amounts of data makes restores harder, riskier, and more time-consuming, and can reduce cluster reliability, availability, and performance.
-Storing large amounts of data needing backups will cost more than smaller amounts of data.

Consider the value in separating out longer-term user files from those related to current computational data.
See backups, below.

Backup

Options:
Local disk copies / rsync
EZ-Backup service
Local, on head node: Copies and/ or sync'ed
Local, off head node: Copies and/or sync'ed
Using SFS, which itself has versioning and whose price includes backups with EZ-Backup.
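As a rough illustration of the "local disk copies / rsync" option, the sketch below builds an rsync command for dated snapshots that hard-link unchanged files against the previous snapshot. The paths, snapshot names, and the helper name `rsync_snapshot_cmd` are hypothetical, and this is not a description of ChemIT's current practice; it only uses the standard rsync options `-a`, `--delete`, and `--link-dest`.

```python
from pathlib import PurePosixPath


def rsync_snapshot_cmd(source, dest_root, snapshot, previous=None):
    """Build (but do not run) an rsync command for one dated snapshot.

    source    -- directory to copy, e.g. "/home" (hypothetical path)
    dest_root -- directory holding all snapshots (hypothetical path)
    snapshot  -- name of the new snapshot directory, e.g. a date string
    previous  -- name of the prior snapshot; unchanged files are
                 hard-linked to it instead of being copied again
    """
    root = PurePosixPath(dest_root)
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        cmd.append(f"--link-dest={root / previous}")
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd.append(source.rstrip("/") + "/")
    cmd.append(str(root / snapshot) + "/")
    return cmd


if __name__ == "__main__":
    print(" ".join(rsync_snapshot_cmd("/home", "/backup/home",
                                      "2014-03-02", previous="2014-03-01")))
```

The returned list can be passed to `subprocess.run` once the paths have been checked. Whether the snapshots live on or off the head node is a separate decision from the options above.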

See zCluster backups and related considerations.
Not all options are mutually exclusive.
Options vary in what they protect against and their start-up and on-going costs.
Options vary in restore times and end-user vs. mediated restores.
Rule of thumb: The faster you pull unique data off, the less you have to invest in backups.

Get input from Czerek on our current practices, costs, value, as well as other ideas listed.
See data storage, above.

Networking

Ensure an adequate number of network switches is provisioned.
Performance / throughput needed (1 Gb standard; 10 Gb and InfiniBand available at high cost)
Redundancy (Bonding, or other)
Cabling.
Physical arrangement/ proximity.
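To help judge the throughput needed, the following minimal sketch estimates how long a bulk transfer would take at a given link speed. The `efficiency` factor (the fraction of nominal link speed actually achieved) is an assumption to be replaced with measured values, and the function name is illustrative only.

```python
def transfer_seconds(data_bytes, link_gbps, efficiency=1.0):
    """Estimate transfer time in seconds for data_bytes over a link.

    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of nominal speed actually achieved
                  (protocol overhead, disk speed, contention)
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    one_tb = 1e12  # 1 TB in bytes (decimal)
    for gbps in (1, 10):
        hours = transfer_seconds(one_tb, gbps) / 3600
        print(f"1 TB at {gbps} Gb/s: about {hours:.2f} hours at full link speed")
```

Real transfers are usually slower than the nominal figure, so the result is best read as a lower bound on transfer time.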

 

ChemIT: Will require some more switches and cables.

Power

-Power required per unit & total.
-Circuits & connections available in server room.
-Power distribution strips required (limits!).
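The per-unit and total power questions above can be roughed out with a small calculation like the one below. All wattages, voltages, and breaker ratings are placeholders to be replaced with figures from vendor specifications and the server room; the 80% continuous-load derating is a commonly used rule of thumb and is an assumption here, not a statement about this facility.

```python
import math
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    watts: float      # assumed worst-case draw per unit, from vendor specs
    count: int = 1


def total_watts(devices):
    """Sum the worst-case draw of all devices."""
    return sum(d.watts * d.count for d in devices)


def circuit_capacity_watts(volts, amps, derate=0.8):
    """Usable watts on one circuit after derating (assumed 80% rule of thumb)."""
    return volts * amps * derate


def circuits_needed(devices, volts, amps, derate=0.8):
    """Lower bound on circuits required; ignores how devices split across circuits."""
    return math.ceil(total_watts(devices) / circuit_capacity_watts(volts, amps, derate))


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    plan = [Device("head node", 400), Device("compute node", 350, count=8)]
    print("Total draw (W):", total_watts(plan))
    print("Circuits needed at 120 V / 20 A:", circuits_needed(plan, 120, 20))
```

The same capacity check applies to each power distribution strip, whose own rating may be lower than that of the circuit feeding it.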

Power loss protection
 -Requires Uninterruptible Power Supply (UPS).
 -Cost usually limits protection to the head nodes only.
 -Duration of protection?
 -Form factor options.
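For the duration question, a first-order estimate of UPS runtime can be sketched as follows. The battery capacity, load, and inverter efficiency are assumed inputs; actual runtime depends on battery age and load in ways this linear model ignores, so vendor runtime charts should take precedence.

```python
def ups_runtime_minutes(battery_wh, load_watts, efficiency=0.9):
    """First-order runtime estimate: usable energy divided by load.

    battery_wh -- nominal battery capacity in watt-hours (assumed input)
    load_watts -- steady load on the UPS in watts
    efficiency -- assumed inverter efficiency, between 0 and 1
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * efficiency / load_watts * 60


def covers_shutdown(runtime_min, shutdown_min, margin=2.0):
    """True if runtime leaves `margin` times the time a controlled shutdown needs."""
    return runtime_min >= shutdown_min * margin


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    runtime = ups_runtime_minutes(battery_wh=600, load_watts=400)
    print(f"Estimated runtime: {runtime:.0f} minutes")
    print("Enough for a 10-minute shutdown with 2x margin:",
          covers_shutdown(runtime, 10))
```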

- Understand the effects of power outages on the head node (can be severe) and compute nodes (less), based on duration.
- Preferred: provide continuity for short outages and controlled shutdown of head node in case of longer outages.
- Recovery / restart processes (manual) for headnodes, cnodes, and networks.
-Differences for outages during office hours vs off hours.
-Heat dissipation (HVAC), including emergencies.

Recently purchased UPS can be used for the new head node. Any further protection required to reduce downtime?

Maintenance

Regular maintenance - schedule, non-scheduled
Downtimes vs live
OS patches / upgrades
Application addition / updates / removal
Hard drive replacement
Failure recovery

 


Rack space

Purchase rack or use space in existing group or department rack.

Physical arrangement.
Form factors (see nodes, above).

Significant rearrangements may be required in 248 to accommodate a lot of new compute nodes.

Contacts and roles

Funder(s).
Technical lead (in research group).
Testers.
Users.
ChemIT technical lead

 

 

Schedule & project management

ChemIT will provide project planning, management, and scheduling.

Based on:
Group research needs & funding
ChemIT schedule, priorities & availability
Scope of work to be done

 
