General |
Deployed, or simply ideas or options which would require further study |
Notes |
Scheraga's Matrix upgrade, 2014
$50K (1st of 3 yrs; ~$150K total) |
Software |
ChemIT: OS, the cluster's software "stack", core applications, and user applications.
Operating system: Current, compatible, supportable?
Researchers: Their applications, configurations, scripts, sharing, etc.
Maintenance and updates, including scheduling.
- How will OS and apps be kept current?
- Any need to "freeze" versions?
- Roll-back process, if any. Process for major upgrades.
|
See Roles and responsibilities for clusters managed by ChemIT. |
Confirm this is Czerek.
Confirm applications and their locations (some shared apps are in user directories, which is not a best practice.) |
Cornell Active Directory |
Provides common logon, accounts & resource management. |
-Currently used for most Windows & Macintosh systems
-Where / when is Linux AD integration practical?
When is it of value to the research group or ChemIT? |
ChemIT is testing this; not ready to offer it as of 3/2014. (Revisit whether we are ready to offer it when it is time to deploy.) |
Head nodes and compute nodes |
Node form factors: Singles, twins, quads. Also 1U, 2U, etc.
Head node - what needs to be on-board, vs separate?
Compute node technologies: Anything special required (for example, GPUs)?
Plan for upgrades, removal, and expansion process, obsolescence. |
See ChemIT's inventory snapshots of CCB's clusters.
Ensure a contemporary head node, taking into account its age, warranty, and ease of replacement with a compute node (unique attributes, including hard drive bays). |
ChemIT: Will require a new, dedicated head node.
Consider setting up the head node first, with the required software running on a few old compute nodes, and only then buying new compute nodes (cheaper or better if we wait some months?).
Compute nodes: Get quads (2U)?
Q: Any GPUs required this first year? If so, leave them unconnected to the cluster? (That is how the four other GPU-based compute nodes are set up: they run completely independently of the cluster and of each other.) |
Data storage |
1) Storage required for head node and computational use (short term), including job store and user accounts.
2) Longer term storage needs
a) On-board storage - limited
b) Separate file server - may meet needs better. Examples include: CIT SFS service (NFS is an option there), dedicated file server, storage box (Synology). |
-Head Node on-board storage is space limited
-Storing large amounts of data makes restores harder, riskier, and more time-consuming, and can reduce cluster reliability, availability, and performance.
-Storing large amounts of data needing backups will cost more than smaller amounts of data. |
Consider the value in separating out longer-term user files from those related to current computational data.
See backups, below. |
Backup |
Options:
Local disk copies / rsync
EZ-Backup service
Local, on head node: Copies and/or synced
Local, off head node: Copies and/or synced
Using SFS, which itself has versioning and whose price includes backups with EZ-Backup. |
See Cluster backups and related considerations.
Not all options are mutually exclusive.
Options vary in what they protect against and their start-up and on-going costs.
Options vary in restore times and in end-user vs. mediated restores.
Rule of thumb: The faster you pull unique data off, the less you have to invest in backups. |
Get input from Czerek on our current practices, costs, value, as well as other ideas listed.
See data storage, above. |
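As an illustration of the "local disk copies / rsync" option in the backup row above, here is a minimal Python sketch that only builds an rsync command line. It assumes rsync is installed on the head node. The function name and the paths are hypothetical and do not describe ChemIT's current practice. The `--link-dest` option lets unchanged files be hard-linked against a previous snapshot, which keeps dated copies small.

```python
import shlex


def build_rsync_command(source, destination, previous_snapshot=None, dry_run=True):
    """Return the argument list for a one-way rsync copy of `source` into `destination`.

    All paths are hypothetical examples; nothing is executed here.
    """
    # --archive preserves permissions, times, and symlinks;
    # --delete removes files from the copy that no longer exist in the source.
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Report what would change without touching the destination.
        cmd.append("--dry-run")
    if previous_snapshot is not None:
        # Hard-link unchanged files against an earlier snapshot to save space.
        cmd.append(f"--link-dest={previous_snapshot}")
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", destination]
    return cmd


# Assumed example paths, for illustration only.
command = build_rsync_command("/home", "/backup/home")
print(shlex.join(command))
```

To actually run the copy, the list could be passed to `subprocess.run(command, check=True)` after reviewing the dry-run output. Note that a copy on the head node itself does not protect against loss of the head node.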
Networking |
Ensure adequate number of network switches are provisioned.
Performance / throughput needed (1 Gb Ethernet is standard; 10 Gb Ethernet and InfiniBand are available at high cost)
Redundancy (Bonding, or other)
Cabling.
Physical arrangement/ proximity. |
|
ChemIT: Will require some more switches and cables. |
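To make the throughput question in the networking row above concrete, the following Python sketch estimates how long a bulk transfer takes at a given link speed. It is rough arithmetic under stated assumptions, not a measurement: the 80% usable fraction of line rate and the 1000 GB data size are assumed values chosen for illustration.

```python
def transfer_hours(data_gigabytes, link_gigabits_per_s, efficiency=0.8):
    """Estimate the hours needed to move data over a network link.

    `efficiency` is an assumed usable fraction of the nominal line rate.
    """
    if link_gigabits_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    gigabits = data_gigabytes * 8  # 8 bits per byte
    seconds = gigabits / (link_gigabits_per_s * efficiency)
    return seconds / 3600


# Assumed example: moving 1000 GB over 1 Gb and 10 Gb links.
for speed in (1, 10):
    print(f"{speed} Gb/s link: about {transfer_hours(1000, speed):.1f} hours")
```

If typical job data is small, the standard 1 Gb network may be adequate. The estimate matters mainly for backups, restores, and large shared data sets.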
Power |
-Power required per unit & total.
-Circuits & connections available in server room.
-Power distribution strips required (limits!).
Power loss protection
-Requires Uninterruptible Power Supply (UPS).
-Cost usually limits protection to the head nodes only.
-Duration of protection?
-Form factor options. |
- Understand the effects of power outages on the head node (can be severe) and compute nodes (less), based on duration.
- Preferred: provide continuity for short outages and controlled shutdown of head node in case of longer outages.
- Recovery / restart processes (manual) for head nodes, compute nodes, and networks.
-Differences for outages during office hours vs off hours.
-Heat dissipation (HVAC), including emergencies. |
Recently purchased UPS can be used for the new head node. Any further protection required to reduce downtime? |
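The "power required per unit & total" and "limits" items in the power row above amount to a simple budget check, sketched here in Python. All numbers are hypothetical. The 80% derating for continuous loads is a common planning assumption and should be confirmed with facilities staff and the actual circuit and power strip ratings.

```python
def circuit_budget(node_watts, circuit_volts, breaker_amps, derate=0.8):
    """Compare the total node power draw against the usable capacity of one circuit.

    Returns (total_watts, usable_watts, fits). `derate` is an assumed
    continuous-load safety factor, not a site-specific rating.
    """
    total_watts = sum(node_watts)
    usable_watts = circuit_volts * breaker_amps * derate
    return total_watts, usable_watts, total_watts <= usable_watts


# Assumed example: eight compute nodes drawing 350 W each.
nodes = [350] * 8
for volts, amps in ((120, 20), (208, 30)):
    total, usable, fits = circuit_budget(nodes, volts, amps)
    verdict = "fits" if fits else "exceeds the budget"
    print(f"{volts} V / {amps} A circuit: {total} W of {usable:.0f} W usable, {verdict}")
```

The same check applies per power distribution strip, using the strip's own rated limit in place of the breaker rating.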
Maintenance |
Regular maintenance - scheduled, unscheduled
Downtimes vs live
OS patches / upgrades
Application addition / updates / removal
Hard drive replacement
Failure recovery |
|
|
Rack space |
Purchase rack or use space in existing group or department rack. |
Physical arrangement.
Form factors (see nodes, above). |
Significant rearrangements may be required in 248 to accommodate a lot of new compute nodes. |
Contacts and roles |
Funder(s).
Technical lead (in research group).
Testers.
Users.
ChemIT technical lead |
|
|
Schedule & project management |
ChemIT will provide project planning, management, and scheduling. |
Based on:
Group research needs & funding
ChemIT schedule, priorities & availability
Scope of work to be done |
|