General
| Deployed, or simply ideas or options that would require further study
| Notes
| Scheraga's Matrix upgrade, 2014: $50K (1st of 3 yrs; ~$150K total). P62 - Scheraga cluster Matrix upgrade, ticket INC000001055288 |
---|
Software
| ChemIT: OS, the cluster's software "stack", core applications, and user applications. Operating system: Current, compatible, supportable? Researchers: Their applications, configurations, scripts, sharing, etc. Maintenance and updates, including scheduling.
- How will OS and apps be kept current?
- Any need to "freeze" versions?
- Roll-back process, if any. Process for major upgrades.
| See Roles and responsibilities for clusters managed by ChemIT. | Confirm this is Czerek. Confirm applications and their locations (some shared apps are in user directories, which is not a best practice).
|
Cornell Active Directory | Provides common logon, accounts & resource management.
| -Currently used for most Windows & Macintosh systems -Where / when is Linux AD integration practical? When is it of value to the research group or ChemIT? | ChemIT is testing; not ready to do this as of 3/2014. (Revisit whether we are ready to offer this when it is time to deploy.) |
Head nodes and compute nodes
| Node form factors: Single, twins, quads; 1U, 2U, etc. Head node - what needs to be on-board vs. separate? Compute node technologies: Anything special required (for example, GPUs)? Plan for upgrades, removal, expansion, and obsolescence.
| See ChemIT's inventory snapshots of CCB's clusters. Ensure contemporary head node, taking into account its age, warranty, and ease of replacement with a compute node (unique attributes, including hard drive bays).
| ChemIT: Will require a new, dedicated head node. Consider setting up the head node first, with the required software running on a few old compute nodes, and only then buying compute nodes (cheaper, better if we wait months?). Compute nodes: Get quads (2U's)? Q: Any GPUs required this first year? If so, keep them off the cluster? (That is how the four other GPU-based compute nodes are set up: they run completely independently from the cluster and from each other.)
|
Data storage
| 1) Storage required for headnode and computational use (short term), including job store and user accounts. 2) Longer term storage needs a) On-board storage - limited b) Separate file server - may meet needs better. Examples include: CIT SFS service (NFS is an option there), dedicated file server, storage box (Synology).
| -Head node on-board storage is space limited -Storing large amounts of data makes restores harder, riskier, and more time-consuming, and can reduce cluster reliability, availability, and performance. -Storing large amounts of data needing backups will cost more than smaller amounts of data.
| Consider the value in separating out longer-term user files from those related to current computational data. See backups, below.
|
Backup | Options: Local disk copies / rsync. EZ-Backup service. Local, on head node: copies and/or sync'ed. Local, off head node: copies and/or sync'ed. Using SFS, which itself has versioning, and whose price includes backups with EZ-Backup. | See Cluster backups and related considerations. Not all options are mutually exclusive. Options vary in what they protect against and their start-up and on-going costs. Options vary in restore times and end-user vs. mediated restores. Rule of thumb: The faster you pull unique data off, the less you have to invest in backups. | Get input from Czerek on our current practices, costs, value, as well as other ideas listed. See data storage, above. |
Networking
| Ensure an adequate number of network switches is provisioned. Performance / throughput needed (1 Gb Ethernet standard; 10 Gb Ethernet and InfiniBand available at high cost). Redundancy (bonding, or other). Cabling. Physical arrangement / proximity.
| | ChemIT: Will require some more switches and cables.
|
Power
| -Power required per unit & total. -Circuits & connections available in server room. -Power distribution strips required (limits!).
Power loss protection -Requires Uninterruptible Power Supply (UPS). -Costs usually limit protection to the head nodes only. -Duration of protection? -Form factor options.
| - Understand the effects of power outages on the head node (can be severe) and compute nodes (less), based on duration. - Preferred: provide continuity for short outages and controlled shutdown of head node in case of longer outages. - Recovery / restart processes (manual) for headnodes, cnodes, and networks. -Differences for outages during office hours vs off hours. -Heat dissipation (HVAC), including emergencies. | Recently purchased UPS can be used for the new head node. Any further protection required to reduce downtime?
|
Maintenance
| Regular maintenance - scheduled, non-scheduled. Downtimes vs. live OS patches / upgrades. Application addition / updates / removal. Hard drive replacement. Failure recovery.
| |
|
Rack space
| Purchase rack or use space in existing group or department rack.
| Physical arrangement. Form factors (see nodes, above). | Significant rearrangements may be required in 248 to accommodate a lot of new compute nodes.
|
Contacts and roles
| Funder(s). Technical lead (in research group). Testers. Users. ChemIT technical lead.
| | |
Schedule & project management
| ChemIT will provide project planning, management, and scheduling.
| Based on: Group research needs & funding. ChemIT schedule, priorities & availability. Scope of work to be done.
| |
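
The Power row asks for power required per unit and in total, checked against the circuits available in the server room. A minimal sketch of that arithmetic follows. Every number in it is a hypothetical placeholder (wattages, unit counts, the 208 V / 30 A circuit), not a measurement of the actual equipment or of room 248, and the 80% derating factor is an assumption reflecting the common practice of not loading a circuit continuously beyond 80% of its rating. Replace all of them with nameplate or measured values before relying on the result.

```python
import math
from dataclasses import dataclass


@dataclass
class Unit:
    """One kind of equipment in the rack (all values hypothetical)."""
    name: str
    watts: float      # assumed peak draw per unit, in watts
    count: int = 1


def total_watts(units):
    """Sum the peak draw across all units."""
    return sum(u.watts * u.count for u in units)


def circuit_capacity_watts(volts, amps, derate=0.8):
    """Usable watts on one circuit; derate is an assumed safety factor."""
    return volts * amps * derate


def circuits_needed(units, volts, amps, derate=0.8):
    """Smallest number of identical circuits that covers the total draw."""
    capacity = circuit_capacity_watts(volts, amps, derate)
    return math.ceil(total_watts(units) / capacity)


# Hypothetical plan: one head node, three quad chassis, two switches.
planned = [
    Unit("head node", 350.0),
    Unit("quad compute chassis (2U)", 1600.0, count=3),
    Unit("network switch", 60.0, count=2),
]

if __name__ == "__main__":
    print("Total draw (W):", total_watts(planned))
    print("Usable per circuit (W):", circuit_capacity_watts(208, 30))
    print("Circuits needed:", circuits_needed(planned, 208, 30))
```

The same total can be compared against the rating of each power distribution strip and against the UPS capacity for the head node, since those limits are usually lower than the circuit's.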