Eldor must find a new physical home by Feb. 1st. Priority is Liang's project, which in the short term is not likely to require Eldor.
Project lead: Oliver <oh10>
Team: Zhichun Liang, Peter Borbat, and Lulu Zhu.
Goal
Run very large jobs, and run them hundreds of times. We expect that owning the equipment is the most cost-effective option, but this needs a reality check against other options such as CAC's RedCloud service (the options are not mutually exclusive).
Project status
4/5/13 (Fri): Decision made that ChemIT will host the new server for at least 6 months. This gives ChemIT more time to work with Jack, Liang, and others on pricing out alternative hosting options, while putting the server back under support in the meantime. This includes Lulu coordinating with Liang on the timing of the server's move and its new OS install (see to-do's).
Liang and Peter have agreed to remove the (computational, GPU) video cards before Lulu upgrades the OS.
- Oliver's understanding, based on a prior conversation with Barry: There are 2 high-end GPUs and one low-end one. The on-board video remains, of course.
To do's:
- Lulu and Liang, with Barry, work out timing for when server gets moved from CCMR to ChemIT, and the necessary steps.
- Lulu and Liang work out downtime required for Lulu to replace OS with an OS ChemIT can support.
- Lulu and Liang to work out testing of server post OS change.
- Lulu and Liang to work out service windows for Lulu to do necessary server administration tasks.
Strategy: 2-part
Get Liang set up with CAC's services (RedCloud?) so he can create the software and test it, paying per drink at that smaller scale.
- The outcome (and process) of using CAC's hardware and services will inform what needs to happen for the subsequent production mode.
Find a new home for Eldore short-term.
- 6-month interim phase: CRCF invests in hosting Eldore "on the margin" to allow review of hosting options.
- Evaluate options, including CRCF continuing to host Eldore (with compensation?). Or hardware hosting by CAC. Or hardware hosting by some other campus unit.
- Before 6 months, decide where Eldore can be hosted in a sustained manner (vs. an interim manner done "on the margin").
Identify a sustainable home for Eldore, longer-term
Project notes and efforts
4/3/13: Oliver spoke with Liang and Peter. Liang and Peter approved removing the (computational, GPU) video cards before Lulu upgrades the OS.
As of January, ACERT is not being charged by CCMR. This means the server is not being supported, so we should move it under support sooner rather than later.
Resources
http://www.cac.cornell.edu/services/projects.aspx
http://www.cac.cornell.edu/Services/rates/
Oliver's notes from the 11/28/12 meeting
Barry: Need to reinstall OS (since depends on AFS and CCMR's infrastructure).
Petr: Remove 4 GPUs and install them in Wintel boxes, one GPU per Wintel box. Barry: The GPUs are giant-sized: will they fit? Recommends consulting with CRCF before buying Windows-based computers, to confirm fit.
Barry: Currently: RedHat. Future idea: Scientific Linux 6 (based on RHEL 6), 64-bit.
Oliver: Noted that CRCF and CAC are using CentOS (also RHEL-based).
Barry: ~48-64GB RAM
Liang's computations: 64-bit. X-Windows. Batch.
Jack: Reported that Nandini has 40-50 nodes, managed by CAC. Very positive. 3 FTE cluster experts. Rapid response. Reasonable, for the nodes. Consulting at about $60/hr. Contact is Resa Alford, <rda1>.
Kevin: If Petr and Kevin are to manage it: Kevin has past experience with SUSE. Might need paid consulting.
Oliver: If managed by CRCF, likely would be CentOS. Unless good reason not to be that OS.
Oliver's understanding from discussion: Everyone else doesn't need the horsepower of a modern Eldore-class system. They can use CAC, with 32-bit (Skeeve-compatible) or 64-bit.
Decision at meeting:
Consider using CAC if it's a technical "fit", and pay per drink. Get an estimate before committing. In the short term, do this since Liang's work needs VERY high memory. BUT, note that doing this in production is likely not cost-effective compared to investing in own hardware.
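The cost comparison behind this decision can be roughed out as a break-even calculation. The sketch below is illustrative only: the function names and every dollar figure are hypothetical placeholders, not CAC rates or actual Eldore costs, and a real estimate from CAC should replace them before any commitment.

```python
def break_even_hours(fixed_cost, hosting_per_month, months, rate_per_hour):
    """Compute-hours at which owning hardware costs the same as pay-per-drink.

    All inputs are placeholder estimates (assumptions), in the same currency.
    """
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    owned_total = fixed_cost + hosting_per_month * months
    return owned_total / rate_per_hour


def cheaper_option(expected_hours, fixed_cost, hosting_per_month, months, rate_per_hour):
    """Return 'own' or 'pay-per-drink' for the expected usage over the period."""
    threshold = break_even_hours(fixed_cost, hosting_per_month, months, rate_per_hour)
    return "own" if expected_hours > threshold else "pay-per-drink"


if __name__ == "__main__":
    # Hypothetical numbers only: 10,000 up front, 100/month hosting,
    # 12 months, 2.00 per compute-hour on the pay-per-drink service.
    print(break_even_hours(10000, 100, 12, 2.0))
    print(cheaper_option(6000, 10000, 100, 12, 2.0))
```

Light development and testing use falls below the break-even point and favors pay-per-drink; hundreds of large production runs would likely sit above it, which matches the expectation recorded above.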
Past email threads, pre-meeting
Barry, 11/20/12, 2:34 PM
Eldor will need to be reinstalled. Our installation is tied to AFS and our infrastructure. Going to windows or a standalone Linux system seems like a good idea.
Peter, 11/20/12, 12:51 PM
It may be better to set up user accounts on, or to lease, a CAC v4-64g node (64 GB, Red Hat Linux). The lease option may be practicable if the node is used intensively. CAC operates in cost recovery mode; we just need to estimate fees.
ELDOR may not blend with any of CAC's blade systems, so it could be decommissioned and used locally at ACERT as a LINUX or WINDOWS 64-bit development platform. Kevin Hobbs and I can host it if CRCF would not.
Alternatively, I can use it for 3D-EM simulations, for example, since its GPUs are not used by any NLLS code. We surely can find a good use for it.
Oliver, 11/20/12, 10:23 AM
(1) Hosting the Eldore server. (Jack will be speaking with Nandini about CAC's services)