Eldor must find a new physical home by Feb. 1st. The priority is Liang's project, which in the short term is not likely to require Eldor.
Project lead: Lulu Zhu <cz67@cornell.edu>
Team: Zhichun Liang, Peter Borbat, and Oliver Habicht.
Goal
Run humongous jobs, and run them hundreds of times. We expect that having our own equipment is most cost-effective, but will reality-check against other options such as CAC's RedCloud service (not mutually exclusive).
To buy us time, move Eldor under CRCF's management for ~6 months. This requires a new OS and file system service to replace CCMR's AFS.
Note
Required changes to Eldor are linked to moving off of CCMR's AFS. Hence the move from AFS is explained here, as a sub-project of the Eldor move project.
Steps and draft timeline
Ensure CIFS (replacement for AFS) is working for Liang. Includes testing permissions, in particular write access.
Draft start date: May 24
Duration: About two days.
Make sure everyone has alternative ways to access AFS.
Turn off Eldor
Tentative date: May 29.
Turn on Eldor, connected to CIFS
Tentative date: About a week after Eldor is turned off.
Ensure Eldor is working for Liang. Includes testing ssh, software.
Draft start date: June 6.
Duration: About two days.
Move Liang's data from AFS to CIFS. Ensure CIFS is working for Liang.
Draft start date: June 10.
Duration: About two days.
After the move is done, Liang will start accessing Eldor.
Move other users' data from AFS to CIFS one by one.
Draft start date: Week of June 10.
Duration: About one to two weeks.
Turn off AFS (waiting until now allows us to roll back, if Eldor must be reverted)
Tentative date: Soon after Eldor is functional.
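The permission tests and data moves in the steps above could be checked with a small script. The following is only a sketch, not the procedure ChemIT actually used: the function names are hypothetical, and it assumes both the AFS source and the CIFS share are mounted as ordinary directories on the machine running it. It tests write access by creating, reading back, and deleting a small file, and it compares a copied directory tree against its source by SHA-256 checksum.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def check_write_access(directory):
    """Return True if a small file can be created, read back, and deleted."""
    payload = b"cifs write test\n"
    try:
        fd, name = tempfile.mkstemp(prefix="write_test_", dir=directory)
    except OSError:
        # Missing directory or no permission to create files in it.
        return False
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        with open(name, "rb") as handle:
            return handle.read() == payload
    except OSError:
        return False
    finally:
        os.remove(name)


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of one file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(source_root, copy_root):
    """Return (missing, changed): files absent from the copy, and files
    present in both trees whose contents differ."""
    source = tree_digests(source_root)
    copy = tree_digests(copy_root)
    missing = sorted(set(source) - set(copy))
    changed = sorted(
        name for name in source.keys() & copy.keys()
        if source[name] != copy[name]
    )
    return missing, changed
```

For example, after copying one user's directory, calling compare_trees on the AFS path and the CIFS path should return two empty lists before the AFS copy is considered safe to retire. The script checks file contents only; ownership and permission bits would need a separate check.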
Status
Must identify all current users of Eldor
Needed in order to coordinate the downtime required to move the server and reinstall the OS.
Lulu is coordinating the downtime dates and downtime duration with Liang.
4/19 Discussions Oliver had:
- Alex using Eldor regularly. He simply needs to be notified of the downtime and expected duration of downtime. No other coordination expected, per him.
- Peter Borbat not currently using Eldor. Nor will he be using it in the short term. No need to keep him informed, per him.
4/17/13 (Wed): Oliver and Barry looked at Eldor's partitions. The /a and /b partitions are each 1.7TB in size. They looked like they had no data beyond that assigned as overhead by the OS.
Also, Barry recommends we use a version 6 OS, such as CentOS 6.x. (Currently using RH 5.x.)
4/5/13 (Fri): Decision made that ChemIT (CRCF) will host Eldor for at least 6 months. This gives ChemIT more time to work with Jack, Liang and others on pricing out alternative hosting options, while putting the server back under support in the meantime. This includes Lulu coordinating with Liang and others on the timing of the server's move and its new OS install (see to-do's).
Liang and Peter have agreed to remove the (computational, GPU) video cards before Lulu upgrades the OS. These would go to Peter.
- Oliver's understanding, based on prior conversation with Barry: There are 2 high-end GPUs and one low-end one. These would be removed. The on-board video remains, of course.
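The partition check recorded in the 4/17/13 entry could be repeated before the reinstall to confirm that /a and /b still hold no user data. This is a minimal sketch, assuming it is run on Eldor itself with those partitions mounted; the function name usage_report is hypothetical.

```python
import shutil


def usage_report(mount_points):
    """Return total size and space in use for each mount point."""
    report = {}
    for mount in mount_points:
        total, used, free = shutil.disk_usage(mount)
        report[mount] = {
            "total_tb": total / 1e12,          # decimal terabytes
            "used_gb": used / 1e9,             # decimal gigabytes
            "used_pct": 100.0 * used / total,  # share of the partition in use
        }
    return report
```

Calling usage_report(["/a", "/b"]) on Eldor would show whether usage is still limited to file system overhead. A low "used" figure alone does not prove the partitions are empty, so a directory listing remains worthwhile.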
To do's
- Lulu and Liang, with Barry, work out timing for when server gets moved from CCMR to ChemIT, and the necessary steps.
- Lulu and Liang work out downtime required for Lulu to replace OS with an OS ChemIT can support.
- ChemIT staff propose file share service to be used by Eldor post-migration, with pricing estimates.
- File share system goes through a technical review by Barry, Liang, Peter (and others?) so they can recommend it to Jack.
- Jack approves the file share service his staff recommend.
- Migration done per schedule created in previous to-do's.
- Server is moved to ChemIT.
- Lulu removes the GPUs and gives them to Peter.
- Lulu installs OS.
- Lulu and Liang test server post OS change.
- Lulu and Liang to work out service windows for Lulu to do necessary periodic server administration tasks.
- Liang confirms server is operational.
Peter writes (4/5/13):
Removing two power-hungry C2050 GPUs will make the process a lot easier.
There is no current use for them in this system, but we need them for 3D EM simulations. There is no need to develop anything for 3D EM, since this is supported by the vendor.
At some point in the future we may develop useful CUDA support for NLLS.
This development process, however, will be well served by a smaller CUDA-enabled video card.
Strategy: 2-part
Get Liang set up with CAC's services (RedCloud?) so he can create the software and test it, paying per drink at that smaller scale.
- The outcome (and process) of using CAC's hardware and services will inform what needs to happen for the subsequent production mode.
Find a new home for Eldor short-term.
- 6-month interim phase: CRCF invests in hosting Eldor "on the margin" to allow review of hosting options.
- Evaluate options, including CRCF continuing to host Eldor (with compensation?). Or hardware hosting by CAC. Or hardware hosting by some other campus unit.
- Before 6 months, decide where Eldor can be hosted in a sustained manner (vs an interim manner done "on the margin").
Identify a sustainable home for Eldor, longer-term
Project notes and efforts
5/22/13: Lulu set up CIT SFS CIFS service, in preparation for Eldor's move to CRCF.
As of 5/22/13: Lulu worked out Linux/ CU AD integration.
As of 5/1/13: Lulu is working on Linux/ CU AD integration in preparation for Eldor's move to CRCF.
4/3/13: Oliver spoke with Liang and Peter. Liang and Peter approved removing the (computational, GPU) video cards before Lulu upgrades the OS.
As of January, ACERT is not being charged by CCMR. This means the server is not being supported, so we should move that server under support sooner rather than later.
Resources
http://www.cac.cornell.edu/services/projects.aspx
http://www.cac.cornell.edu/Services/rates/
Oliver's meeting notes, 11/28/12's mtg
Barry: Need to reinstall OS (since depends on AFS and CCMR's infrastructure).
Petr: Remove 4 GPUs, and install in Wintel boxes, one GPU per Wintel box. Barry: Giant sized GPUs: Will they fit? Recommend consult with CRCF before buying Windows-based computers, to confirm fit.
Barry: Currently: RedHat. Future idea: Scientific Linux 6 (based on RHEL 6), 64-bit.
Oliver: Noted that CRCF and CAC are using CentOS (also, RHEL-based).
Barry: ~48-64GB RAM
Liang's computations: 64-bit. X-Windows. Batch.
Jack: Reported Nandini has 40-50 nodes, managed by CAC. Very positive. 3 FTE cluster experts. Rapid response. Reasonable, for the nodes. Consulting at about $60/hr. Contact is Resa Alford, <rda1>.
Kevin: If Petr and Kevin to manage: Kevin has past experience with SUSE. Might need paid consulting.
Oliver: If managed by CRCF, likely would be CentOS. Unless good reason not to be that OS.
Oliver's understanding from discussion: Everyone else doesn't need the horsepower of a modern Eldor-class system. They can use CAC, with 32-bit (Skeeve-compatible) or 64-bit.
Decision at meeting:
Consider using CAC if it's a technical "fit", and pay per drink. Get estimate before committing. In the short term, do this since we need VERY high memory. BUT, note that doing this in production is likely not cost-effective compared to investing in own hardware.
Past email threads, pre-meeting
Barry, 11/20/12, 2:34 PM
Eldor will need to be reinstalled. Our installation is tied to AFS and our infrastructure. Going to windows or a standalone Linux system seems like a good idea.
Peter, 11/20/12, 12:51 PM
It may be better to set up user accounts on, or to lease, a CAC v4-64g node (64 GB, Red Hat Linux). Lease option may be practicable if the node is used intensively. CAC operates in cost recovery mode, we just need to estimate fees.
ELDOR may not blend with any of CAC's blade systems, so it could be decommissioned and used locally at ACERT as a LINUX or WINDOWS 64-bit development platform. Kevin Hobbs and I can host it if CRCF would not.
Alternatively, I can use it for 3D-EM simulations, for example, since its GPUs are not used by any NLLS code. We surely can find a good use to it.
Oliver, 11/20/12, 10:23 AM
(1) Hosting the Eldore server. (Jack will be speaking with Nandini about CAC's services)