Eldor must find a new physical home by Feb. 1st. Priority is Liang's project, which in the short term is not likely to require Eldor.
Team: Zhichun Liang, Peter Borbat, and Oliver Habicht.
Run humongous jobs, and run them hundreds of times. We expect that owning our own equipment is most cost-effective, but should reality-check this against other options such as CAC's RedCloud service (the two are not mutually exclusive).
To buy us time, move Eldor under CRCF's management for ~6 months. This requires a new OS and file system service to replace CCMR's AFS.
Required changes to Eldor are linked to moving off of CCMR's AFS. Hence the move from AFS is explained here, as a sub-project of the Eldor move project.
Draft start date: May 24
Duration: About two days.
Make sure everyone has alternative ways to access AFS.
Tentative date: May 29.
Tentative date: About a week after Eldor is turned off.
Draft start date: June 6.
Duration: About two days.
Draft start date: June 10.
Duration: About two days.
Draft start date: Week of June 10. As CIFS is added for each user and that user's data is moved to CIFS, turn off AFS for that user.
Duration: About one to two weeks. Do this quickly to reduce the overlap of some folks on AFS and some folks on CIFS.
Tentative date: Do soon after all accounts have been successfully moved to CIFS and have thus been turned off from AFS.
Needed in order to coordinate the downtime required to move the server and reinstall the OS.
Lulu is coordinating the downtime dates and downtime duration with Liang.
4/19 Discussions Oliver had:
4/17/13 (Wed): Oliver and Barry looked at Eldor's partitions. The /a and /b partitions are each 1.7TB in size. They looked like they had no data beyond that assigned as overhead by the OS.
Also, Barry recommends we use OS v6, such as CentOS 6.x. (Currently using RH 5.x.)
4/5/13 (Fri): Decision made that ChemIT (CRCF) will host Eldor for at least 6 months. This gives ChemIT more time to work with Jack, Liang and others on pricing out alternative hosting options, while in the meantime putting the server back under support. This includes Lulu coordinating with Liang and others on the timing of the server's move and its new OS install (see to-do's).
Liang and Peter have agreed to remove the (computational, GPU) video cards before Lulu upgrades the OS. These would go to Peter.
Peter writes (4/5/13):
Removing two power-hungry C2050 GPUs will make the process lot easier.
There is no current use for them in this system, but we need them for 3D EM simulations. There is no need to develop anything for 3D EM, since this is supported by the vendor.
At some point in the future we may develop useful CUDA support for NLLS.
This development process, however, will be well served by a smaller CUDA-enabled video card.
Get Liang set up with CAC's services (RedCloud?) so he can create the software and test it. And pay per drink at that smaller scale.
Find a new home for Eldor short-term.
Identify a sustainable home for Eldor, longer-term.
5/22/13: Lulu set up CIT SFS CIFS service, in preparation for Eldor's move to CRCF.
As of 5/22/13: Lulu worked out Linux/CU AD integration.
As of 5/1/13: Lulu is working on Linux/CU AD integration in preparation for Eldor's move to CRCF.
4/3/13: Oliver spoke with Liang and Peter. Liang and Peter approved removing the (computational, GPU) video cards before Lulu upgrades the OS.
As of January, ACERT is not being charged by CCMR. This means the server is not being supported, so we should move it under support sooner rather than later.
http://www.cac.cornell.edu/services/projects.aspx
http://www.cac.cornell.edu/Services/rates/
Barry: Need to reinstall the OS (since it depends on AFS and CCMR's infrastructure).
Petr: Remove 4 GPUs, and install in Wintel boxes, one GPU per Wintel box. Barry: Giant-sized GPUs: Will they fit? Recommend consulting with CRCF before buying Windows-based computers, to confirm fit.
Barry: Currently: RedHat. Future idea: Scientific Linux 6 (based on RHEL 6), 64-bit.
Oliver: Noted that CRCF and CAC are using CentOS (also, RHEL-based).
Barry: ~48-64GB RAM
Liang's computations: 64-bit. X-Windows. Batch.
Jack: Reported Nandini has 40-50 nodes, managed by CAC. Very positive. 3 FTE cluster experts. Rapid response. Reasonable, for the nodes. Consulting at about $60/hr. Contact is Resa Alford, <rda1>.
Kevin: If Petr and Kevin are to manage it: Kevin has past experience with SUSE. Might need paid consulting.
Oliver: If managed by CRCF, likely would be CentOS. Unless good reason not to be that OS.
Oliver's understanding from discussion: Everyone else doesn't need the horsepower of a modern Eldor-class system. They can use CAC, with 32-bit (Skeeve-compatible) or 64-bit.
Consider using CAC if it's a technical "fit", and pay per drink. Get an estimate before committing. In the short term, do this since we need VERY high memory. BUT, note that doing this in production is likely not cost-effective compared to investing in our own hardware.
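A rough way to frame the "get an estimate before committing" step is a break-even calculation between pay-per-drink use and owning hardware. The sketch below is illustrative only: the function names and every number passed to them are hypothetical placeholders, not CAC rates or actual Eldor costs, and would need to be replaced with real quotes.

```python
def ownership_cost(hardware_cost, yearly_hosting_cost, years):
    """Total cost of buying a server and hosting it for `years` years."""
    return hardware_cost + yearly_hosting_cost * years


def pay_per_use_cost(hourly_rate, hours_per_year, years):
    """Total cost of renting compute at `hourly_rate` for the same period."""
    return hourly_rate * hours_per_year * years


def break_even_hours_per_year(hardware_cost, yearly_hosting_cost, hourly_rate, years):
    """Yearly usage (in hours) above which owning becomes cheaper than renting."""
    return ownership_cost(hardware_cost, yearly_hosting_cost, years) / (hourly_rate * years)


if __name__ == "__main__":
    # All figures below are made-up placeholders for illustration.
    hours = break_even_hours_per_year(
        hardware_cost=10000, yearly_hosting_cost=1000, hourly_rate=1.0, years=2
    )
    print(f"Owning is cheaper above about {hours:.0f} hours of use per year")
```

If the expected yearly usage for the production runs sits well above the break-even figure, that supports the expectation that own hardware is more cost-effective; if it sits below, pay per drink remains the cheaper option.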
Eldor will need to be reinstalled. Our installation is tied to AFS and our infrastructure. Going to Windows or a standalone Linux system seems like a good idea.
It may be better to set up user accounts on, or to lease, a CAC v4-64g node (64 GB, Red Hat Linux). The lease option may be practicable if the node is used intensively. CAC operates in cost-recovery mode; we just need to estimate fees.
ELDOR may not blend with any of CAC's blade systems, so it could be decommissioned and used locally at ACERT as a LINUX or WINDOWS 64-bit development platform. Kevin Hobbs and I can host it if CRCF would not.
Alternatively, I can use it for 3D-EM simulations, for example, since its GPUs are not used by any NLLS code. We surely can find a good use for it.
(1) Hosting the Eldor server. (Jack will be speaking with Nandini about CAC's services)