

Eldor must find a new physical home by Feb. 1st. Priority is Liang's project, which in the short term is not likely to require Eldor.

Project lead:

...

Lulu Zhu <cz67@cornell.edu>

Team: Zhichun Liang, Peter Borbat, Lulu Zhu, and Oliver Habicht.

Goal

Run humongous jobs, and run them hundreds of times. Expect that having our own equipment is most cost-effective, but reality-check against other options such as CAC's RedCloud service (not mutually exclusive).

To buy us time, move Eldor under CRCF's management for ~6 months. This requires a new OS and file system service to replace CCMR's AFS.

Note

Required changes to Eldor are linked to moving off of CCMR's AFS. Hence the move from AFS is explained here, as a sub-project of the Eldor move project.

Steps and draft timeline

Ensure CIFS (replacement for AFS)

...

is working for Liang. Includes testing permissions, including writes.

Draft start date:  May 24

Duration: About two days.
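
The permission test in this step (including writes) could be scripted along the following lines. This is only a minimal sketch: it assumes the CIFS share is already mounted on the machine where it runs, the function name `check_share_access` is made up for illustration, and the mount point passed in is whatever path the share ends up on. It lists the directory, writes a small probe file, reads it back, and deletes it.

```python
import os
import uuid


def check_share_access(mount_point):
    """Report which basic operations succeed under mount_point.

    mount_point is an assumed local path where the share is mounted.
    """
    results = {"list": False, "write": False, "read_back": False, "delete": False}

    # Listing fails if the share is not mounted or not readable.
    try:
        os.listdir(mount_point)
        results["list"] = True
    except OSError:
        return results

    # Use a unique probe name so no existing file is overwritten.
    probe = os.path.join(mount_point, ".cifs_probe_" + uuid.uuid4().hex)
    payload = b"eldor cifs write test\n"
    try:
        with open(probe, "wb") as fh:
            fh.write(payload)
        results["write"] = True
        with open(probe, "rb") as fh:
            results["read_back"] = fh.read() == payload
    except OSError:
        pass
    finally:
        # Clean up the probe; this also confirms delete permission.
        try:
            os.remove(probe)
            results["delete"] = True
        except OSError:
            pass
    return results
```

Running it as Liang's account, rather than as an administrator, is what would actually exercise his permissions.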

Make sure everyone has alternative ways to access AFS.

Turn off Eldor

Tentative date: May 29.

Turn on Eldor, connected to CIFS

Tentative date: About a week after Eldor is turned off.

Ensure Eldor is working for Liang. Includes testing ssh and software.

Draft start date: June 6.

Duration: About two days.

Move Liang's data from AFS to CIFS. Ensure CIFS is working for Liang.

Draft start date: June 10.

Duration: About two days.
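
One way to confirm that a data move like this completed cleanly is to compare the source and destination trees by checksum before AFS is turned off for that user. The sketch below assumes both the AFS directory and the CIFS directory are reachable as local paths on the same machine; `compare_trees` and `file_digest` are illustrative names, not existing tools.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(source_root, dest_root):
    """Return (missing, mismatched) lists of paths relative to source_root.

    missing: files present under source_root but absent under dest_root.
    mismatched: files present in both whose contents differ.
    """
    missing, mismatched = [], []
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(dest_root, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            elif file_digest(src) != file_digest(dst):
                mismatched.append(rel)
    return sorted(missing), sorted(mismatched)
```

Two empty lists would suggest every source file arrived intact. The check covers file contents only, not ownership or permissions.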

Move other users' data from AFS to CIFS one by one.

Draft start date: Week of June 10. As CIFS is added for each user and that user's data is moved to CIFS, turn off AFS for that user.

Duration: About one to two weeks. Do quickly to reduce the overlap of some folks on AFS and some folks on CIFS.

Turn off AFS (waiting this late allows us to roll back if Eldor must be reverted)

Tentative date: Soon after all accounts have been successfully moved to CIFS and have thus been turned off on AFS.

Status

Must identify all current users of Eldor

Required to coordinate the downtime required to move and reinstall the OS.

Lulu is coordinating the downtime dates and downtime duration with Liang.

4/19 Discussions Oliver had:

  • Alex using Eldor regularly. He simply needs to be notified of the downtime and expected duration of downtime. No other coordination expected, per him.
  • Peter Borbat not currently using Eldor. Nor will he be using it in the short-term. No need to keep him informed, per him.

4/17/13 (Wed): Oliver and Barry looked at Eldor's partitions. The /a and /b partitions are each 1.7TB in size. Both appeared to hold no data beyond what the OS assigns as overhead.

Also, Barry recommends we use a v6 OS, such as CentOS 6.x. (Currently using RH 5.x.)

...

4/5/13 (Fri): Decision made that ChemIT (CRCF) will host the server Eldor for at least 6 months. This gives ChemIT more time to work with Jack, Liang and others on pricing out alternative hosting options, while in the meantime putting the server back under support. This includes Lulu coordinating with Liang and others on the timing of the server's move and its new OS install (see to-do's).

Decision: Liang and Peter have agreed to remove the (computational, GPU) video cards before Lulu upgrades the OS. These would go to Peter.

  • Oliver's understanding, based on prior conversation with Barry: There are 2 high-end GPUs and one low-end one. These would be removed. The on-board video remains, of course.

To do's

...

  • Lulu and Liang, with Barry, work out timing for when server gets moved from CCMR to ChemIT, and the necessary steps.
  • Lulu and Liang work out downtime required for Lulu to replace OS with an OS ChemIT can support.
  • ChemIT staff propose file share service to be used by Eldor post-migration, with pricing estimates.
  • File share system goes through a technical review by Barry, Liang, Peter (and others?) so they can recommend it to Jack.
  • Jack approves the file share service his staff recommend.
  • Migration done per schedule created in previous to-do's.
    • Server is moved to ChemIT.
    • Lulu removes the GPUs and gives them to Peter.
    • Lulu installs OS.
  • Lulu and Liang to work out testing of test server post OS change.
  • Lulu and Liang to work out service windows for Lulu to do necessary periodic server administration tasks.
  • Liang confirms server is operational.

...

Peter writes (4/5/13):

Removing two power-hungry C2050 GPUs will make the process lot easier.

There is no current use for them in this system, but we need them for 3D EM simulations. There is no need to develop anything for 3D EM, since this is supported by the vendor.

At some point in the future we may develop useful CUDA support for NLLS.

This development process, however, will be well served by a smaller CUDA-enabled video card.

...

Strategy: 2-part

Get Liang set up with CAC's services (RedCloud?) so he can create the software and test it, paying per drink at that smaller scale.

...

Project notes and efforts

5/22/13: Lulu set up CIT SFS CIFS service, in preparation for Eldor's move to CRCF.

As of 5/22/13: Lulu worked out Linux/CU AD integration.

As of 5/1/13: Lulu is working on Linux/CU AD integration in preparation for Eldor's move to CRCF.

4/3/13: Oliver spoke with Liang and Peter. Liang and Peter approved removing the (computational, GPU) video cards before Lulu upgrades the OS.

...