Excerpt

NSF grant awarded. Thus, this project is a "go" as of August 2013.

...

Summary information for Hoffmann

...

researchers, as of 12/12/13:

Helios and Sol will be unavailable to researchers between Monday, Jan. 6th at 9am and approximately Thursday, Jan 9th, at noon.

  • Before the shutdown, all data of value must be copied off of Sol.
    • All user directory data will be removed.
  • Before the shutdown, all jobs on both Sol and Helios must be stopped.
    • Any remaining jobs will be stopped by ChemIT staff.
  • Only Sol will be up after this work is done. (Helios will no longer be available after all this is done.)
    • Data will be moved from Helios to Sol.
    • And Helios's nodes will be added to Sol.

Detailed information for cluster migration:

1) Reduce the amount of researchers’ temporary, test user data on Sol to the bare minimum during the break.

  • After the break, ChemIT will be deleting the content within everyone's test home directories on Sol. (Those directories

Recap:

During the recent test period in November, researchers confirmed that the new system's hardware and base configuration properly support all the software they currently depend on within Helios. We discovered important things that we could not have learned without this testing investment, which in turn will contribute to a more robust setup for your researchers.

Upcoming to-dos for Hoffmann researchers, as of 11/27/13:

(1) As soon as possible, ensure researchers' user data on Sol can be completely deleted.

  • We will be deleting all the home directories. (They were originally copied there as a testing convenience during our test phase.)
    • We do this deletion before the final copy to ensure a clean import of researchers’ current production home directories from Helios.
    • Lulu advised researchers of this step before the Thanksgiving break, so hopefully no surprises.
  • During the winter break, Sol's data will be backed up.
    • As with Helios, Sol's data will be copied and versioned to a dedicated, internal hard drive.
    • NEW: EZ-Backup will be enabled on Sol, providing off-site backup protection.

Bottom line: If there are any files on Sol which researchers cannot afford to lose (hopefully not: it was in test mode, and it's not being backed up!), researchers must move that data to Helios (or elsewhere) right away (deadline?), before Sol is turned off.

2)

...

Monday, Jan. 6th at 9am: Helios AND Sol will be turned off for researchers

Researchers must know that:

  • ALL jobs must cease by then. Any remaining jobs will be turned off by ChemIT.
  • Also, there must be no remaining research data on Sol.
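One way to check how much data still sits in each user directory before the shutdown is sketched below. This is only an illustration, not a ChemIT tool: the function name `home_dir_usage` is hypothetical, and the location of the user directories on Sol is an assumption that must be supplied by whoever runs it.

```python
import os
from pathlib import Path


def home_dir_usage(home_root):
    """Return {directory name: total bytes of regular files} for each
    directory found directly under home_root."""
    usage = {}
    for entry in sorted(Path(home_root).iterdir()):
        # Only real directories count as user directories here.
        if not entry.is_dir() or entry.is_symlink():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks so data stored elsewhere is not counted.
                if os.path.islink(path):
                    continue
                try:
                    total += os.path.getsize(path)
                except OSError:
                    # Unreadable or vanished file: ignore it in this sketch.
                    pass
        usage[entry.name] = total
    return usage
```

Calling `home_dir_usage("/home")` (the path is an assumption) returns a dictionary such as `{"alice": 5, "bob": 0}`; any directory with a large total is worth a second look before its contents are deleted.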

Tasks for ChemIT to do:

  • Copy researchers' production home directory data from Helios to Sol.
  • Move Helios's compute nodes into Sol's rack and reconnect all networking and power.
    • This allows us to add the older compute nodes from Helios, making Sol more powerful.
  • Testing by ChemIT.
    • Coordinate with Huayun (Deng) for initial testing, perhaps sometime Wednesday.
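The copy step above could be rehearsed with an rsync dry run before the real transfer. The sketch below only builds the command; it does not run it. The host name `helios`, the `/home/` paths, and the choice of the `-H` and `--numeric-ids` options are assumptions for illustration, not the settings ChemIT actually uses.

```python
import shlex


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync argument list for copying home directories.

    -a             archive mode (recursive; keeps permissions, times,
                   owners, and symlinks)
    -H             preserve hard links
    --numeric-ids  keep numeric UID/GID values instead of mapping by name
    -n             dry run: report what would be copied, change nothing
    """
    command = ["rsync", "-a", "-H", "--numeric-ids"]
    if dry_run:
        command.append("-n")
    command += [source, destination]
    return command


# Hypothetical host and paths, for illustration only.
command = build_rsync_command("helios:/home/", "/home/")
print(shlex.join(command))
# prints: rsync -a -H --numeric-ids -n helios:/home/ /home/
```

Keeping `dry_run=True` as the default means the printed command is safe to paste as-is; the real copy requires passing `dry_run=False` explicitly.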

3) Thursday, Jan 9th, at noon: Expect Sol available to researchers.

  • At this point, Sol will be up in full production and open to all researchers, with some or all of Helios's compute nodes attached.

The Hoffmann group is encouraged to improve their software installations in order to create a more robust environment and to improve support outcomes.

Tasks for Hoffmann group:

  • Confirm that Sol is "good". That is, there is no need to "roll back" to using Helios's head node and for us to start again.
  • Hoffmann group members are encouraged to install their shared group software in the /home/hoffmann/bin/ directory, including:
    • Materials Studio software. (Who?
  • Sol will need to be turned off and worked on by us. (Approximate duration?) Therefore, it will be offline to researchers.
  • Ad hoc permission will be granted to select researchers (Prasad?) so they may install and test their software.
  • Tasks for us to do:
    • Get the server's versioning hard disk recognized within the new head node.
    • Lulu to install the Intel compiler in the /software/ directory, with the proper licensing.
    • Your group (Prasad?) to install the Materials Studio software in the /data/ Hoffmann/bin/ directory. If Prasad is not available to do this, let’s discuss your options. Thank you.

3) On a predetermined date, researchers will be asked to test select software on Sol.

  • Changes to where the Intel compiler and Materials Studio reside (to facilitate debugging and support) must be confirmed to work for researchers.
  • Again, we are testing on Sol. Thus, the only production work should still be on Helios, not Sol.

REMINDER: Researchers must again ensure their user data on Sol can be completely deleted.

4) On a predetermined date, researchers will again stop having access to Sol until it is ready for production.

  • During this time, we will sync researchers' home directories from Helios.
    QUESTION for Lulu and Michael: Is this really better than giving each researcher a bare home directory and asking them to move only the data from Helios which they need?

5) On a predetermined date, researchers must plan for 1-2 days of downtime for Helios, after which Sol will be available and in full production.

    • vasp
    • gulp
    • etc.

Tasks for ChemIT:

  • Once Hoffmann researchers confirm Sol is "good", Helios's head node can be added as a compute node to Sol.

NOTE: Helios will no longer be available after all this is done.

  • Data will be moved from Helios to Sol.
  • And Helios's nodes will be added to Sol.
  • We obviously must coordinate this cut-over to ensure no loss of researchers' production data and to minimize the inconvenience of the downtime.
  • With Sol in production, we can:
    • Add the older compute nodes from Helios, making Sol more and more powerful.
    • Once Sol is confirmed "good", add Helios's head node as a compute node to Sol.

Thank you -ChemIT

...

Older or other notes:

Data rates

12/11/13: User data to transfer from Helios to Sol is about 80 GB. The transfer, using rsync, is expected to take about 24 hours.
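For reference, the average rate implied by those two figures can be worked out as below. This is plain arithmetic on the numbers in the note, assuming decimal gigabytes and a full 24 hours; it says nothing about what limits the transfer, which the note does not state.

```python
def implied_throughput(total_bytes, duration_seconds):
    """Return the average transfer rate in bytes per second."""
    return total_bytes / duration_seconds


total_bytes = 80 * 10**9          # about 80 GB (decimal gigabytes assumed)
duration_seconds = 24 * 60 * 60   # about 24 hours

rate = implied_throughput(total_bytes, duration_seconds)
print(f"{rate / 1e6:.2f} MB/s")        # prints "0.93 MB/s"
print(f"{rate * 8 / 1e6:.1f} Mbit/s")  # prints "7.4 Mbit/s"
```

If the observed rate during a trial copy is much higher than this, the 24-hour estimate is conservative and the downtime window has more slack than planned.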

Next steps

  • Meet to review all options and confirm desired direction and expected timing.
  • Review resources. Huayun Gen has cluster management experience, including set-up.

...