Prevent delays by directing support requests to NMR, not Chemistry IT or CIT. Thank you!
Summary expectations
- The owner and manager of this service is Ivan Keresztes <ik54>.
- Ivan is responsible for this server's function and continued maintenance, operations, and back-end documentation.
- Ivan is responsible for creating and implementing any desired enhancements.
- The server and its infrastructure reside on, or depend on, resources controlled or otherwise managed by Ivan.
- To be super-clear, not Chemistry IT or CIT.
If the above expectations are agreeable, no need to read further. Thank you.
======== DRAFT ========
Purpose of this write-up from Oliver, Chemistry IT Manager
This write-up is an investment by Oliver to clarify how NMR resolves problems, so that the right people and groups work on each problem and no delays occur because it was routed to the wrong ones. I also hope the write-up will inspire NMR staff to take additional steps to help prevent a crisis when problems do occur.
This is a public-facing web server critical to NMR's service offerings. Specifically, the tolerance for outage of this server is (define, please), per Ivan and Coates.
- (Consequence if not available for hours? Days? Lost data going back how long? Etc.)
Key recommendations from Chemistry IT
Many of the recommendations below are simply best practices. They are certainly worth investing in if the scheduling server is critically important to NMR's service delivery.
- Develop break/fix procedures for Ivan's group that are independent of Ivan, so they can serve the group when Ivan is away, if necessary.
- Document processes to ensure server software remains patched while ensuring continued functionality. Especially for a public-facing web server, this would include regularly patching, or upgrading over time, the OS, Apache, Perl, and their associated programs.
- Line up, document, and test processes to ensure the server is backed up and restorable to an acceptably recent point in time.
Example: Contact CIT to determine if they would be capable of and willing to provide expert support via their fee services. Information is available at <http://www.it.cornell.edu/about/atsus/iws/>. Explore whether CIT (or another firm) could serve as backup support for what Ivan knows about the server's set-up, which is especially important if he is away when a crisis occurs. Arranging this before there are problems increases the chance of getting an expert and rapid response, compared with waiting for a problem to occur. CIT (or another firm) might also be able to expertly and cost-effectively facilitate adding reasonable security or functional enhancements over time.
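As one illustration of testing the backup recommendation above, here is a minimal sketch of a check that the newest backup falls within an agreed tolerance. The function name, the 24-hour tolerance, and the timestamps are all hypothetical; the real tolerance is whatever Ivan and Coates define, and how the timestamp of the last backup is obtained depends on the backup method chosen.

```python
from datetime import datetime, timedelta


def backup_is_fresh(last_backup, now, max_age):
    """Return True if the newest backup is no older than max_age."""
    return now - last_backup <= max_age


if __name__ == "__main__":
    # Hypothetical values for illustration only; not this server's real data.
    max_age = timedelta(hours=24)               # assumed tolerance for lost data
    last_backup = datetime(2017, 1, 9, 23, 30)  # assumed time of newest backup
    now = datetime(2017, 1, 10, 9, 0)           # assumed time of the check

    if backup_is_fresh(last_backup, now, max_age):
        print("Newest backup is within the agreed tolerance.")
    else:
        print("Newest backup is too old; follow the documented break/fix procedure.")
```

A check like this only confirms that a backup exists and is recent. It does not replace periodically restoring from a backup to confirm the restore process actually works.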
Clarifying that CIT and Chemistry IT staff are not responsible
- CIT and Chemistry IT are not responsible for break/fix of the NMR web scheduler or any of its related infrastructure.
- CIT and Chemistry IT are not responsible for enhancements to the NMR web scheduler or related infrastructure.
Contextual, technical, and historical background, FYI
Contextual, technical information, as far as Oliver understands it
- The service runs on an Apache web server running on a Linux server, and depends on files and Perl scripts.
- Q: What are the current versions of the software, and will any of them start to lag?
- The Linux server is hosted within Amazon Web Services (AWS), via Cornell's contract.
- This incurs a monthly charge (amount?).
- The server is managed remotely by Ivan.
N.B. The AWS charges are currently going through CIT. CIT is processing the charge to their account as a favor to Chemistry so we did not have to create an account ourselves with AWS. (This can be changed, if desired.) CIT currently has no other persistent responsibilities or connections to this server.
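To help answer the question above about current software versions, here is a minimal sketch of a version inventory that could be run on the server and kept with its documentation. It assumes Python 3.7 or later is available there and that Perl and Apache report versions via `perl -v` and `httpd -v` (the Apache command name varies by distribution, for example `apache2 -v`). The function names are hypothetical, and nothing here reflects the versions actually installed.

```python
import re
import subprocess


def extract_version(text):
    """Return the first dotted version number in text (e.g. '2.4.6'), or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


def report_version(command):
    """Run a version command and return its dotted version, or None on failure."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError:
        # The command is not installed under this name on this system.
        return None
    return extract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    # Assumed command names; adjust to match the server's distribution.
    commands = {
        "Perl": ["perl", "-v"],
        "Apache": ["httpd", "-v"],
    }
    for name, command in commands.items():
        print(name, report_version(command) or "not found")
```

Recording this output after each round of patching gives a simple history for judging whether any component is starting to lag.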
Historical background, from Chemistry IT's perspective
Chemistry IT has served as trusted consultants to NMR regarding this server, including helping NMR get it migrated to Amazon Web Services (AWS).
- After Chemistry IT proposed AWS as a hosting solution, Chemistry IT assisted and encouraged Ivan to work with CIT staff. For free, CIT provided a generous amount of technical consulting, implementation work, and debugging to migrate the server from the extremely old hardware in 248 Baker Lab, and its old software, into the AWS infrastructure. CIT ensured the networking was correctly configured. They also debugged the software to ensure it would run correctly on more contemporary software (Linux OS, Apache, and Perl). Migration occurred Tuesday, Oct. 11, 2016, from about 8:45 am to 10-ish.
- Ivan signed off on migration's success on (date?).
- Q: Was the last problem detected by NMR on 12/19/16? (With CIT's help, hopefully that was subsequently resolved to completion.) That incident inspired this write-up, since Chemistry IT was contacted by NMR staff seeking assistance and we could not help. That break/fix process is not good for NMR or Chemistry IT.
Chemistry IT had hosted the hardware in 248 Baker Lab for many years, from before Oliver's arrival.
- The server had been in B-71 ST Olin before being moved into 248 Baker Lab.
- Shortly after Oliver's arrival in 2012, Oliver notified Ivan of our group's reluctance to continue hosting the server since we were concerned we would be inadvertently drawn into dealing with a preventable crisis. The risk of the crisis was high since the service was critical to NMR's service delivery and the server's hardware was so very old, as was the software it depended on.
- When the server was finally migrated off the hardware, that hardware was about 13 years old.
- Also, as a public-facing web server, we judged that it was unacceptably neglected in terms of best practices as well as practices defined by University policy and expectations. For example, it was not being patched or updated regularly or in a timely manner, if at all, against security vulnerabilities.
- It had been running an OS version from about October 2005 (RHEL 4.2). (Is this the correct OS and date?)
- Early in Oliver's tenure, Ivan hoped to re-write the scheduler. Those plans fell through over the following 4 years.
- The neglect of the server continued, and the potential for a crisis kept growing as the years passed.