Excerpt |
---|
Too many power outages in Baker Lab and other Chem buildings! |
ChemIT's record of recent power outages
Date | Outage duration | Cause | Official link | ChemIT notes |
---|---|---|---|---|
2/27/2014 | 2 minutes | ? | Some delayed time-stamps, from the IT folks: | Michael led our effort to initially evaluate and restore systems, with Oliver adding to documentation and to-do's. Lulu completed the cluster restoration efforts. |
1/27/2014 | 17-19 minutes | ? | | Lulu, Michael, and Oliver shut down headnodes and other systems which were on UPS. (Those systems not on UPS shut down hard, per usual.) |
12/23/2013 | 2 minutes | Human error? | | Terrible timing, right before the longest staff holiday of the year. |
7/17/13 | Half a morning (~2 hours) | | | |
...
Cluster | Done | Not done | Notes |
---|---|---|---|
Collum | X | | Sprin |
Lancaster, with Crane (new) | X | | Funded by Crane. |
Hoffmann | X | | |
Scheraga | X | | See below chart for stand-alone computational computers |
Loring | | X | Unique: Need to do ASAP |
Abruna | | X | Unique: Need to do ASAP |
C4 Headnode: pilot | X | | Provisioned on the margin, since still a pilot. |
Widom | | X | See "C4", above |
...
Computer | Done | Not done | Notes |
---|---|---|---|
Scheraga's 4 GPU rack-mounted computational computers | | X | Need to protect? |
NMR web-based scheduler | X | | |
Coates: MS SQL Server | | X | Unique: Need to do ASAP |
Do all switches: Maybe ~$340 ($170 * 2), every ~4 years.
- Recommend: Do ASAP.
- Q: Funding?
- Other issues and concerns with actually implementing this approach: Rack space. Maybe power. Maybe cord lengths. What other issues?
Do all compute nodes: ~$18K every ~4 years.
- ~20 20-amp UPSs ($900 each) required.
- These are from back-of-the-envelope calculations.
- If we were to actually implement this, there may be smarter ways to do it, but the cost will likely not be lower. (In fact, costs may be higher if a different approach offers a sufficiently greater benefit.)
- Issues and concerns with actually implementing this approach: Costs. Rack space. Maybe power. Maybe cord lengths. What other issues?
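The figures above can be sanity-checked with a quick sketch. The numbers ($170 per switch UPS, $900 per 20-amp node UPS, ~20 units, ~4-year replacement cycle) come from these notes; the function name and the annualized view are mine.

```python
# Back-of-the-envelope UPS cost sketch using the figures in the notes above.
# Assumes a ~4-year replacement cycle for all UPS units.

def annualized_cost(unit_price, units, lifetime_years=4):
    """Return (total purchase cost, cost per year over the replacement cycle)."""
    total = unit_price * units
    return total, total / lifetime_years

if __name__ == "__main__":
    switch_total, switch_yearly = annualized_cost(170, 2)   # "do all switches"
    node_total, node_yearly = annualized_cost(900, 20)      # "do all compute nodes"
    print(f"Switches: ${switch_total} total, ~${switch_yearly:.0f}/year")
    print(f"Compute nodes: ${node_total} total, ~${node_yearly:.0f}/year")
```

This reproduces the ~$340 and ~$18K totals quoted above and spreads them per year, which may help with the funding question.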
Compute node counts, for UPS pricing estimates. Does not include head node:
Cluster | Done | Not done | Notes |
---|---|---|---|
Collum | X | | Sprin |
Lancaster, with Crane (new) | X | | Funded by Crane. |
Hoffmann | X | | |
Scheraga | X | | See below chart for stand-alone computational computers |
Loring | | X | Unique: Need to do ASAP |
Abruna | | X | Unique: Need to do ASAP |
C4 Headnode: pilot | X | | Provisioned on the margin, since still a pilot. |
Widom | | X | See "C4", above |
Procedures and reminders
Reboot switches?
- 8 switches support the clusters (out of 12 in the room).
Start headnodes, if not on already
- Only a few are on UPS. Those can obviously be left on.
- None should be set to auto-start on power-off.
Confirm headnodes accessible via SSH
PuTTY on Windows
Use FMPro to get connection info?! (not the right info there, though...)
Menu => List Machine / User Selections => List Group or selected criteria
- Machine function => Server HN
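The "confirm headnodes accessible via SSH" step above could be scripted rather than checked by hand in PuTTY. A minimal sketch, assuming a plain TCP check on port 22; the hostnames are placeholders, not ChemIT's actual headnode names, and this only confirms the SSH port answers, not that a login works.

```python
# Sketch: check that each cluster headnode answers on the SSH port after an outage.
import socket

HEADNODES = ["collum-hn", "lancaster-hn", "hoffmann-hn"]  # placeholder names

def ssh_port_open(host, port=22, timeout=5):
    """Return True if a TCP connection to the host's SSH port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False

if __name__ == "__main__":
    for hn in HEADNODES:
        print(f"{hn}: {'up' if ssh_port_open(hn) else 'DOWN'}")
```

Headnodes that report DOWN would then get the usual console check before moving on to starting compute nodes.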
Start compute nodes
If nodes don't show up, consider:
...