See also
- UPS inventory and status, mostly within ChemIT's Baker 248 server room
- Power outage record for Monday, April 11th
ChemIT's record of recent power outages
Date | Outage duration | Cause | Official link | ChemIT notes |
---|---|---|---|---|
2/4/2017 Saturday | Over 2 hours, 5:45-8pm, although power came back temporarily for a short time near the end of this interval. Started Saturday, late afternoon (alert sent Saturday, February 4, 2017 5:46 PM). Sat 2/4/2017 7:44 PM, Michael wrote: "7:15pm to the . I got in and was starting to look at what was up and what was down, but then I lost access to everything. I think the room lost power again at 7:35pm." Sat 2/4/2017 9:12 PM, Michael wrote: "Power appears to have come back at 8:05pm." | ? | https://itservicealerts.hosting.cornell.edu/view/4655 (Nominal start reported was delayed by ~1 hour: "Event: 2017-02-04 18:44:00" (6:44pm). Per the report, largely resolved: "As of 7:45pm, Cornell reported that power had been restored.") | Michael had logged in and noted the system had shut down, then was himself kicked out, indicating a second power outage during this time. A NetAdmin-L message from Jamie Rosner Duong ("Are parts of Cornell experiencing a power outage?") went out roughly 10 minutes before the official alert. Early on, "NYSEG has estimated restoration around 8:15-8:30pm". |
4/11/16 Monday | About 40 minutes, starting shortly after noon. Official emails: Mon, Apr 11, 2016 at 12:25 PM: "CornellALERT: ITHACA CAMPUS POWER OUTAGE"; Mon, Apr 11, 2016 at 12:59 PM: "CornellALERT: ITHACA CAMPUS POWER OUTAGE - UPDATE" (12:48) | Damaged transmission line, per Sun article quoting Melissa Hines: http://cornellsun.com/2016/04/11/damaged-transmission-line-responsible-for-cornell-power-outage/ | | Servers mostly OK. Some UPS failures. Some router failures. See our notes on this particular outage. |
2/27/2014 | 2 minutes | Procedural error? | No link to info in 3/3/14 email? | Michael led the initial effort to evaluate and restore systems, with Oliver adding to documentation and to-dos. Lulu completed the cluster restoration efforts. |
1/27/2014 | 17-19 minutes | ? | | Lulu, Michael, and Oliver shut down head nodes and other systems which were on UPS. (Systems not on UPS shut down hard, as usual.) |
12/23/2013 | 2 minutes | Procedural error? | | Terrible timing, right before the longest staff holiday of the year. |
7/17/13 | Half a morning (~2 hours) | | | |
Question: When power is initially restored, do you trust it? Or might it simply kick back off in some circumstances?
- Because we don't know the answer to this question after any specific power outage, we are reluctant to turn servers back on right away. Instead, we prefer to wait ~10-25 minutes (a rough timer/canary sketch follows).
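That wait-and-verify step can be scripted. Below is a minimal sketch in Python, assuming a Linux/macOS admin workstation; the canary hostname and all timings are placeholders, not actual ChemIT values.

```python
# Hedged sketch: after power returns, wait a grace period before trusting it,
# then watch a "canary" device in the server room for continued reachability.
# Hostname and timings below are placeholders, not ChemIT's actual values.
import subprocess
import time

CANARY_HOST = "switch-1.example.cornell.edu"  # hypothetical device in Baker 248
GRACE_PERIOD_MIN = 15        # within the ~10-25 minute window noted above
WATCH_MINUTES = 10           # how long to confirm power/network stays up
CHECK_INTERVAL_SEC = 60

def is_reachable(host: str) -> bool:
    """One ICMP ping; '-c'/'-W' are the Linux flags (Windows uses '-n'/'-w')."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

print(f"Waiting {GRACE_PERIOD_MIN} minutes before trusting restored power...")
time.sleep(GRACE_PERIOD_MIN * 60)

failures = 0
for _ in range(WATCH_MINUTES):
    if not is_reachable(CANARY_HOST):
        failures += 1
        print(f"{CANARY_HOST} not reachable -- power may have dropped again.")
    time.sleep(CHECK_INTERVAL_SEC)

print("Stable" if failures == 0 else f"{failures} failed checks; hold off on servers.")
```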
Procedures and reminders
Reboot switches?
- 8 switches support the clusters (out of 12 in the room).
Start head nodes, if not on already
- Only a few are on UPS. Those can obviously be left on.
- None should be set to auto-start when power returns; they must be started by hand.
Confirm head nodes accessible via SSH
- PuTTY on Windows
- Use FMPro to get connection info?! (not the right info there, though...)
- Menu => List Machine / User Selections => List Group or selected criteria
- Machine function => Server HN (a quick scripted reachability check is sketched after this list)
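For a quick first pass before opening PuTTY sessions one by one, a port-22 check against the head-node list can be scripted. A minimal Python sketch follows; the hostnames are placeholders (the real list comes from FMPro as noted above).

```python
# Hedged sketch: quick check that each head node answers on TCP port 22 (SSH).
# The hostnames are placeholders -- the real list comes from FMPro
# (Machine function => Server HN), not from this script.
import socket

HEAD_NODES = ["hn-example-1", "hn-example-2"]  # hypothetical names

def ssh_port_open(host: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to port 22 succeeds."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

for node in HEAD_NODES:
    status = "OK" if ssh_port_open(node) else "NOT reachable"
    print(f"{node}: SSH {status}")
```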
Start compute nodes
If nodes don't show up, consider the following (a sketch for listing down nodes appears after this list):
- Restart Torque scheduler on problematic nodes.
- Try rebooting the switch the affected nodes are connected to, especially if the problematic nodes are grouped to a single switch.
- Hook up a monitor and watch one of the problematic nodes as it boots.
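To see which compute nodes the scheduler still considers down (and whether they all hang off one switch), something like the following can be run from a head node. This is a hedged sketch assuming Torque's standard `pbsnodes` utility is on the PATH; the output parsing is approximate.

```python
# Hedged sketch: ask the Torque scheduler which compute nodes it considers
# down/offline, to help spot whether the missing nodes share a single switch.
# Assumes Torque's `pbsnodes` command is available where this is run
# (e.g. a head node); output parsing is approximate.
import subprocess

def list_problem_nodes() -> list[tuple[str, str]]:
    """Run `pbsnodes -l`, which lists nodes marked down/offline/unknown."""
    result = subprocess.run(
        ["pbsnodes", "-l"], capture_output=True, text=True, check=True
    )
    nodes = []
    for line in result.stdout.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            nodes.append((parts[0], " ".join(parts[1:])))  # (node, state)
    return nodes

for node, state in list_problem_nodes():
    print(f"{node}: {state}")
```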