Excerpt |
---|
Cluster built on Widom's headnode. 1 headnode and xx compute nodes. |
See also
Differences from other clusters
Duo has been installed but is not currently enabled.
- Chemistry IT staff only: /etc/ssh/sshd_config is still the original; it must be modified for Duo to work (see the Duo documentation).
Password lockout is enabled: an account is locked after 20 incorrect password attempts and is unlocked automatically after 600 seconds.
Iptables is enabled.
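The lockout rule above (20 failures, 600-second unlock) can be illustrated with a minimal sketch. This is only a model of the policy, not the mechanism actually configured on the headnode, which this page does not name. The `Account` class and its methods are hypothetical, and two behaviours are assumptions: attempts made while locked are ignored, and the failure counter resets once the lock expires.

```python
from dataclasses import dataclass
from typing import Optional

MAX_FAILURES = 20     # incorrect passwords before lockout (from this page)
UNLOCK_SECONDS = 600  # seconds until automatic unlock (from this page)


@dataclass
class Account:
    """Hypothetical model of one user's lockout state."""
    failures: int = 0
    locked_at: Optional[float] = None  # time of the attempt that triggered the lock

    def is_locked(self, now: float) -> bool:
        """Return True while the lock is active; clear it once it has expired."""
        if self.locked_at is None:
            return False
        if now - self.locked_at >= UNLOCK_SECONDS:
            # Assumption: expiry clears both the lock and the failure counter.
            self.locked_at = None
            self.failures = 0
            return False
        return True

    def record_failure(self, now: float) -> None:
        """Count one incorrect password and lock the account at the threshold."""
        if self.is_locked(now):
            return  # Assumption: attempts during a lockout are not counted.
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            self.locked_at = now
```

For example, after 20 calls to `record_failure`, `is_locked` returns True until 600 seconds have passed since the twentieth failure.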
Node information
The age of each processor, along with more technical information, can be looked up on Wikipedia, which links to Intel's information for each processor.
Node | Motherboard version | Processor | Cores | Hyperthreading on | Memory | Hard Drive | Comments |
---|---|---|---|---|---|---|---|
headnode | Supermicro X8DTT 2.1c | Dual E5645 | 12 | N | 24GB | SW Raid 1 - (2) 2TB, no backintimelsh | |
bw001 | Supermicro X8DTT 2.1c | Dual E5645 | 12 | N | 48GB | SW Raid 0 - 2*6TB | |
bw002 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
bw003 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
bw004 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
bw005 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
bw006 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
bw007 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
rl001 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 1TB | |
rl002 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 320GB | |
rl003 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 320GB | |
rl004 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 320GB | |
rl005 | ? | ? | ? | ? | ? | ? | former loring headnode - currently off |
dbc001 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | (1) 64GB Patriot SSD | |
dbc002 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 8GB? | 1TB | |
dbc003 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 1TB | |
dbc004 | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | N | 24GB | (3) 4TB | |
dbc005 | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | N | 24GB | (3) 2TB | |
dbc006 | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | N | 24GB | (2) 2TB | |
dbc007 | Supermicro X9DRT-F v3.2 | Dual E5-2630 | 12 | N | 32GB | (1) 3TB | |
dbc008 | Supermicro X9DRT-F v3.2 | Dual E5-2630 | 12 | N | 32GB | (1) 3TB | |
dbc009 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | ? former SWRAID1 (2) 1TB, (1) 160GB backintime | former collum headnode - currently off |
hda001 | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 12GB | 64GB SSD | Moved to CAC 1/31/2017 |
hda002 | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 16GB | 64GB SSD | Moved to CAC 1/31/2017 |
hda003 | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 16GB | 64GB SSD | Moved to CAC 1/31/2017 |
hda004 | Asus Z8PE-D12 v1003 | Dual Xeon E5520 | 8 | Y | 24GB | 64GB SSD | Moved to CAC 1/31/2017 |
hda005 | Asus Z8PE-D12 v1401 | Dual Xeon E5520 | 8 | Y | 24GB | 250GB HD | |
hda006 | Asus Z8PE-D12 v1401 | Dual Xeon E5520 | 8 | Y | 24GB | 250GB HD | |
hda007 | Asus Z8PE-D12 v1003 | Dual Xeon E5645 | 12 | Y | 24GB | 1TB HD | |
hda008 | Asus Z8PE-D12 v1401 | Dual Xeon E5645 | 12 | Y | 24GB | 320GB HD | |
hda009 | Asus Z8PE-D12 v1401 | Dual Xeon E5645 | 12 | Y | 24GB | 320GB HD | |
"hda010" | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 8GB | dual 1TB soft raid + 1TB backup drive | former abruna headnode - Moved to CAC 1/31/2017 |
Software on this cluster
Applications | Version | Notes for CAC |
---|---|---|
quantum espresso | 6.0 (serial and mpi version) | "easy" |
gaussian | g09-A.02 & g09-D.01 | Lulu to install, please. She compiled it using PGI (PGI is used via ChemIT's license server) |
gromacs | gromacs-5.0.5 (with/without enable-mpi & GMX_DOUBLE=on/off) | "easy" |
intel compiler | 2014 (used for compiling gromacs, openmpi and mpich2) | ChemIT license server: Needs IP of cluster |
mathematica | 10.3.0 | ChemIT license server: Needs IP of cluster |
mopac | 2009 | "easy" |
mpich | 1.2.7.p1 | "easy" |
mpich2 | 1.4.1p1 | "easy" |
openmpi | 1.6.5 | "easy" |
webmo | 17.0.010e | Difficult/complex: Contact Lulu first |
Maintenance records
Collum cluster previous maintenance records
Abruna cluster previous maintenance records
3/4/16 - noting that the DSBF-DE motherboard has a v1007 update, but not worth the trouble for the age of the machines. X9DRT-F has a 3.2 upgrade from 3.0, can do when nodes free. - meh26
3/10/16-3/11/16 - updated X9DRT-F motherboards from 3.0 to 3.2 during downtime due to headnode drive replacement resyncing. - meh26
3/16/16 - updated IPMI card firmware from 2.0 to 3.0 - meh26
8/2/2016: Lulu: No firmware update from Michael. There are no security updates available via YUM. One hard drive (sda) is failing (smartctl found 30 errors). Forced fsck, no errors found. Modified /etc/ssh/sshd_config to keep idle SSH connections alive.
11/1/2016: Lulu: No firmware update from Michael. There are no security updates available via YUM. One hard drive (sda) is failing. Backed up the root partition with ddimage.
2/5/2017: Lulu: The Collum and Abruna clusters were merged into the "CLAW" cluster in Nov 2016. There was a power outage on Saturday 2/4. No firmware update. Deleted all running or queued jobs. No errors on hard drives. Forced fsck.
5/2/2017: Lulu: No firmware update. Forced fsck.
8/1/2017: Lulu: No firmware update from Michael. No errors on hard drives. Forced fsck.
2/6/2018: Lulu: No firmware update from Michael. No errors on hard drives. Forced fsck. No yum security updates.
5/1/2018: Lulu: No firmware update from Michael. No errors on hard drives.
8/7/2018: Michael: The firmware for the Z8PE-D12 motherboard in hda004-hda009 and dbc004-dbc006 was unexpectedly updated by Asus in the last few weeks to cover the Meltdown/Spectre vulnerabilities. Updated hda005, hda006, hda008, hda009, dbc004, dbc005, and dbc006. hda007 was not POSTing and was left as a problem for another day. hda004 was sent to CAC, so how that update gets done should be discussed.
8/7/2018: Lulu: No errors on hard drives. Forced fsck. No yum security updates.
11/6/2018: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates.
2/5/2019: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates. Forced fsck. hda005 did not pass the memory test and afterwards would not power on.
5/7/2019: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates. rl003 cannot be powered on (by power button). hda005 did not pass the memory test and afterwards would not power on.
8/6/2019: Lulu: Forced fsck. No firmware update from Michael. No errors on hard drives. No yum security updates. hda005 did not pass the memory test. The power states of bw001 and the headnode affect each other: shutting down (even from the command line) or powering on bw001 also shuts down or powers on the headnode. Michael changed the power strip, but that did not help.
11/5/2019: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates. hda005 did not pass the memory test. The bw001 and headnode power issue persists.
2/4/2020: Lulu: Forced fsck. No errors on hard drives. No yum security updates. hda005 did not pass the memory test. The bw001 and headnode power issue persists.
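The recurring "errors on hard drives" check in these records could be scripted along the following lines. This is a sketch only: the records mention smartctl but not the exact command, so the use of `smartctl -l error` is an assumption, as are the function names and the device path in the usage note. The parser handles the ATA error-log summary lines ("ATA Error Count: N" and "No Errors Logged") and returns None for anything else.

```python
import re
import subprocess
from typing import Optional


def parse_ata_error_count(smartctl_output: str) -> Optional[int]:
    """Extract the ATA error count from `smartctl -l error` output.

    Returns 0 for "No Errors Logged", the count for "ATA Error Count: N",
    and None if neither summary line is present (e.g. a non-ATA device).
    """
    match = re.search(r"^ATA Error Count:\s*(\d+)", smartctl_output, re.MULTILINE)
    if match:
        return int(match.group(1))
    if re.search(r"^No Errors Logged", smartctl_output, re.MULTILINE):
        return 0
    return None


def read_error_count(device: str) -> Optional[int]:
    """Run smartctl on one device (usually requires root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-l", "error", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit bits to flag logged errors
    )
    return parse_ata_error_count(result.stdout)
```

Calling `read_error_count("/dev/sda")` on a node would then give a number that can be compared against the previous quarter's entry.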