Cluster built on Widom's headnode. 1 headnode and xx compute nodes.




Differences from other clusters

Duo has been installed but is not currently enabled.

  • Chemistry IT staff only: /etc/ssh/sshd_config is still the original; it must be modified for Duo to work (see the Duo documentation).

Password lockout is enabled. Users are locked out of their account after 20 incorrect password entries; the account is unlocked after 600 seconds.
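The lockout rule above can be sketched as a small Python model. This is only an illustration of the stated policy (20 failures, 600-second unlock), not the actual PAM configuration on the headnode. The names Account, is_locked, and attempt_login are hypothetical, and resetting the failure count after a successful login is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

MAX_FAILURES = 20      # incorrect passwords before lockout (from this page)
UNLOCK_SECONDS = 600   # lockout duration in seconds (from this page)


@dataclass
class Account:
    failures: int = 0
    locked_at: Optional[float] = None  # time the lockout started, if any


def is_locked(acct: Account, now: float) -> bool:
    """Return True while the lockout window is still open."""
    if acct.locked_at is None:
        return False
    if now - acct.locked_at >= UNLOCK_SECONDS:
        # Lockout expired: clear the state so the user can try again.
        acct.locked_at = None
        acct.failures = 0
        return False
    return True


def attempt_login(acct: Account, password_ok: bool, now: float) -> bool:
    """Process one login attempt; return True only on success."""
    if is_locked(acct, now):
        return False  # even a correct password is refused while locked
    if password_ok:
        acct.failures = 0  # assumption: a success resets the counter
        return True
    acct.failures += 1
    if acct.failures >= MAX_FAILURES:
        acct.locked_at = now
    return False
```

In this model a user who is locked out at time t is refused until t + 600 seconds, even with the correct password, which matches what users see in practice when they report that a known-good password "stopped working".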

Iptables is enabled.

Node information

One can look up the ages of the processors, along with more technical information, at Wikipedia, which itself links to Intel's information for each processor:

 

| Node | Motherboard version | Processor | Cores | Hyperthreading on | Memory | Hard Drive | Comments |
|---|---|---|---|---|---|---|---|
| headnode | Supermicro X8DTT 2.1c | Dual E5645 | 12 | N | 24GB | SW Raid 1 - (2) 2TB, no backintime | lsh |
| bw001 | Supermicro X8DTT 2.1c | Dual E5645 | 12 | N | 48GB | SW Raid 0 - 2*6TB | |
| bw002 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
| bw003 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
| bw004 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
| bw005 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
| bw006 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
| bw007 | Supermicro X9DRT-F 3.2 | Dual E5-2620v2 | 12 | N | 64GB | SW Raid 0 - 2*6TB | |
| rl001 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 1TB | |
| rl002 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 320GB | |
| rl003 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 320GB | |
| rl004 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 320GB | |
| rl005 | ? | ? | ? | ? | ? | ? | former loring headnode - currently off |
| dbc001 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | (1) 64GB Patriot SSD | |
| dbc002 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 8GB? | 1TB | |
| dbc003 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | 1TB | |
| dbc004 | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | N | 24GB | (3) 4TB | |
| dbc005 | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | N | 24GB | (3) 2TB | |
| dbc006 | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | N | 24GB | (2) 2TB | |
| dbc007 | Supermicro X9DRT-F v3.2 | Dual E5-2630 | 12 | N | 32GB | (1) 3TB | |
| dbc008 | Supermicro X9DRT-F v3.2 | Dual E5-2630 | 12 | N | 32GB | (1) 3TB | |
| dbc009 | Asus DSBF-DE v1006 | Dual E5420 | 8 | N | 16GB | ? (former SWRAID1 (2) 1TB, (1) 160GB backintime) | former collum headnode - currently off |
| hda001 | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 12GB | 64GB SSD | Moved to CAC 1/31/2017 |
| hda002 | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 16GB | 64GB SSD | Moved to CAC 1/31/2017 |
| hda003 | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 16GB | 64GB SSD | Moved to CAC 1/31/2017 |
| hda004 | Asus Z8PE-D12 v1003 | Dual Xeon E5520 | 8 | Y | 24GB | 64GB SSD | Moved to CAC 1/31/2017 |
| hda005 | Asus Z8PE-D12 v1401 | Dual Xeon E5520 | 8 | Y | 24GB | 250GB HD | |
| hda006 | Asus Z8PE-D12 v1401 | Dual Xeon E5520 | 8 | Y | 24GB | 250GB HD | |
| hda007 | Asus Z8PE-D12 v1003 | Dual Xeon E5645 | 12 | Y | 24GB | 1TB HD | |
| hda008 | Asus Z8PE-D12 v1401 | Dual Xeon E5645 | 12 | Y | 24GB | 320GB HD | |
| hda009 | Asus Z8PE-D12 v1401 | Dual Xeon E5645 | 12 | Y | 24GB | 320GB HD | |
| "hda010" | Asus DSBF-DE v1006 | Dual Xeon E5420 | 8 | N | 8GB | dual 1TB soft raid + 1TB backup drive | former abruna headnode - Moved to CAC 1/31/2017 |

Software on this cluster

| Applications | Version | Notes, for CAC |
|---|---|---|
| quantum espresso | 6.0 (serial and mpi version) | "easy" |
| gaussian | g09-A.02 & g09-D.01 | Lulu to install, please. She compiled using PGI (PGI's use is via ChemIT's license server) |
| gromacs | gromacs-5.0.5 (with/without enable-mpi & GMX_DOUBLE=on/off) | "easy" |
| intel compiler | 2014 (used for compiling gromacs, openmpi and mpich2) | ChemIT license server: Needs IP of cluster |
| mathematica | 10.3.0 | ChemIT license server: Needs IP of cluster |
| mopac | 2009 | "easy" |
| mpich | 1.2.7.p1 | "easy" |
| mpich2 | 1.4.1p1 | "easy" |
| openmpi | 1.6.5 | "easy" |
| webmo | 17.0.010e | Difficult/complex: Contact Lulu first |


Maintenance records

Collum cluster previous maintenance records

Abruna cluster previous maintenance records

3/4/16 - noting that the DSBF-DE motherboard has a v1007 update, but not worth the trouble for the age of the machines. X9DRT-F has a 3.2 upgrade from 3.0, can do when nodes free. - meh26

3/10/16-3/11/16 - updated X9DRT-F motherboards from 3.0 to 3.2 during downtime due to headnode drive replacement resyncing. - meh26

3/16/16 - updated IPMI card firmware from 2.0 to 3.0 - meh26

8/2/2016: Lulu: No firmware update from Michael. There are no security updates available via YUM. One hard drive (sda) is failing (smartctl found 30 errors). Forced fsck, no errors found. Modified /etc/ssh/sshd_config to keep idle ssh connections alive.

11/1/2016: Lulu: No firmware update from Michael. There are no security updates available via YUM. One hard drive (sda) is failing. Backed up the root partition with ddimage.

2/5/2017: Lulu: Collum and Abruna clusters were merged into the "CLAW" cluster in Nov 2016. There was a power outage on Saturday 2/4. No firmware update. Deleted all running or queued jobs. No errors on hard drives. Forced fsck.

5/2/2017: Lulu: No firmware update. Forced fsck.

8/1/2017: Lulu: No firmware update from Michael. No errors on hard drives. Forced fsck.

2/6/2018: Lulu: No firmware update from Michael. No errors on hard drives. Forced fsck. No yum security updates.

5/1/2018: Lulu: No firmware update from Michael. No errors on hard drives.

8/7/2018: Michael: The firmware on the Z8PE-D12 motherboard in hda004-hda009 and dbc004-dbc006 was strangely updated by Asus in the last few weeks to cover the Meltdown/Spectre vulnerabilities. He has done hda005, hda006, hda008, hda009, dbc004, dbc005, and dbc006. hda007 was not posting and I didn’t feel like fighting it – a problem for another day. hda004 was sent to CAC, so how that update gets done should be discussed.

8/7/2018: Lulu: No errors on hard drives. Forced fsck. No yum security updates.

11/6/2018: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates.

2/5/2019: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates. Forced fsck. hda005 didn't pass the memory test and then could not be powered on.

5/7/2019: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates. rl003 cannot be powered on (by power button). hda005 didn't pass the memory test and then could not be powered on.

8/6/2019: Lulu: Forced fsck. No firmware update from Michael. No errors on hard drives. No yum security updates. hda005 didn't pass the memory test. The bw001 and headnode power controls affect each other: shutting down (even from the command line) or powering on bw001 also shuts down or powers on the headnode. Michael changed the power strip, but that did not help.

11/5/2019: Lulu: No firmware update from Michael. No errors on hard drives. No yum security updates. hda005 didn't pass the memory test. The bw001 and headnode power issue still exists.

2/4/2020: Lulu: Forced fsck. No errors on hard drives. No yum security updates. hda005 didn't pass the memory test. The bw001 and headnode power issue still exists.
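The "No errors on hard drives" check recorded in the entries above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for these records: it assumes smartmontools is installed and that the script runs as root, and the function names and the default device /dev/sda are illustrative. It only reads the overall health line from smartctl -H; an error count like the one in the 8/2/2016 entry would need the drive's error log as well.

```python
import subprocess


def smart_health_ok(report: str) -> bool:
    """Return True if `smartctl -H` output reports a healthy drive."""
    for line in report.splitlines():
        # ATA/SATA drives report an overall self-assessment result.
        if "overall-health self-assessment test result" in line:
            return line.strip().endswith("PASSED")
        # SCSI/SAS drives report a health status line instead.
        if "SMART Health Status" in line:
            return line.strip().endswith("OK")
    return False  # no recognizable health line: treat as not OK


def check_drive(device: str = "/dev/sda") -> bool:
    """Run smartctl on one device (needs root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health_ok(result.stdout)
```

Keeping the parsing separate from the smartctl call means the parser can be checked against saved output without touching a real drive.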
