Some features of the Lancaster group's ORCA jobs (such as TDDFT) require more memory; for example, a job run with PAL8 needs 35304 MB. To avoid crashes, users need to reduce the number of parallel cores on nodes with less memory. See the table below for the memory size of each node. kml005, kml006, kml009, and kml010 have more memory than the other nodes.

 

| Node | Owner | Motherboard version | Processor | Cores | Hyperthreading on | Memory | Hard drive |
|---|---|---|---|---|---|---|---|
| headnode | Lancaster | Asus Z8PE-D12 v1401 | Dual E5620 | 8 | No | 24GB | Dual software RAID1 2TB, backintime 2TB |
| kml001 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | No | 24GB | Software raid0 (3) 640GB = 1.8TB |
| kml002 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | No | 24GB | Software raid0 (3) 640GB = 1.8TB |
| kml003 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | No | 24GB | Software raid0 (3) 640GB = 1.8TB |
| kml004 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | No | 20GB* | Software raid0 (3) 640GB = 1.8TB |
| kml005 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5620 | 8 | No | 96GB | Software raid0 (4) 640GB = 2.4TB |
| kml006 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5620 | 8 | No | 48GB | Software raid0 (4) 640GB = 2.4TB |
| kml007 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5620 | 8 | No | 24GB | Software raid0 (3) 640GB = 1.8TB |
| kml008 | Lancaster | Asus Z8PE-D12 v1401 | Dual E5520 | 8 | No | 24GB | Software raid0 (3) 640GB = 1.8TB |
| kml009 | Crane | Supermicro X9DRT-F v3.2 | Dual E5-2620v2 | 12 | No | 64GB | Single 2TB |
| kml010 | Crane | Supermicro X9DRT-F v3.2 | Dual E5-2620v2 | 12 | No | 64GB | Single 2TB |
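
The memory guidance at the top of this page can be turned into a rough per-node estimate. The sketch below is only an illustration and rests on assumptions that are not stated on this page: that memory use scales linearly with the number of parallel processes (extrapolated from the single PAL8 = 35304 MB data point), that the table's "GB" means 1024 MB, and that a hypothetical 2048 MB is held back for the operating system. The names `max_pal` and `reserve_mb` are made up for this example. Actual ORCA memory use depends on the job, so treat the result as a starting point.

```python
# Rough estimate of the largest PAL value a node can support for a
# memory-heavy ORCA job (e.g. TDDFT).
# ASSUMPTION: memory scales linearly with the number of parallel processes,
# extrapolated from the one data point on this page (PAL8 needs 35304 MB).

MB_PER_GB = 1024            # assumption: the table's "GB" means 1024 MB
REFERENCE_PAL = 8           # from this page
REFERENCE_MEMORY_MB = 35304 # from this page

# (memory in GB, cores) per compute node, copied from the table above.
NODES = {
    "kml001": (24, 8), "kml002": (24, 8), "kml003": (24, 8),
    "kml004": (20, 8), "kml005": (96, 8), "kml006": (48, 8),
    "kml007": (24, 8), "kml008": (24, 8),
    "kml009": (64, 12), "kml010": (64, 12),
}


def max_pal(memory_gb, cores, reserve_mb=2048):
    """Return the largest PAL that fits in memory, capped at the core count.

    reserve_mb is a hypothetical amount left for the operating system.
    """
    per_process_mb = REFERENCE_MEMORY_MB / REFERENCE_PAL  # 4413 MB per process
    usable_mb = memory_gb * MB_PER_GB - reserve_mb
    return max(0, min(cores, int(usable_mb // per_process_mb)))


if __name__ == "__main__":
    for node, (memory_gb, cores) in sorted(NODES.items()):
        print(f"{node}: {memory_gb} GB, {cores} cores -> PAL{max_pal(memory_gb, cores)}")
```

Under these assumptions the 24GB nodes come out at PAL5 and kml004 at PAL4, while kml005, kml006, kml009, and kml010 can use all of their cores.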

 

Maintenance records:

3/21/16: Updated the Supermicro X9DRT-F BIOS on kml009 and kml010 from version 3.0 to 3.2 - meh26

3/21/16 Lulu: No errors found on sda, sdb, sdc. Forced fsck; no security updates.

8/22/16 Lulu: No firmware update from Michael. No security updates available via YUM. Checked all hard drives; all are fine. Forced fsck, no errors found. Backed up the current system to /backup/sysBackup-082216.tgz. Also removed some system crash logs to free 3GB of system space.

11/16/16 Lulu: No firmware update from Michael. No security updates available via YUM. Checked all hard drives; all are fine. Forced fsck, no errors found. Backed up root partition by ddimage. /var/lib/mlocate/mlocate.db was too big (~5GB), so I modified /etc/updatedb.conf to ignore sbgrid and backup, then removed mlocate.db and rebuilt it with "updatedb". Lulu yum updated all Crane workstations; Michael updated Crane Synology DSM.

2/15/17 Lulu: No firmware update on cluster. No security updates available via YUM on cluster. Checked all hard drives; all are fine. Forced fsck on head node on 2/12 because of a power outage, no errors found. BIOS update on Crane workstations. yum updated all Crane workstations; Michael updated Crane Synology DSM. Michael updated RESE-01 (serving Crane NFS) to enable auto boot after power off and on.

5/17/17 Lulu: No firmware update on cluster. Forced fsck on head node. kml001 sda is failing; kml001 has been taken offline. BIOS update on Crane workstations. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations; Michael updated Crane Synology DSM. Michael patched the Hyper-V server where the NFS server lives.

8/30/17 Lulu: No firmware update on cluster. Forced fsck on head node. BIOS update on Crane workstations. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations; Michael updated Crane Synology DSM. Michael updated Windows on the Hyper-V server. The Nvidia card fan on as-chm-cran-14 was failing, so Oliver replaced the card. We also found that one monitor (for cran-13) has some issues.

2/21/18 Lulu: No firmware update on cluster. Forced fsck on head node. BIOS update on Crane workstations. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations; Michael updated Crane Synology DSM (the Synology took a really long time to boot; it may be related to one drive having sector errors). Michael updated Windows on the Hyper-V server.

6/5/18 Lulu: No firmware update on cluster or workstations. Forced fsck on head node. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations; Michael updated Crane Synology DSM. Michael updated Windows on the Hyper-V server.

8/15/18 from Michael: Crane workstations need a BIOS update from 1.8.1 to 1.9.2. NFS server aka AS-CHM-CRAN-19 move to a new hosting VM server. Cluster nodes kml001 through kml008 and the headnode need a BIOS update from v1201 to v1401.

8/15 Lulu: No security updates available via YUM on cluster. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations.

6/19/2019 Lulu: No security updates available via YUM on cluster. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations. Cleaned out some crash log messages from the Lancaster headnode (system crashed on 3/2/2019). I didn't update the Synology, didn't do the Windows update on the server, and didn't do firmware updates on the Linux workstations; we need to do these, and force fsck on the headnode, at the next maintenance. kml005 /dev/sdc had some file system issues; I fixed them and it is working now.

11/10/2019 Lulu: No security updates available via YUM on cluster. yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations. The yum update on the GPU workstation has an Nvidia driver compilation issue with the latest kernel, so I am still using the second-latest kernel. Forced fsck on Synology box.

02/19/2020 Lulu: yum updated Crane NFS server (CRAN-19). yum updated all Crane workstations except GPU; no firmware update on Linux workstations. Forced fsck on cluster headnode. kml006 is offline because of a drive issue. as-chm-cran-12 had an I/O error, but it is fine after I rebooted the system.

 
