99 problems

SSD replace in IO node
# ssh ion-1-9
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
lrwxrwxrwx 1 root root 2014 pci-.0-sas-0x0000-lun-0 -> ../../sdn
lrwxrwxrwx 1 root root 2014 pci-.0-sas-0x0000-lun-0 -> ../../sdo
lrwxrwxrwx 1 root root 2014 pci-.0-sas-0x0000-lun-0 -> ../../sdp
lrwxrwxrwx 1 root root 2014 pci-.0-sas-0x0000-lun-0 -> ../../sdq
lrwxrwxrwx 1 root root 8 11:04 pci-.0-sas-0x0000-lun-0 -> ../../sds
lrwxrwxrwx 1 root root 10 Mar 8 11:04 pci-.0-sas-0x0000-lun-0-part1 -> ../../sds1
lrwxrwxrwx 1 root root 10 Mar 8 11:04 pci-.0-sas-0x0000-lun-0-part2 -> ../../sds2
lrwxrwxrwx 1 root root 10 Mar 8 11:04 pci-.0-sas-0x0000-lun-0-part3 -> ../../sds3
lrwxrwxrwx 1 root root 10 Mar 8 11:04 pci-.0-sas-0x0000-lun-0-part4 -> ../../sds4
lrwxrwxrwx 1 root root 10 Mar 8 11:04 pci-.0-sas-0x0000-lun-0-part5 -> ../../sds5
lrwxrwxrwx 1 root root 10 Mar 8 11:04 pci-.0-sas-0x0000-lun-0-part6 -> ../../sds6
# fdisk /dev/sds
print the partition table, then remove the partitions
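A minimal sketch of the interactive fdisk dialog (partition numbers assume the six-partition sds layout listed above):

# fdisk /dev/sds
Command (m for help): p        <- print the partition table
Command (m for help): d        <- delete a partition
Partition number (1-6): 1      <- repeat d for partitions 2-6
Command (m for help): w        <- write the empty table and exit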
[root@gcn-17-68 rocksconfig.d]#
/sbin/iscsiadm -m discovery -t sendtargets -p 10.7.104.104 -v iser
iscsiadm: Could not stat /var/lib/iscsi/nodes//,3260,-1/default to delete node: No such file or directory
iscsiadm: Could not add/update [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.7.104.104,3260,1 sdr]
10.7.104.104:3260,1 sdr
--: remove the existing directory: /var/lib/iscsi/nodes/sdr/10.7.104.104,3260,1/default
Starting iscsi:
mkfs.xfs: cannot open /dev/sdb: Device or resource busy
# mdadm --stop /dev/md127
# /etc/rc.d/rocksconfig.d/post-96-iser
meta-data=/dev/sdb  agcount=4, agsize= blks
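When mkfs reports the device busy, it is usually a leftover md array holding it; a quick check (array name as in this example):

# cat /proc/mdstat           <- lists active md arrays and their member disks
# mdadm --detail /dev/md127  <- confirms whether /dev/sdb is a member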
ssh connection fail
Connection closed by remote host
SSH has a limit on the number of outstanding (unauthenticated) connections;
see the "MaxStartups" sshd_config variable. What you are seeing is what
happens when that limit gets hit.
We've noticed this problem crop up on HPC when a home NFS server gets
congested and the cluster gets hit by an SSH password scan. Each guess
triggers an automount of the victim user's home directory (to read the
authorized_keys file), which blocks because the NFS server is busy. Since
the guessing happens in parallel, a bunch of additional sshd processes also
get blocked and the MaxStartups limit gets hit.
We've also had the limit get hit by unusually intense scans.
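A hedged example of raising the limit in /etc/ssh/sshd_config (the numbers are illustrative, not the site's actual values); the OpenSSH syntax is start:rate:full, i.e. begin dropping 30% of new unauthenticated connections at 100 and drop all of them at 200:

MaxStartups 100:30:200

then reload: # service sshd reload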
lustre mount fails
Feb 10 09:42:00 gcn-16-11 kernel: LustreError: 15c-8: MGC172.25.32.253@tcp: The configuration from log 'monkey-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb 10 09:42:00 gcn-16-11 kernel: LustreError: 21399:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
Feb 10 09:42:00 gcn-16-11 kernel: Lustre: Unmounted monkey-client
check route on IO node:
172.25.32.0 netmask 255.255.254.0 gw 192.168.230.1
lustre mounts
192.168.95.245@tcp:/rhino /rhino lustre defaults,_netdev,flock,nodev,nosuid
172.25.33.252@tcp:172.25.33.124@tcp:/puma /oasis/scratch-trestles
172.25.32.253@tcp:172.25.32.125@tcp:/monkey /oasis/scratch lustre defaults,_netdev,flock,nodev,nosuid
172.25.33.53@tcp:172.25.33.25@tcp:/meerkat /oasis/projects/nsf lustre defaults,_netdev,flock,nodev,nosuid 0 0
172.25.32.53@tcp:172.25.32.25@tcp:/dolphin /oasis/tscc/scratch
198.202.105.9:/xwfs nfsvers=3,rw,nosuid,nodev,bg,soft,intr,tcp
192.168.16.6@tcp:192.168.24.6@tcp:/panda /oasis/projects/nsf
172.25.32.125@tcp:172.25.32.253@tcp:/monkey /oasis/scratch
192.168.95.134@tcp:192.168.95.135@tcp:/seahorse /oasis/tscc/scratch
192.168.0.6@tcp:192.168.8.6@tcp:/wombat /oasis/comet/scratch
(or something similar)
MGS/MDS/OSS
oasis-panda.sdsc.edu is 192.168.111.17
oasis-wombat.sdsc.edu is 192.168.111.16
set up /etc/sysconfig/static-routes
any net 172.25.32.0 netmask 255.255.254.0 gw 192.168.230.1
any host 192.168.110.100 gw 10.5.1.1
any net 192.168.8.0 netmask 255.255.248.0 gw 192.168.230.1
any host 192.168.110.102 gw 10.5.1.1
any net 192.168.16.0 netmask 255.255.248.0 gw 192.168.230.1
any net 192.168.0.0 netmask 255.255.248.0 gw 192.168.230.1
any net 192.168.24.0 netmask 255.255.248.0 gw 192.168.230.1
any net 224.0.0.0 netmask 255.255.255.0 dev eth0
any host 255.255.255.255 dev eth0
for rack1 rack10 rack12 login data-mover
rocks run host rack1 command=" route add -net 192.168.0.0 netmask 255.255.248.0 gateway 192.168.230.1"
rocks run host rack1 command=" route add -net 192.168.8.0 netmask 255.255.248.0 gateway 192.168.230.1"
rocks run host rack1 command=" route add -net 192.168.16.0 netmask 255.255.248.0 gateway 192.168.230.1"
rocks run host rack1 command=" route add -net 192.168.24.0 netmask 255.255.248.0 gateway 192.168.230.1"
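To verify the routes took effect, the same rocks run host pattern can be used (a sketch):

rocks run host rack1 command="route -n | grep 192.168.230.1" collate=true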
MOM time out
J2325239.gordon-fe2.unable to run job, send to MOM '' failed
check for rejecting nodes:
cd /var/spool/torque/server_logs; grep "to run job, send to MOM" 201511* | awk '{ print $9 }' | uniq
= 10.5.103.88
= 10.5.100.253
job not started
10/01/:12;0008; pbs_mom.40783;Jresend_waiting_Successfully re-sent join job request to gcn-14-71
10/01/:25;0002; pbs_mom.40783;Spbs_Torque Mom Version = 4.2.6.1, loglevel = 0
10/01/:40;0001; pbs_mom.40783;Jexamine_all_running_job 2240913.gordon-fe2.local already examined. substate=40
10/01/:40;0001; pbs_mom.40783;Jexec_bailing on job 2240913.gordon-fe2.local code -4
10/01/:40;0008; pbs_mom.40783;Rsend_sending ABORT to sisters for job 2240913.gordon-fe2.local
10/01/:06;0001; pbs_mom.40783;Jjob_0: gcn-17-36/0
10/01/:06;0001; pbs_mom.40783;Jjob_1: gcn-17-36/1
10/01/:06;0001; pbs_mom.40783;Jjob_2: gcn-17-36/2
10/01/:06;0001; pbs_mom.40783;Jjob_3: gcn-17-36/3
10/01/:06;0001; pbs_mom.40783;Jjob_4: gcn-17-36/4
10/01/:06;0001; pbs_mom.40783;Jjob_5: gcn-17-36/5
10/01/:06;0001; pbs_mom.40783;Jjob_6: gcn-17-36/6
10/01/:06;0001; pbs_mom.40783;Jjob_7: gcn-17-36/7
10/01/:06;0001; pbs_mom.40783;Jjob_8: gcn-17-36/8
10/01/:06;0001; pbs_mom.40783;Jjob_9: gcn-17-36/9
10/01/:06;0001; pbs_mom.40783;Jjob_10: gcn-17-36/10
10/01/:06;0001; pbs_mom.40783;Jjob_11: gcn-17-36/11
10/01/:06;0001; pbs_mom.40783;Jjob_12: gcn-17-36/12
10/01/:06;0001; pbs_mom.40783;Jjob_13: gcn-17-36/13
10/01/:06;0001; pbs_mom.40783;Jjob_14: gcn-17-36/14
10/01/:06;0001; pbs_mom.40783;Jjob_15: gcn-17-36/15
10/01/:06;0001; pbs_mom.40783;Jjob_16: gcn-18-26/0
10/01/:06;0001; pbs_mom.40783;Jjob_17: gcn-18-26/1
10/01/:06;0001; pbs_mom.40783;Jjob_18: gcn-18-26/2
10/01/:06;0001; pbs_mom.40783;Jjob_19: gcn-18-26/3
10/01/:06;0001; pbs_mom.40783;Jjob_20: gcn-18-26/4
10/01/:06;0001; pbs_mom.40783;Jjob_21: gcn-18-26/5
10/01/:06;0001; pbs_mom.40783;Jjob_22: gcn-18-26/6
10/01/:06;0001; pbs_mom.40783;Jjob_23: gcn-18-26/7
10/01/:06;0001; pbs_mom.40783;Jjob_24: gcn-18-26/8
10/01/:06;0001; pbs_mom.40783;Jjob_25: gcn-18-26/9
10/01/:06;0001; pbs_mom.40783;Jjob_26: gcn-18-26/10
10/01/:06;0001; pbs_mom.40783;Jjob_27: gcn-18-26/11
10/01/:06;0001; pbs_mom.40783;Jjob_28: gcn-18-26/12
10/01/:06;0001; pbs_mom.40783;Jjob_29: gcn-18-26/13
10/01/:06;0001; pbs_mom.40783;Jjob_30: gcn-18-26/14
10/01/:06;0001; pbs_mom.40783;Jjob_31: gcn-18-26/15
10/01/:06;0001; pbs_mom.40783;Jjob_job: 2240973.gordon-fe2.local numnodes=2 numvnod=32
10/01/:06;0008; pbs_mom.43.gordon-fe2.req_commit:job execution started
10/01/:55;0100; pbs_mom.40783;R;Type StatusJob request received from PBS_Server@gordon-fe2, sock=9
10/01/:55;0008; pbs_mom.40783;Jmom_process_request type StatusJob from host gordon-fe2 received
10/01/:55;0008; pbs_mom.40783;Jmom_process_request type StatusJob from host gordon-fe2 allowed
10/01/:55;0008; pbs_mom.40783;Jmom_dispatch_dispatching request StatusJob on sd=9
10/01/:05;0080; pbs_mom.40783;Smom_get_proc_array load started
10/01/:05;0002; pbs_mom.40783;Sget_cpuset_/dev/cpuset/torque contains 0 PIDs
10/01/:05;0080; pbs_mom.40783;n/a;mom_get_proc_array loaded - nproc=0
10/01/:05;0080; pbs_mom.40783;n/a;cput_proc_array loop start - jobid = 2240973.gordon-fe2.local
10/01/:05;0080; pbs_mom.40783;n/a;mem_proc_array loop start - jobid = 2240973.gordon-fe2.local
10/01/:05;0080; pbs_mom.40783;n/a;resi_proc_array loop start - jobid = 2240973.gordon-fe2.local
10/01/:40;0001; pbs_mom.40783;Jexec_bailing on job 2240973.gordon-fe2.local code -4
10/01/:40;0008; pbs_mom.40783;Rsend_sending command ABORT_JOB for job 2240973.gordon-fe2.local (10)
10/01/:40;0008; pbs_mom.40783;Rsend_sending ABORT to sisters for job 2240973.gordon-fe2.local
10/01/:40;0080; pbs_mom.40783;Sscan_for_searching for exiting jobs
10/01/:40;0080; pbs_mom.40783;Sscan_for_working on a job
10/01/:40;0008; pbs_mom.40783;Jkill_scan_for_exiting: sending signal 9, "KILL" to job 2240973.gordon-fe2.local, reason: local task termination detected
10/01/:40;0008; pbs_mom.43.gordon-fe2.kill_job done (killed 0 processes)
10/01/:06;0001; pbs_mom.3005;Spbs_LOG_ERROR::No child processes (10) in tcp_request, bad connect from 10.5.103.64:939
check tcp connection:
nmap -p … 10.5.103.64 -oG -
Starting Nmap 5.51 ( http://nmap.org ) at
Nmap scan report for 10.5.103.64
Host is up (0.00034s latency).
15002/tcp filtered unknown
MAC Address: 00:1E:67:29:53:3A (Intel Corporate)
Nmap done: 1 IP address (1 host up) scanned in 0.16 seconds
check mom configuration:
# momctl -d3
Host: gcn-17-37/gcn-17-37.local
Version: 4.2.6.1
Server[0]: gordon-fe2 (10.5.1.2:15001)
Last Msg From Server:   799 seconds (CLUSTER_ADDRS)
no messages sent to server
HomeDirectory:          /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (4862189 blocks available)
syslog enabled
Node Health Check Script: /opt/sdsc/sbin/health-check.sh (600 second update interval)
MOM active:             4825333 seconds
Check Poll Time:        45 seconds
Server Update Interval: 120 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:
MemLocked:
TCP Timeout:            120 seconds
Prolog:                 /var/spool/torque/mom_priv/prologue (enabled)
Parallel Prolog:        /var/spool/torque/mom_priv/prologue.parallel (enabled)
Prolog Alarm Time:      300 seconds
Alarm Time:             0 of 10 seconds
Trusted Client List:    10.5.1.2:0,10.5.100.10:.100.11:.100.12:15003,10......
10.5.103.65 is not in the hierarchy file!!!
On restart, pbs_server sends out a new hierarchy file:
# service pbs_server restart
qsub with catalina jobfilter
qsub -I -l nodes=gcn-17-36:ppn=16:native+gcn-14-71:ppn=16:native -q normal -l walltime=1:00:00
nfs client out of sync
% /cvmfs/oasis.opensciencegrid.org
% /cvmfs/cms.cern.ch
/cvmfs/cms.cern.ch \
  10.7.0.0/16(ro,async,fsid=57d10b10-1780-40ef-ac73-bd) \
  10.5.104.118(ro,async,fsid=57d10b10-1780-40ef-ac73-bd) \
  10.5.104.119(ro,async,fsid=57d10b10-1780-40ef-ac73-bd) \
  10.5.104.120(ro,async,fsid=57d10b10-1780-40ef-ac73-bd) \
  10.5.104.121(ro,async,fsid=57d10b10-1780-40ef-ac73-bd)
/cvmfs/oasis.opensciencegrid.org \
  10.7.0.0/16(ro,async,fsid=9d-458f-bd99-95cfb9d54c0a) \
  10.5.104.118(ro,async,fsid=9d-458f-bd99-95cfb9d54c0a) \
  10.5.104.119(ro,async,fsid=9d-458f-bd99-95cfb9d54c0a) \
  10.5.104.120(ro,async,fsid=9d-458f-bd99-95cfb9d54c0a) \
  10.5.104.121(ro,async,fsid=9d-458f-bd99-95cfb9d54c0a)
mounted ro over ib1 with Lustre
restart nfsd on ion-21-6
local disk full
ERROR local disk full
remove rocks install images: /install/rocks-dist/
cd /install
/bin/rm -r rocks-dist/
clean up per rack:
for i in $(rocks run host rack4 command="df /" collate=true | grep "96%" | awk -F ":" '{ print $1 }'); do echo $i; ssh $i "/bin/rm -r /install/rocks-dist/"; done
scratch not mounted
Sep 1 11:48:15 gcn-2-66 iscsid: conn 0 login rejected: initiator error - target not found (02/03)
Sep 1 11:48:15 gcn-2-66 iscsid: Connection1:0 to [target: sdp, portal: 10.7.104.56,3260] through [iface: default] is shutdown.
Sep 1 11:48:33 gcn-2-66 kernel: scsi8 : iSCSI Initiator over iSER
Sep 1 11:48:33 gcn-2-66 iscsid: Could not set session2 priority. READ/WRITE throughout and latency could be affected
Sep 1 11:48:33 gcn-2-66 kernel: scsi 8:0:0:0: RAID Controller 0001 PQ: 0 ANSI: 5
Sep 1 11:48:33 gcn-2-66 kernel: scsi 8:0:0:0: Attached scsi generic sg1 type 12
Sep 1 11:48:33 gcn-2-66 kernel: scsi 8:0:0:1: Direct-Access VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: Attached scsi generic sg2 type 0
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] 512-byte logical blocks: (300 GB/279 GiB)
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Write Protect is off
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 1 11:48:33 gcn-2-66 kernel: sdb:
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Unhandled sense code
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Sense Key : Medium Error [current]
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Add. Sense: Unrecovered read error
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Sep 1 11:48:33 gcn-2-66 kernel: Buffer I/O error on device sdb, logical block 0
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Unhandled sense code
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Sense Key : Medium Error [current]
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Add. Sense: Unrecovered read error
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Sep 1 11:48:33 gcn-2-66 kernel: Buffer I/O error on device sdb, logical block 0
Sep 1 11:48:33 gcn-2-66 kernel: sd 8:0:0:1: [sdb] Unhandled sense code
tgt configuration change!
[root@ion-21-8 ~]# lsscsi
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
system disks:
INTEL SSDSA2CW08 4PC1
INTEL SSDSA2CW08 4PC1
switch off turbo
[root@gcn-2-71 ~]# ~hocks/VSMP/BIOS/syscfg /d biossettings "Intel(R) Turbo Boost Technology"
Intel(R) Turbo Boost Technology
===============================
Current Value : Enabled
-----------------------
Possible Values
---------------
0 : Disabled
1 : Enabled
[root@gcn-2-71 configs]# ~hocks/VSMP/BIOS/syscfg /d biossettings "Enhanced Intel SpeedStep(R) Tech"
Enhanced Intel SpeedStep(R) Tech
================================
Current Value : Enabled
-----------------------
Possible Values
---------------
0 : Disabled
1 : Enabled
~hocks/VSMP/BIOS/syscfg /bcs "Enhanced Intel SpeedStep(R) Tech" 0
To switch off turbo:
[root@gcn-2-83 ~]# ~hocks/VSMP/BIOS/syscfg /bcs "" "Enhanced Intel SpeedStep(R) Tech" 0
Successfully Completed
[root@gcn-2-83 ~]# ~hocks/VSMP/BIOS/syscfg /d biossettings "Intel(R) Turbo Boost Technology"
BIOS Variable 'Intel(R) Turbo Boost Technology' not found...
[root@gcn-2-83 ~]# ~hocks/VSMP/BIOS/syscfg /d biossettings "Enhanced Intel SpeedStep(R) Tech"
Enhanced Intel SpeedStep(R) Tech
================================
Current Value : Disabled
------------------------
Possible Values
---------------
0 : Disabled
1 : Enabled
Restore all of BIOS file:
[root@gcn-2-71 configs]# /share/apps/syscfg /b /r bios_conf2.INI
Restoring file bios_conf2.INI in progress...
Successfully Completed
[root@gcn-2-71 configs]# ~hocks/VSMP/BIOS/syscfg /d biossettings "Intel(R) Turbo Boost Technology"
BIOS Variable 'Intel(R) Turbo Boost Technology' not found...
[root@gcn-2-71 configs]# ~hocks/VSMP/BIOS/syscfg /d biossettings "Enhanced Intel SpeedStep(R) Tech"
Enhanced Intel SpeedStep(R) Tech
================================
Current Value : Disabled
------------------------
Possible Values
---------------
0 : Disabled
1 : Enabled
create BIOS file:
/share/apps/syscfg /b /s BIOS.76.ini
catalina reservation for 16 specific nodes
/state/partition1/catalina/bin/set_res --start=12:00_08/06/2015 --end=23:59_09/30/2015 \
--resource_amount=16 --user_list=boj --node_restriction='if input_tuple[0]["Machine"] in ["gcn-17-31", "gcn-17-32", \
"gcn-17-33", "gcn-17-34", "gcn-17-35", "gcn-17-36", "gcn-17-37", "gcn-17-38", "gcn-17-41", "gcn-17-42", \
"gcn-17-43", "gcn-17-44", "gcn-17-45", "gcn-17-46", "gcn-17-47", "gcn-17-48"]: result = 0' \
--mode=real
/var 100% full
huge authpriv file:
# rm /var/log/authpriv
# touch /var/log/authpriv
# service rsyslog restart
add authpriv to logrotate:
/etc/logrotate.d/syslog:
/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
/var/log/authpriv
{
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}
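To test the stanza without waiting for the nightly cron run (standard logrotate flags):

# logrotate -d /etc/logrotate.d/syslog   <- dry run, prints what would happen
# logrotate -f /etc/logrotate.d/syslog   <- force an immediate rotation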
list missing memory DIMM
list missing slot with dmidecode.
# dmidecode -t memory
Memory Device
Array Handle: 0x003E
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: No Module Installed
Form Factor: DIMM
Locator: DIMM_A1
Bank Locator: NODE 0 CHANNEL 0 DIMM 0
Type: Unknown
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: Unknown
Part Number: NO DIMM
Rank: Unknown
Handle 0x0046, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x003E
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Locator: DIMM_B1
Bank Locator: NODE 0 CHANNEL 1 DIMM 0
Type: DDR3
Type Detail: Synchronous
Speed: 1333 MHz
Manufacturer: Samsung
Serial Number: 879B07F2
Asset Tag: Unknown
Part Number: M393B1K70CH0-CH9
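A quick one-liner to pair each slot with its size, so empty slots ("Size: No Module Installed", as in the first block above) stand out:

# dmidecode -t memory | egrep "Size|Locator"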
ion-21-14 phony home mounts
mount/df shows 4250 home mounts
# service autofs stop
# cat /proc/mounts
# edit /etc/mtab and remove home mounts
# service autofs start
lustre quota /oasis/projects/nsf
The default allocation is 500 GB of Project Storage to be shared among all users of a project.
Projects that require more can request additional space via the XSEDE POPS system, in the form of a storage allocation request (for new allocations) or a supplement (for existing allocations).
# lfs quota -g csd181 /oasis/projects/nsf/
Disk quotas for group csd181 (gid 6912):
Filesystem
/oasis/projects/nsf/
partition error, not enough free space
ssh -p2200 gcn-13-14
-bash-4.1# lsscsi
INTEL SSDSA2CW08 4PC1
-bash-4.1# smartctl -i /dev/sda
smartctl 5.42 r3458 [x86_64-linux-2.6.32-279.14.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Intel 320 Series SSDs
Device Model:     INTEL SSDSA2CW080G3
Serial Number:
LU WWN Device Id: 5 65464d
Firmware Version: 4PC10302
User Capacity:    8,388,608 bytes [8.38 MB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Jun 18 12:04:09 2015 PDT
SMART support is: Unavailable - device lacks SMART capability.
# parted -l
Model: ATA INTEL SSDSA2CW08 (scsi)
Disk /dev/sda: 80.0GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
File system
linux-swap(v1)
home filesystem
home server May 2015:
/export/nfs-32-4/home/hocks
10.5.10.21   comet-nfs-32-1.local comet-nfs-32-1
10.5.10.22   comet-nfs-32-2.local comet-nfs-32-2
10.5.10.23   comet-nfs-32-3.local comet-nfs-32-3
10.5.10.24   comet-nfs-32-4.local comet-nfs-32-4
10.22.10.11  comet-nfs-32-1.local comet-nfs-32-1
10.22.10.12  comet-nfs-32-2.local comet-nfs-32-2
10.22.10.13  comet-nfs-32-3.local comet-nfs-32-3
10.22.10.14  comet-nfs-32-4.local comet-nfs-32-4
autofs server IP change
after changes to the NFS server IP in any auto.* table, restart with
service autofs force-reload
ipmitool: Invalid user name
# ipmitool user set name 6 gordon
# ipmitool user set password 6 gordonadmin
# ipmitool channel setaccess 1 6 callin=on ipmi=on link=on privilege=4
# ipmitool sol payload enable 1 6
# ipmitool lan print 1
Set Session Privilege Level to ADMINISTRATOR failed
# ipmitool user enable 6
# ipmitool user list 1
Trestles IPMI
on rzri6e:
# ssh -Y -C -L 5988:localhost:5988 trestles-fe2
on trestles-fe2 # vncserver :88 -localhost
on rzri6e:
# vncviewer localhost:88
on trestles-fe2 # firefox
and go to node ipmi : trestles-2-30-ipmi.ipmi
click "Remote Control" in top bar
to allow secure access:
on trestles-fe2: /usr/java/latest/bin/ControlPanel &
and set security to Medium
rocks run roll fails
# rocks run roll slurm | sh
Traceback (most recent call last):
File "/opt/rocks/bin/rocks", line 260, in
command.runWrapper(name, args[i:])
sh: line 1: XML: command not found
xml.sax._exceptions.SAXParseException: :98:24: duplicate attribute
creating the host kickstart shows:
# rocks list host profile hpcdev-005
XML parse error in file ./nodes/slurm-server.xml on line 3
xml.sax._exceptions.SAXParseException: :98:24: duplicate attribute
Starting iscsi: iscsiadm: Could not login to [iface: default, target: sde, portal: 10.7.104.115,3260].
iscsiadm: initiator reported error (19 - encountered non-retryable iSCSI login failure)
iscsiadm: Could not log into all portals
/sbin/iscsiadm -m discovery -t sendtargets -p 10.7.104.115 -v iser
/sbin/iscsiadm -m node -T sdd -p 10.7.104.115 --op update -n node.transport_name -v iser
/sbin/iscsiadm -m node --login
mount /scratch
kipmi0 100% CPU
echo 100 > /sys/module/ipmi_si/parameters/kipmid_max_busy_us
limits how long kipmid may busy-wait, which brings kipmi0 CPU usage down
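To make the setting persist across reboots, a sketch (kipmid_max_busy_us is the ipmi_si module parameter set above; the conf file name is arbitrary):

# /etc/modprobe.d/ipmi.conf
options ipmi_si kipmid_max_busy_us=100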
# rpcinfo -p
portmapper has been replaced by rpcbind
rpcinfo: can't contact portmapper: RPC: Remote system error - No such file or directory
[root@gcn-19-41 init.d]# service rpcbind start
Starting rpcbind:
[root@gcn-19-41 init.d]# rpcinfo -p
program vers proto
portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
ibnet0 no tcp ping
PING ion-21-12.ibnet0 (10.6.104.113) 56(84) bytes of data.
From ion-21-10.ibnet0 (10.6.104.111) icmp_seq=2 Destination Host Unreachable
CA 'mlx4_0'
Physical state: LinkUp
CA 'mlx4_1'
Physical state: LinkUp
--> restart subnet manager on gordon-fe3
# /etc/init.d/opensm restart
# /etc/init.d/opensm status
opensm (pid 97445) is running...
/usr/sbin/opensm-ibnet0 --daemon -F /etc/opensm/opensmd-ibnet0.conf -f /var/log/opensm-ibnet0.log
ssh_exchange_identification: Connection closed by remote host
sshd - kerberos access blocked after 10 connections
sshd socket in Close_Wait
[hocks@ion-21-13 ~]$ qsub -I -l nodes=gcn-20-22
1 99 09:18 ?  07:39:44 sshd: root [pam]
1 99 09:18 ?  07:39:26 sshd: root [pam]
1 99 09:19 ?  07:38:48 sshd: root [pam]
1 99 09:19 ?  07:38:28 sshd: root [pam]
1 99 09:19 ?  07:38:09 sshd: root [pam]
1 99 09:19 ?  07:38:09 sshd: root [pam]
1 99 09:19 ?  07:37:56 sshd: root [pam]
1 99 09:20 ?  07:37:12 sshd: root [pam]
1 99 09:21 ?  07:36:25 sshd: root [pam]
1 99 09:28 ?  07:29:41 sshd: root [pam]
[root@gcn-20-22 ~]# killall -9 sshd
[root@gcn-20-22 ~]# service sshd start
After daemon start:
[hocks@gcn-20-12 ~]$ ps -ef |grep ssh
00:00:00 /usr/sbin/sshd
0 10:42 pts/0  00:00:00 grep ssh
[hocks@gcn-20-12 ~]$
qsub: job 2240.ion-21-13.local completed
job exit status 265 handled
11: Invalid memory reference (segmentation fault)
pbs_mom for this job gets killed!
export DAEMON_COREFILE_LIMIT=-1
Starting TORQUE Mom:
[root@gcn-20-12 ~]# Connection to gcn-20-12 closed by remote host.
Connection to gcn-20-12 closed.
[hocks@ion-21-13 server_logs]$ ssh gcn-20-12
ssh: connect to host gcn-20-12 port 22: Connection refused
--- reboot
lustre LBUG 4403
Same LBUG as this one; affected version 2.4.1, fixed in 2.5.2 and 2.6.0.
https://jira./browse/LU-4403
Sep 16 19:18:38 meerkat-mds-10-1 kernel: LustreError:30176:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed:
Sep 16 19:18:38 meerkat-mds-10-1 kernel: LustreError: 30176:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) LBUG
lustre bug LU-5314
The MDS hit a Lustre bug; it looks like the crash was very sudden.
fixed in 2.4.4
mdadm --create /dev/md0 --force --level 5 --size=1024 --raid-devices 16 $SSD
#mkfs.xfs -f /dev/md0
agsize (3840 blocks) too small, need at least 4096 blocks
---: remove --size from mdadm command for raid5
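The corrected command per the fix note above (same $SSD device list as in the original):

mdadm --create /dev/md0 --force --level=5 --raid-devices=16 $SSD
mkfs.xfs -f /dev/md0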
kernel bug CVE-
An unprivileged user can gain kernel privileges.
Kernels 2.6.32 and above are affected: /errata/RHSA-.html
fixed in kernel-2.6.32-431.20.3.el6
catalina reservation, exclude offline nodes
set_res --start=10:00_06/21/2014 --end=10:00_07/21/2014 --user_list=mahidhar,tcooper \
--mode=real --resource_amount=4 --feature_list=flash \
--node_restriction="if 'flash' in input_tuple[0]['properties_list'] and input_tuple[0]['State'] in ['Idle','Running'] : result = 0"
cannot run local prolog '/var/spool/torque/mom_priv/prologue.parallel'
Lustre client bug, LU-4308
trestles-10-21.sdsc.edu: kernel: Lustre: 17211:0:(vvp_io.c:699:vvp_io_fault_start())
binary[0xx] changed while waiting for the page fault lock
https://jira./browse/LU-4308
Trestles bridge
bridging IB to eth for oasis1 router
inactive Bridge: trestles-br1
active Bridge: trestles-br2
Jun 18 18:41:20 trestles-br2 bxmd[3119]: WARN A3 : Received duplicate login from vNIC 0x090:0c:c9:10:17:a3, dropping.
restart network on nodes with bond down
ifcfg-bond0
DEVICE=bond0
IPADDR=198.202.118.220
NETMASK=255.255.254.0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=1 primary=eth2 miimon=100"
cat ifcfg-eth2
DEVICE=eth2
HWADDR=00:0C:C9:08:03:A2
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
ifcfg-eth3
DEVICE=eth3
HWADDR=90:0C:C9:08:03:A2
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
any host 192.168.110.76 gw 10.1.1.1
any host trestles-fe1 gw 0.0.0.0
any net 224.0.0.0 netmask 255.255.255.0 dev eth0
any host 255.255.255.255 dev eth0
Lustre: options lnet networks=tcp(eth2)
172.25.33.252@tcp:172.25.33.124@tcp:/puma  83% /oasis/scratch/trestles
172.25.33.53@tcp:172.25.33.25@tcp:/meerkat  41% /oasis/projects/nsf
iptables blacklist
/opt/sdscsec/sbin/iptables-blacklist.sh
blacklist:
/etc/sdsc/ip-blacklist
220.177.198.87
91.209.108.28
116.10.191.228
61.174.51.217
60.211.213.66
117.79.91.234
198.74.103.2
health check
The health check script should write only ERROR messages to stdout.
If any command writes other output to stdout:
07/21/:56;0002; pbs_mom.4220;check_Node health script ran and says the node is healthy
# pbsnodes gcn-18-57
state = down,job-exclusive
properties = native,noflash
ntype = cluster
status = rectime=,varattr=,jobs=136493...
...,message=ERROR oasis scratch problems,.....
[root@gcn-18-57 ~]# momctl -q clearmsg
localhost:
clearmsg = 'messages cleared'
wait for the next server update!!!!
06/04/:10;0002; pbs_mom.16900;check_Setting node to down. The node health script output the
06/04/:07;0002; pbs_mom.16900;check_Node health script ran and says the node is healthy with
06/04/:25;0002; pbs_mom.16900;Spbs_Torque Mom Version = 4.2.6.1, loglevel = 0
06/04/:07;0002; pbs_mom.16900;check_Node health script ran and says the node is healthy with
06/04/:24;0002; pbs_mom.16900;n/a;clear_rm_messages cleared
06/04/:07;0002; pbs_mom.16900;check_Node health script ran and says the node is healthy with
torque 4.2.6: script timeout does not mark the node down
message=ERROR: prolog/epilog failed,file: /var/spool/torque/mom_priv/prologue.parallel,exit: 255,nonzero p/e exit status
(fixed in torque 4.2.5??)
/torque/Content/topics/12-appendices/prologueAndEpilogueScriptsTimeOut.htm
TORQUE takes preventative measures against prologue and epilogue scripts
by placing an alarm around the scripts execution. By default, TORQUE
sets the alarm to go off after 5 minutes of execution. If the script
exceeds this time, it will be terminated and the node will be marked
down. ....
ost inactive
# lfs df -h | grep inac
: inactive device
# lctl dl -t | grep OST002c
47 UP osc monkey-OST002c-osc-ffff0 98e1d6e3-d382-66d1-f517-49aac 172.25.33.240@tcp
# echo 1 > /proc/fs/lustre/osc/monkey-OST002c-osc-ffff0/active
fsck disk on OST002c
remount lustre
dmesg time stamp
set time stamp in CentOS: echo Y > /sys/module/printk/parameters/time
Display the time stamp as real time:
# cat dmesg_realtime.sh
#!/bin/bash
# convert a dmesg uptime-relative timestamp ($1, seconds since boot) to wall-clock time
ut=`cut -d' ' -f1 < /proc/uptime`    # seconds since boot
ts=`date +%s`                        # current time, epoch seconds
realtime_date=`date -d"70-1-1 + $ts sec - $ut sec + $1 sec" +"%F %T"`
echo $realtime_date
dmesg_realtime.sh 461
pbs_server crash
Apr 23 21:29:39 gordon-fe2 kernel: pbs_server[10961]: segfault at 3bdf4ea510 ip 76205 sp 2264d0 error 4 in libc-2.12.so[3bda000]
lustre kernel panic bug LU-4558
LU-4558 clio: Solve a race in cl_lock_put
lustre/obdclass/cl_lock.c
It's not atomic to check the last reference and state of cl_lock
in cl_lock_put(). This can cause a problem that an using lock is
freed, if the process is preempted between atomic_dec_and_test()
and (lock->cll_state == CLS_FREEING).
This problem can be solved by holding a refcount by coh_locks. In
this case, it can be sure that if the lock refcount reaches zero,
nobody else can have any chance to use it again.
steps to reproduce the lustre client build
autofs gordon-fe2
set up logging:
/etc/sysconfig/autofs: LOGGING="debug"
# pidstat -p 17628
Linux 2.6.32-358.6.2.el6.x86_64 (gordon-fe2.sdsc.edu)   04/04/2014
01:19:59 PM   %usr %system
01:19:59 PM   pbs_server
# ps -eLf|grep
18 Mar24 ?  00:03:52 /opt/torque/sbin/pbs_server -H gordon-fe2.local -d /var/spool/torque
Apr 4 13:20:38 gordon-fe2 automount[745]: handle_packet_missing_indirect: token 49440, name andreim, request pid 17738
04/04/:38;0008;PBS_Server.17738;Jreq_job_id: 1260949.gordon-fe2.local
Output_Path = gordon-ln1.sdsc.edu:/home/andreim/
Apr 4 13:21:23 gordon-fe2 automount[745]: handle_packet_missing_indirect: token 49442, name ps-ngbt, request pid 17659
04/04/:23;0008;PBS_Server.17659;Jreq_job_id: 1260950.gordon-fe2.local
Output_Path = gordon-ln2.sdsc.edu:/projects/ps-ngbt
login/submit nodes do not have a mom_priv/config file !
Torque Data Management:
enable MOM to copy job files local
$usecp *:/home /home
SERVERMODE (NORMAL)
forcing TZ to (PST8PDT)
TZ (PST8PDT)
Error connecting
$ show_res
SERVERMODE (NORMAL)
forcing TZ to (PST8PDT)
TZ (PST8PDT)
Traceback (most recent call last):
  File "/state/partition1/catalina/bin/show_res", line 223, in
    reservations_list.sort(by_endpriority)
  File "/state/partition1/catalina/bin/show_res", line 47, in by_endpriority
    if first['jobs_db_handle'][0][first['job_runID']]['priority'] > second['jobs_db_handle'][0][second['job_runID']]['priority'] :
KeyError: 'priority'
The DB was corrupted; Kenneth removed and recreated it.
increase nfsd NFSv4 server on ubuntu
Edit /etc/default/nfs-kernel-server and adjust
RPCNFSDCOUNT
Afterwards:
$ /etc/init.d/nfs-kernel-server restart
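A hedged example (the thread count is illustrative; pick a value to match the workload):

# /etc/default/nfs-kernel-server
RPCNFSDCOUNT=16

$ /etc/init.d/nfs-kernel-server restart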
reuse ssh connection ports
~/.ssh/config
controlmaster auto
controlpath /tmp/ssh-%r@%h:%p
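A complete ~/.ssh/config stanza (the Host * wrapper is an assumption; scope it to specific hosts if preferred):

# ~/.ssh/config
Host *
    ControlMaster auto
    ControlPath /tmp/ssh-%r@%h:%p

The first connection to a host becomes the master; later connections multiplex over its socket instead of opening new ports.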
X11 forwarding
on rzri6e:
xauth list
on gordon-ln2: xauth
xauth> add rzri6e:0 MIT-MAGIC-COOKIE-1
xauth> exit
export DISPLAY=rzri6e:0
$ qsub -I -X -q normal -l nodes=1:ppn=1:native,walltime=10:00
qsub -I fails
pbs_mom: LOG_ERROR::Operation now in progress (115) in
start_interactive_session, cannot open interactive qsub socket to host
trestles-login1.sdsc.edu:44641 - 'cannot connect to port 777 in client_to_svr -
errno:115 Operation now in progress' - check routing tables/multi-homed host issues
login nodes with multiple entries for ...sdsc.edu
cleaned up:
198.202.118.30  trestles-login1.sdsc.edu
198.202.118.31  trestles-login2.sdsc.edu
torque 2.4.6 no mail
set server mail_domain = never
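The setting is applied via qmgr (standard TORQUE syntax):

# qmgr -c "set server mail_domain = never"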
gmond buffer overflow
it's a bug in the 3.2 version of ganglia:
/cgi-bin/bugzilla/show_bug.cgi?id=298
torque 4.2.6
defect: Undefined
(15002) in send_job_work,
fixed in 4.2.6.h1, 4.2.7
Nov 12 11:39:03 ion-21-14 /usr/sbin/gmetad[18907]: RRD_update /var/lib/ganglia/rrds/Gordon Hadoop Cluster/gcn-20-36.local/mem_cached.rrd: illegal attempt to update using time when last update time is (minimum one second step)
On my side, the problem is fixed. It was a side effect of the usage of the
"spoof" option: when "spoof" was used with an IP/hostname couple not attached
to a gmond, the interactive port of gmetad reported the summary metrics of
the cluster twice.
PBS_Server: LOG_ERROR::get_node_from_str, Node ion-21-1 is reporting on node ion-21-1.sdsc.edu, which pbs_server doesn't know about
/etc/sysconfig/pbs_mom
SBIN_PATH=/opt/torque/sbin
# NOTE: hostname flag must match what is listed in nodelist
PBS_DAEMON="$SBIN_PATH/pbs_mom -H ion-21-1"
PBS_HOME=/var/spool/torque
mpi job not running under tomcat account
The problem: when a job submitted by the tomcat account requests more than 16
processors (that is, two or more nodes), MPI errors occur while it executes.
The following works:
export MV2_USE_SHARED_MEM=0
mpirun_rsh -export -np $NUM_OF_PROCESSORS -hostfile $PBS_NODEFILE $taudem_bin/osu_bcast
root@local mail
-r root@gordon-fe3.sdsc.edu .....
root aliases mail
/etc/aliases
(expanded from ): host postal.sdsc.edu[132.249.20.114] said: 553 5.1.8 Sender domain local does not exist (in reply to end of DATA command)
Configure for .local masquerade:
myorigin = $myhostname
mydomain = local
mydestination = $mydomain, $myhostname, localhost.$mydomain, gordon-fe3.$mydomain
append_dot_mydomain = no
inet_interfaces = 127.0.0.1, 10.5.1.1
mynetworks = 10.5.0.0/16, 127.0.0.0/8
masquerade_classes = envelope_sender, envelope_recipient, header_sender, header_recipient
masquerade_domains = !gordon-fe3.sdsc.edu sdsc.edu
masquerade_exceptions = root
local_header_rewrite_clients = permit_mynetworks
smtp_generic_maps = hash:/etc/postfix/generic
/etc/postfix/generic:
sender-canonical
@local @gordon-fe3.sdsc.edu
recipient-canonical
root@sdsc.edu root
Create DB:
# postmap /etc/postfix/generic
# service postfix restart
check postfix configuration:
postconf -d
List job exec host
for i in `cat /tmp/jobs.to.check | awk -F"." '{ print $1 }'`; do grep $i 201309* | grep exec_host | awk '{ print $1 " " $12 }' | awk -F"/" '{ print $1 $3 }' ; done
X forwarding
# ssh -Y gordon-ln2
[hocks@gordon-ln2 ~]$ env|grep DISPLAY
DISPLAY=localhost:13.0
[hocks@gordon-ln2 ~]$ qsub -X -I -q normal -l nodes=1:native,walltime=2:00:00
qsub: waiting for job 1806045.gordon-fe2.local to start
Jan 14 11:10:46 gcn-5-51 pbs_mom: LOG_ERROR::Operation now in progress (115) in start_interactive_session,
cannot open interactive qsub socket to host gordon-ln2.sdsc.edu:50654 - 'cannot connect to port 777 in client_to_svr -
errno:115 Operation now in progress' - check routing tables/multi-homed host issues
flash drives missing
# lsmod|grep scsi
scsi_transport_iscsi    2 ib_iser,libiscsi
modprobe mpt2sas
# lsmod|grep scsi
scsi_transport_sas
scsi_transport_iscsi    2 ib_iser,libiscsi
INTEL SSDSA2CW08 4PC1
INTEL SSDSA2CW08 4PC1
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
INTEL SSDSA2BZ30 0362
set up /etc/modprobe.d/mpt2sas.conf
alias scsi_hostadapter mpt2sas
java with user memory limits
$ java -Xmx1G -Xmn1G -Xms1G -version
Error occurred during initialization of VM
Too small initial heap for new size specified
[mahidhar@gordon-ln1 ~]$ java -Xmx256m -version
java version "1.7.0_13"
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
(512m does not :))
iser drives not mounting
ion-21-14 reinstalled, tgt running but
# tgtadm --mode target --op show
LUN information:
---- No devices!
look for previously created raid on flash drives (does NOT get remove by installation!)
# mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Tue Aug 27 13:41:47 2013
Raid Level : raid0
Array Size : (4471.38 GiB 4801.10 GB)
Raid Devices : 16
Total Devices : 16
Persistence : Superblock is persistent
RaidDevice State
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
active sync
stop raid:
mdadm --stop /dev/md127
remove superblock per drive:
mdadm --misc --zero-superblock /dev/sdc .....
for i in $(lsscsi | grep SSDSA2BZ30 | awk '{ print $7 }'); do echo $i; mdadm --zero-superblock $i; done
tgtd: tgt_mgmt(395) driver iser is in state: error
reboot system
kernel: rpcbind[3140] general protection ip:7f sp:7fff3b785040 error:0 in libc-2.12.so (deleted)[7f+18a000]
related to channeld (411) make or 411get
Regarding the other problem of the channeld crash, my guess is that
since the RPC libraries have been updated (you are running 6.4 and not
6.3, the version we used to compile channeld), you might want to try
to recompile the channeld rpm on 6.4, so the C stubs will be generated
with the correct version of rpcgen.
To factor out the rpcgen issue you should also try to recompile the
package sec-channel, it also uses rpcgen and it generates two
packages: rocks-sec-channel-client-xxxx.rpm and
rocks-sec-channel-server-xxxx.rpm.
postgres does not start
runuser: /dev/null: Permission denied
/etc/passwd:
postgresql:x::Database Admin,SDSC:/:/dev/null
/etc/init.d/postgresql-9.0:
if [ -x /sbin/runuser ]; then
    SU="runuser -s /bin/bash"
else
    SU="su -s /bin/bash"
fi
For restoring the iSER drives without a reboot
ssh $node 'umount /scratch; sleep 2; ls -l /dev/sdb; mount /dev/sdb /scratch'
CentOS 6 kernel pxe boot problem for localboot
reference:
http://www.syslinux.org/wiki/index.php/Hardware_Compatibility#LOCALBOOT_on_I
There is a problem with compute nodes booting from their local drive after
the original PXE install of the Rocks compute node image.
'LOCALBOOT 0' will hang
use chain.c32 module and following syntax to boot up the first hard drive:
LABEL localboot
MENU LABEL Boot from first hard drive
COM32 chain.c32
APPEND hd0 0
# cp /usr/share/syslinux/chain.c32 /tftpboot/pxelinux/
# rocks add bootaction action=hpos args="hd0" kernel="com32 chain.c32" .
bootaction:
com32 chain.c32
--------------------- hd0 0
# rocks set host runaction action=hpos
xterm -e ssh -p2200 gcn-14-12
ssh unsuccessful
ssh baitaluk@ion-21-12.sdsc.edu
/var 100% full
TypeError: 'NoneType' object is not iterable
# rocks list host
TypeError: 'NoneType' object is not iterable
# rocks list roll
error - unknown roll name "%"
make sure /var is not 100% full either.
The daemon is started by:
/etc/init.d/foundation-mysql
service foundation-mysql start
If the daemon is up you probably have messed up the authentication.
You should be able to connect to the DB from the frontend with:
/opt/rocks/bin/mysql --defaults-extra-file=/root/.f --user=root
Rocks yum install
# yum update mvapich2_intel_ib
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package mvapich2_intel_ib.x86_64 0:1.9a2-0 will be updated
---> Package mvapich2_intel_ib.x86_64 0:1.9a2-1 will be an update
--> Finished Dependency Resolution
Dependencies Resolved
apbs_intel_openmpi_ib-1.3-4.x86_64 has missing requires of libmkl_sequential.so()(64bit)
add to version.mk
RPM.EXTRAS = "AutoReq: no"
rebuilt the roll and distro.
flash drive xfs_log_force: error 5
ion-21-12 kernel: Filesystem "md0": xfs_log_force: error 5 returned.
Error 5 is EIO
xfs_log_force basically passed the error through
from the underlying device.
Most likely it is an issue with the
underlying storage device.
end_request: I/O error, dev sdn, sector
sd 8:0:3:0: [sdn] Synchronizing SCSI cache
sd 8:0:3:0: [sdn] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Rocks default route
rocks list host route gcn-3-71
gcn-3-71: 0.0.0.0          198.202.100.1   H
gcn-3-71: 172.25.32.0      255.255.254.0   192.168.230.1 G
gcn-3-71: 192.168.110.100  255.255.255.255 10.5.1.1
gcn-3-71: 224.0.0.0        255.255.255.0
gcn-3-71: 255.255.255.255  255.255.255.255 eth0
# rocks remove host route gcn-3-71 0.0.0.0
# rocks add host route gcn-3-71 0.0.0.0 198.202.104.1 netmask=0.0.0.0
gcn-3-71: 0.0.0.0          198.202.104.1   H
gcn-3-71: 172.25.32.0      255.255.254.0   192.168.230.1 G
gcn-3-71: 192.168.110.100  255.255.255.255 10.5.1.1
gcn-3-71: 224.0.0.0        255.255.255.0
gcn-3-71: 255.255.255.255  255.255.255.255 eth0
umount busy filesystem
force,lazy
umount -fl /filesystem
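To see what is holding the filesystem before forcing the unmount (standard psmisc tool):

# fuser -vm /filesystem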
GSI-enabled OpenSSH
debug1: Unspecified GSS failure.
Minor code may provide more information
Unknown code krb5 195
Feb 15 10:30:24 gcn-17-11 sshd[131322]: error: USAGE-STATS: Error initializing (usage-stats.cilogon.org:4810) (VvMm)
Feb 15 10:30:24 gcn-17-11 sshd[131322]: error: Error initializing Globus Usage Metrics, but continuing ...
http://grid.ncsa.illinois.edu/ssh/privacy.html
set the DisableUsageStats option in sshd_config to "true", or set
"GLOBUS_USAGE_OPTOUT=1" in the environment of the sshd.
switch off 411 on Compute node
Trestles-login1 PGI flexlm
# ssh trestles-login1
# su - diag
# cd /home/diag/pgi/linux86-64/10.9/bin
# lmgrd -c /home/diag/pgi/license.dat -l /home/diag/pgi/flexlm.log
# lmutil lmstat
Flexible License Manager status on Thu 1/31/
License server status: 27000@trestles-login1.sdsc.edu
License file(s) on trestles-login1.sdsc.edu:
/home/diag/pgi/license.dat:
trestles-login1.sdsc.edu: license server UP (MASTER) v11.7
Vendor daemon status (on trestles-login1.sdsc.edu):
pgroupd: UP v11.7
0 10:40 pts/11
00:00:00 lmgrd -c /home/diag/pgi/license.dat -l /home/diag/pgi/flexlm.log
00:00:00 pgroupd -T trestles-login1.sdsc.edu 11.7 3 -c /home/diag/pgi/license.dat
--lmgrd_start 510aba90
mpi end points open
mpirun job won't start due to left over end points
mpirun left over on shared nodes
https://www.osc.edu/~djohnson/mpiexec/
Mpiexec is a replacement program for the script mpirun. It is used to initialize a parallel job from within a PBS
batch or interactive environment. Mpiexec uses the task manager library of PBS to spawn copies of the executable on
the nodes in a PBS allocation.
mount to NFS server '10.6.104.116' failed: System Error: Connection refused
iptables (Scott):
rocks add firewall host=ion-21-15 network=ibnet0 protocol=udp service=2049 chain=INPUT action=ACCEPT rulename=A50-IBNFS-SERVER-nfsudp
rocks add firewall host=ion-21-15 network=ibnet0 protocol=tcp service=2049 chain=INPUT action=ACCEPT rulename=A50-IBNFS-SERVER-nfstcp
sudo rocks add firewall host=ion-21-15 network=ibnet0 protocol=tcp service=111 chain=INPUT action=ACCEPT rulename=A50-IBNFS-SERVER-portmaptcp
rocks add firewall host=ion-21-15 network=ibnet0 protocol=udp service=111 chain=INPUT action=ACCEPT rulename=A50-IBNFS-SERVER-portmapudp
rocks sync host firewall ion-21-15
/etc/hosts.allow on ion-21-15
portmap: 10.5.0.0/255.255.0.0, 10.6.104.0/255.255.255.0
mountd: 10.5.0.0/255.255.0.0, 10.6.104.0/255.255.255.0
statd: 10.5.0.0/255.255.0.0, 10.6.104.0/255.255.255.0
rquotad: 10.5.0.0/255.255.0.0, 10.6.104.0/255.255.255.0
lockd: 10.5.0.0/255.255.0.0, 10.6.104.0/255.255.255.0
channeld: 10.5.104.116, 10.5.1.1
flashing blue ID LED
on GB812X blue ID LED above USB is flashing
and cannot be switched off via ID button
Usually when a blue LED is blinking, the IPMI detects an activity in the motherboard and
creates a log on the "ipmi sel". You can check it by this command 'ipmitool sel elist'.
--: reseat blade
grub menu in SOL
/etc/grub.conf
##hiddenmenu
serial --unit=0 --speed=115200
terminal --timeout=30 serial console
version 0.97
(640K lower / 883712K upper memory)
+-------------------------------------------------------------------------+
| CentOS (2.6.32-220.23.1.el6.vSMP.4.x86_64)
| vSMP (2.6.32-220.7.1.el6.vSMP.4.x86_64)
| Rocks (2.6.34.7-1)
+-------------------------------------------------------------------------+
Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, 'e' to edit the
commands before booting, 'a' to modify the kernel arguments
before booting, or 'c' for a command-line.
The highlighted entry will be booted automatically in … seconds.
kernel panic
2.6.32-220.23.1.el6.vSMP.4.x86_64
134.440344] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
134.452676] dracut Warning: Signal caught!
134.461269] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
134.472883] Kernel panic - not syncing: Attempted to kill init!
134.479917] Pid: 1, comm: init Tainted: G ---------------- 2.6.32-220.23.1.el6.vSMP.4.x86_64 #1
134.491411] Call Trace:
134.494474] [] ? panic+0x78/0x143
134.500391] [] ? __raw_callee_save_vsmp_irq_enable+0x11/0x26
134.509085] [] ? do_exit+0x862/0x870
--: wrong kernel parameter: isolcpus=!0-255
Unable to connect to license manager
ScaleMP 5-0-75.12
Retrieving licenses
FAILED (271)
Unable to connect to license manager (10.5.1.2:61319). Retry: (262)
Unable to connect to license manager (10.5.1.2:61319). Retry: (264)
Retrieving licenses
Retrieving licenses
/state/partition1/VSMP/ScaleMP/lmgr/vsmp.license
HOST gordon-fe2.local e 61319
ISV scalemp
LICENSE scalemp nlm_checkout_lock 1.0 permanent 1 hostid=ANY
_ck=421efc3e5d sig="60PG4521HYKKB6YTB0YVEACSBYNN1CK4MF8GYF022M08JHVJ
9A1MS8M861QRNUK56NBHVD185SQG"
LICENSE scalemp vsmp-slcb-s1 1.0 permanent 2176 hostid=ANY share=i
hold=300 options=eos_date=;sn=60dd458a0e _ck=a6bd8d6968 sig=
"60PG450THWF01CKV15RCNH3P29CHETG6N5CM5Q822M08RWB8N9X3J4MHWUS65JE8FXJ
A6NFT596G"
PBS cpuset 0
Starting PBS
mount: cpuset already mounted or /dev/cpuset busy
mount: according to mtab, cpuset is already mounted on /dev/cpuset
/bin/echo: write error: Numerical result out of range
/bin/echo: write error: Invalid argument
/etc/init.d/pbs
/bin/echo 0-511 > /dev/cpuset/torque/cpus
/bin/echo 0-64 > /dev/cpuset/torque/mems
For a 16-way node:
/etc/init.d/pbs
/bin/echo 0-255 > /dev/cpuset/torque/cpus
/bin/echo 0-32 > /dev/cpuset/torque/mems
gaussian crash
Out of 3 runs, it completed once; node gcn-17-11 crashed with a panic request.
The issue has been identified and fixed in vSMP Foundation 5.0.75.20 which should be
released next week. (1/17/13)
gaussian not working: preload library problem
In testsh, please change following lines:
GHOME=/home/jpg/gaussian_nir
export LD_PRELOAD=\$GHOME/libompfix.so
===> export LD_PRELOAD=/home/jpg/gaussian_nir/libompfix.so
If you still see the same error, please send me output of "cat /proc//maps".
Bad ACL entry in host list MSG=First bad host: gordon-fe3.local
add host to /etc/hosts
qemu-kvm -smp with more than 16 cpus
max 16 cpus supported in KVM
default version: kvm-qemu-img-83-224.el5.centos.1
kvm-83-249.el5.centos.5.x86_64.rpm
hadoop install
yum install hadoop
hadoop-1.0.3-4.x86_64 from Rocks-5.4.3 has depsolving problems
--> Missing Dependency: perl(Thrift) is needed by package hadoop-1.0.3-4.x86_64(Rocks-5.4.3)
hadoop-1.0.3-4.x86_64 from Rocks-5.4.3 has depsolving problems
--> Missing Dependency: perl(Types) is needed by package hadoop-1.0.3-4.x86_64(Rocks-5.4.3)
[root@ion-21-14 RPMS]# rpm -ivh --nodeps hadoop-1.0.3-4.x86_64.rpm
Preparing...
########################################### [100%]
########################################### [100%]
in /opt/hadoop
Could not initialize SDL - exiting
If you receive this message, a qemu process is not able to write to your display.
You need to run qemu in a Windowed environment like GNOME or KDE. Not through ssh
Gordon: vncviewer
no vSMP SOL logs on IO nodes
vSMP configfile:
diag_device=serial
SolBaudRate=38400
on compute
SolBaudRate=115200,n,8
on IO nodes
SOL baud rate configuration matches the host serial port setting.
#ipmitool sol set help
SOL set parameters and values:
set-in-progress set-complete | set-in-progress | commit-write
true | false
force-encryption
true | false
force-authentication
true | false
privilege-level
user | operator | admin | oem
character-accumulate-level
character-send-threshold
retry-count
retry-interval
non-volatile-bit-rate
serial | 9.6 | 19.2 | 38.4 | 57.6 | 115.2
volatile-bit-rate
serial | 9.6 | 19.2 | 38.4 | 57.6 | 115.2
#ipmitool sol info 1
Set in progress
: set-complete
Force Encryption
Force Authentication
Privilege Level
Character Accumulate Level (ms) : 60
Character Send Threshold
Retry Count
Retry Interval (ms)
Volatile Bit Rate (kbps)
must match Remote Console Bit Rate in BIOS
Non-Volatile Bit Rate (kbps)
Payload Channel
: 1 (0x01)
Payload Port
[root@ion-1-1 ~]# ipmitool sol info 1
Set in progress
: set-complete
Force Encryption
Force Authentication
Privilege Level
Character Accumulate Level (ms) : 50
Character Send Threshold
Retry Count
Retry Interval (ms)
Volatile Bit Rate (kbps)
Non-Volatile Bit Rate (kbps)
Payload Channel
: 1 (0x01)
Payload Port
Compute node:
Set in progress
: set-complete
Force Encryption
Force Authentication
Privilege Level
Character Accumulate Level (ms) : 60
Character Send Threshold
Retry Count
Retry Interval (ms)
Volatile Bit Rate (kbps)
Non-Volatile Bit Rate (kbps)
Payload Channel
: 1 (0x01)
Payload Port
after power outage: IO node ipmi problem
ipmitool -I lanplus -H ion-1-6.ipmi -U ADMIN -P ADMIN chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
Unable to get Chassis Power Status
unplug both power cables for a minute and reconnect
rocks run roll fails
xml.sax._exceptions.SAXParseException: :53:14: not well-formed (invalid token)
node install fails
rocks list host profile gcn-18-11
xml.sax._exceptions.SAXParseException: :46:10: syntax error
node attribute malformed:
gcn-18-11: Kickstart_PrivateDNSServers 198.202.75.26
should be:
gcn-18-11: Kickstart_PrivateDNSServers 10.5.1.1,198.202.75.26
Input/output error
[root@gordon-fe1 init.d]# cat postfix
cat: postfix: Input/output error
sd 0:2:0:0: rejecting I/O to offline device
Aug 22 04:34:49 gcn-7-53 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
ISCSI not communicating
List cpu binding per process
[4:43:50 PM] Shai Fultheim: the number in the middle is the PSR = processor.
[4:44:05 PM] Shai Fultheim: 12 processes, starts at cpu 256.
17-way: starts at 256:
# ps -eao pid,psr,cmd | grep socknal
[socknal_sd00]
[socknal_sd01]
[socknal_sd02]
[socknal_cd00]
9 [socknal_cd01]
9 [socknal_cd02]
9 [socknal_cd03]
9 [socknal_reaper]
grep socknal
34-way: starts at 512:
# ps -eLo pid,time,psr,ucmd | grep socknal
:00:06 512 socknal_sd00
:00:01 513 socknal_sd01
:00:03 514 socknal_sd02
:00:04 515 socknal_sd03
:00:00 517 socknal_cd03
:00:00 513 socknal_reaper
:00:00 513 socknal_cd04
:00:00 513 socknal_cd05
rocks run host vsmp command="ps -eLo pid,time,psr,ucmd | grep socknal" collate=true > /tmp/vsmp.luster.procs
List the version strings embedded in a binary module
#strings /lib/modules/2.6.32-220.7.1.el6.vSMP.3.x86_64/kernel/drivers/net/bnx2x/bnx2x.ko | grep -i version=
version=1.70.00-0
srcversion=6A5A19EFF0B787AC073CB60
New IO node BIOS
Configuring system
Board BIOS setting (HT off):
Board BIOS setting (HT off):
81:00.0#1=>03:05.07.09.11.13.15.17
19.21.23.25.27.29.31
Board BIOS setting (HT on):
Board BIOS setting (HT on):
81:00.0#1=>..:....................
....................
BIOS settings not equal on all boards
please contact ScaleMP
IPMIView - KVM console
after power cycle hit the Del key
BIOS setting:
VT disabled
(Advanced - Processor&Clock Options - Intel VT)
SMT disabled
(Advanced - Processor&Clock Options - simultaneous multithreading (SMT))
Start IPMIView:
===============
wget /utility/IPMIView/IPMIView-2.9.25-build130828.zip
tar -xzvf IPMIView-2.9.25-build130828/IPMIView_2.9.25_bundleJRE_Linux_x64_.tar.gz
on rzri6e:
# ssh -Y -C -L 5988:localhost:5988 gordon-fe3
on gordon-fe3: # vncserver :88 -localhost
on rzri6e:
# vncviewer localhost:88
in IPMIView window:
[hocks@gordon-fe3 ~]$ cd IPMIView_2.9.25_bundleJRE_Linux_x64_
[hocks@gordon-fe3 IPMIView]$ ./IPMIView20
bonding set with wrong interface included
ifenslave -d bond1 eth18
service network restart
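To confirm the bond state after the restart (standard bonding proc interface):

# cat /proc/net/bonding/bond1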
vSMP panic crash gcn-3-31 7/6
accidental reboot of ion-21-6 (Rick)
vSMP Diagnostics
during boot:
for system settings / diagnostics (continuing in 5 seconds)
Diagnostics
BIOS Information
Vendor: Intel Corp.
Version: SE5C600.86B.99.99.x038.
Release Date: 10/24/2011
Warning: memory access to device 01:00.0 failed: Input/output error.
Warning: Fallback on IO: much slower, and unsafe if device in use.
Save to USB
-> Windows FAT format
lsscsi with el6 kernel
[102:0:0:0]
INTEL SSDSA2BZ30 0362
[102:0:1:0]
INTEL SSDSA2BZ30 0362
[102:0:2:0]
INTEL SSDSA2BZ30 0362
[102:0:3:0]
INTEL SSDSA2BZ30 0362
cat /proc/scsi/scsi
grep ^ /sys/block/*/device/model | grep SSDSA2BZ30$ | awk -F/ '{print $4}'
lustre patch 4/28
lustre code: /home/hocks/rpmbuild/SOURCES/lustre_b1_8.tgz
Could you rebuild lustre on this node.
Please note that latest version of the placement
patch is at ~nir/lustre/lustre_b1_8.thread-affinity.diff.
The changes from previous
version is that placement now affect all threads, even ones that created on runtime
(connection daemons).
The control parameter changed from nsched to ncpus.
/etc/modprobe.d/ksocklnd should be:
options ksocklnd ncpus=12
options ksocklnd cpu_affinity_off=256
cd /home/hocks/rpmbuild/BUILD
tar xzf ../SOURCES/lustre_b1_8.tgz
cd lustre_b1_8
patch -p1 < ~nir/lustre/lustre_b1_8.thread-affinity.diff
checking for automake >= 1.9... found 1.9.6
checking for autoconf >= 2.57... build/autogen.sh: line 53: autoconf: command not found
autoconf is missing.
Version 2.57 (or higher) is required to build Lustre.
--: install autogen 5.59
yum remove autoconf
yum install autoconf.noarch
yum install automake
./configure --with-linux=/usr/src/kernels/2.6.32-220.7.1.el6.vSMP.4.x86_64 --with-o2ib=no
Sat, 28 Apr 2012
EXT3-fs aborted journal
vSMP kernel: 2.6.32-220.7.1.el6.vSMP.3.x86_64
[] journal commit I/O error
[] EXT3-fs (sdag2): error: ext3_journal_start_sb: Detected aborted journal
[] EXT3-fs (sdag2): error: remounting filesystem read-only
[] EXT3-fs error (device sdag2): ext3_find_entry: reading directory #576616 offset 0
[528] EXT3-fs error (device sdag2): ext3_readdir:
this is a known problem.
Suggested fix is :
"have nobarrier in /etc/fstab for each ext3/ext4 entry,
as well as rootflags=nobarrier in /boot/grub/menu.lst as one of the kernel params".
This is *counter intuitive* but documented to resolve the issue.
I do not know if that is the same issue as yours, but I checked, and the redhat kernel (which we are
now using), as well as community kernel 3.1 and on, set barriers to ON in ext3_fill_super()
by default.
To clarify, in our old 2.6.32.54.xx kernel it was set to OFF.
To compare,
you can search for "set_opt(sbi->s_mount_opt, BARRIER)" in fs/ext3/super.c of a target
kernel (see http://lxr.linux.no/linux+v3.0/fs/ext3/super.c#L1716 vs. http://lxr.linux.no/linux+v3.1/fs/ext3/super.c#L1728).
The git commit for this change is:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.a=h=00eacd66cd8ab5fff9df49aa3f261ad43d495434).
More about this: http://lwn.net/Articles/283161/.
qsub -I fails
-l nodes=gcn-3-51:debug,walltime=1:00:00
Apr 23 15:51:58 gcn-17-71 pbs_mom: LOG_ERROR::Operation now in progress (115) in
TMomFinalizeChild, cannot open interactive qsub socket to host gordon-ln2.local:34606 -
'cannot connect to port 1023 in client_to_svr - connection refused' - check routing
tables/multi-homed host issues
I do see a bad job in the system:
Negative remaining time (40232.gordon-fe2.local) for (-331.0).
Using Dispatch Time - Now, instead!
40232.gordon-fe2.local
-17:36:31 04/22 10:30
Job removed solved the problem
linux hetro-mode boot hangs
booting mixed hardware with mix_boards=15
Booting processor 256 APIC 0x1000 ip 0x6000
--: grub.conf
include arg=.... noxsave
kernel bug:
cannot run code with AVX instructions:
boot hetro-mode with the noxsave parameter, otherwise the linux kernel thinks all processors have xsave support.
See: http://lxr.linux.no/linux+v2.6.32.54/Documentation/kernel-parameters.txt#L1657
Lustre performance over 10GigE (ion)
$ numactl --cpunodebind=65 /opt/iperf/bin/iperf -c 172.25.33.112 -t 60 -w 32M -i 5 -P 4
------------------------------------------------------------
Client connecting to 172.25.33.112, TCP port 5001
TCP window size:
256 KByte (WARNING: requested 32.0 MByte)
------------------------------------------------------------
5] local 192.168.230.92 port 34780 connected with 172.25.33.112 port 5001
4] local 192.168.230.92 port 34777 connected with 172.25.33.112 port 5001
3] local 192.168.230.92 port 34778 connected with 172.25.33.112 port 5001
6] local 192.168.230.92 port 34779 connected with 172.25.33.112 port 5001
[ ID] Interval       Transfer      Bandwidth
       0.0- 5.0 sec   268 MBytes    450 Mbits/sec
       0.0- 5.0 sec   259 MBytes    434 Mbits/sec
       0.0- 5.0 sec   273 MBytes    458 Mbits/sec
       0.0- 5.0 sec   271 MBytes    455 Mbits/sec
[SUM]  0.0- 5.0 sec   1.05 GBytes   1.80 Gbits/sec
$ numactl --cpunodebind=65 /opt/iperf/bin/iperf -f M -p 5001 -t 20 -c 172.25.33.234
------------------------------------------------------------
Client connecting to 172.25.33.234, TCP port 5001
TCP window size: 0.03 MByte (default)
------------------------------------------------------------
3] local 192.168.230.92 port 52951 connected with 172.25.33.234 port 5001
[ ID] Interval       Transfer       Bandwidth
[  3]  0.0-20.0 sec   21886 MBytes   1094 MBytes/sec
It drops with more threads:
[diag@gcn-17-51 ~]$ numactl --cpunodebind=65 /opt/iperf/bin/iperf -f M -p 5001 -t 20 -c 172.25.33.234 -P 4
------------------------------------------------------------
Client connecting to 172.25.33.234, TCP port 5001
TCP window size: 0.03 MByte (default)
------------------------------------------------------------
6] local 192.168.230.92 port 52948 connected with 172.25.33.234 port 5001
3] local 192.168.230.92 port 52945 connected with 172.25.33.234 port 5001
4] local 192.168.230.92 port 52946 connected with 172.25.33.234 port 5001
5] local 192.168.230.92 port 52949 connected with 172.25.33.234 port 5001
[ ID] Interval       Transfer      Bandwidth
       0.0-20.0 sec   1121 MBytes   56.0 MBytes/sec
       0.0-20.0 sec   3018 MBytes   151 MBytes/sec
       0.0-20.0 sec   1079 MBytes   53.8 MBytes/sec
       0.0-20.1 sec   3195 MBytes   159 MBytes/sec
[SUM]  0.0-20.1 sec   8412 MBytes   419 MBytes/sec
tests with cpunodebind:
numactl --cpunodebind=1 dd ibs=4M if=/oasis/scratch/pcicotti/temp_project/striped/file obs=4M of=/scratch1/file iflag=direct oflag=direct
Performance: cpunode64,65 ~150MB/s
tests with fio:
fio --ioengine=sync --thread --iodepth=1 --group_reporting --time_based --runtime=60 --filename=testfile --direct=1 --bs=4m --rw=read --numjobs=1 --size=1g --name=1 --stonewall
Here you can replace read with write; everything else is the same. You can also increase numjobs, but I observed that it makes no difference on vsmp, although it does slightly on native.
ksh syntax error
[ Read 365 lines (Converted from DOS format) ]
vsmppp.script
vsmppp.script: syntax error at line 40: `do' unexpected
$ dos2unix vsmppp.script > vsmppp.script.n
vsmppp.script.n
# rocks add bootaction action="vSMP16p" ramdisk="vSMP16.p"
ip, = self.db.fetchone()
TypeError: unpack non-sequence
# rocks report bug
fixed by adding an interface on the private network to gscb-3-1
Configuring system 59%
FAILED (909)
Unable to initialize backplane interface
- please contact ScaleMP
However, in order to further troubleshoot it would help if we
can have full remote access to this super-node:
1) The ability to power-cycle
2) Access to SOL
3) Ability to load specific version of vSMP Foundation
Please let us know if/when this can be arranged.
killall -9 paratec
paratec run, killing a user process should not bring down the node
this is again a software issue on our end, in the *same* module
mentioned earlier (PINNING/NASRAM)
PINNING issues are now fixed in version 4.0.115.8
3-11 kernel dump
[] INFO: task paratec:30820 blocked for more than 120 seconds.
[] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[] ? nfs_wait_bit_uninterruptible+0x0/0x10[nfs]
software issue in vSMP_Foundation module (PINNING/NASRAM)
17-11: Link Rate:
[scalemp@gordon-fe1 logs]$ grep -a 'Finding secondary boards' gcn-17-11.sol.out.-09\:56 | tail -2 | uniq
Finding secondary boards (34/34):
01:00.0#1=>03:05.07.09.15.17.23.25
While in 3-11, you could see:
[scalemp@gordon-fe1 logs]$ grep -a 'Finding secondary boards' gcn-3-11.sol.out.-10\:42 | tail -2 |
Finding secondary boards (34/34):
81:00.0#1=>03:05.09.11.17.19.29.31
Finding secondary boards (34/34):
01:00.0#1=>03:05.09.11.17.19.31.07
kernel dump
Call Trace
warn_slowpath_common
boot from /dev/sdah
/etc/grub.conf kernel setting:
kernel /boot/vmlinuz-2.6.32.46-9.vSMP ro root=/dev/sdag1
"ps" hung :
NFS server went down?
The fact that it's hung trying to read information about pid 17398, and pid 17398 is in D (disk wait) state,
crash, running paratec,
board 5 problem? 17-16
crash, running hpcc ,
port 33 2-48 long recovery
SOL log kernel parameter:
rhgb console=ttyS0,115200 console=tty1 earlyprintk=serial,ttyS0,115200 consoleblank=0 debug
17-76 fault lights red: bad memory or ipmi firmware issue
dmidecode -t 17
17-11 crash
memory pinning issue
64-core Paratec wrong result
Sun Jan 1 09:14:39 PST 2012
Error in Total Energy: -11 found - should be -24
NERSC_TIME 240.090
NERSC time is the total wall clock time for the job rather than a time
per board or MPI process.
2 failures out of 7562 runs to date
possible issue in the MPI library (not vSMP Foundation)
The fix for it is already available as part of a new version of mpich2 (3.5.13)
fault light on, vSMP 4.0.115
17-21 fault light: failed DIMM, restarted , HT enabled
12/27: occasional slowdown WRF runs
Good runs take about 455-480s. The bad ones go s.
Kernel team reviewed dmesg output, and believe this is caused by a
known (community) kernel bug using a global lock (cpa_lock).
Although this issue is rare, you can upgrade 1 super-node to a newer kernel and re-test
it. We are about to release this kernel as our new GA kernel within a few weeks, but it is
already being used by several customers.
The RPM files are here:
/uploads/shared/kernel-2.6.32.50-1.vSMP.tar.bz2
MD5Sum: 07c4eaf599eed677f563
The full archive for binary installation is here:
/uploads/shared/kernel-2.6.32.50-1.vSMP.x86_64.tar.bz2
MD5Sum: fab7d3a46bb1f273ce4eb99
18-13.. all
boot hangs with grub error 15
18-53.. all
boot hangs with grub error 15
Wed Dec 21 13:54:52 PST 2011
gcn-2-51: Wed Dec 21 14:00:27 PST 2011
Wed Dec 21 13:54:54 PST 2011
gcn-3-51: Wed Dec 21 14:00:29 PST 2011
Wed Dec 21 13:54:56 PST 2011
gcn-4-51: off
never used
gcn-5-11: up
ion-1-13 Unable to set Chassis Power Control to Cycle
IB problems with the ion nodes. Port 0 in rack5, which connects to the ION
nodes, is either flapping (2nd switch) or missing.
gcn-5-51: Wed Dec 21 14:00:33 PST 2011
gcn-6-11: down
ion-10-1: down ion-10-2: down
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
gcn-6-51: Wed Dec 21 14:00:35 PST 2011
gcn-7-11: Wed Dec 21 13:55:02 PST 2011
gcn-7-51: down
panic request points to an IB cabling problem, either with the 1st or 15th board
sol logs sent to ScaleMP
gcn-8-11: Wed Dec 21 14:03:19 PST 2011
gcn-8-51: Wed Dec 21 14:16:21 PST 2011
gcn-9-11: down
IB ports problem: port IB0 25, 33
IB1, port IB1 07, 25 (9-18 and 9-28)
gcn-9-51: off
never used
up 2 days, 18:46,
load average: 512.11
load average: 0.00
blue screen 0x1d8728, fixed in 4.0.110 (12/21)
VT not enabled, new BIOS 038, bios tool syscfg updated (12/19)
IPMI missing gcn-3-17
degraded link
10 Gb/s , port 25 2nd switch
IPMI missing ion-1-12
HT enabled (BIOS shows disabled) gcn-5-73 ; up
blue screen CPU 0@0
motherboard replaced 12/14
blue screen CPU 0@0
blue screen CPU 0@0
blue screen CPU 0@1
port 11 gcn-17-61, cable reseated (12/7)
17-78 is crossed with 17-58 (12/13)
gcn-17-67 (port 31)
gcn-17-68 (port 33)
blue screen CPU 0@0
-- ib boards problem, gcn-19-13
does not boot (12/1)
-- reinstalled (12/5)
18-11 ipmi activate (12/5)
after boot
18-52 SOL activate, local root disk reinstall (12/5)}
