Need help, I'm running out of ideas!
I have a Dell M1000e blade chassis with M3601Q 40 Gbps Mellanox InfiniBand switches in I/O slot B1C1, connected to the midplane on C1. I have PowerEdge M910 blades with the J05YT ConnectX-3 mezzanine card installed. I have installed the latest MLNX_OFED 4.4. The OS is based on CentOS 7.4 within a Rocks Manzanita cluster. Since these are blades, the connection is via the midplane. The switch lights are steady and good.
After following prior posts and running commands such as ibhosts, ibstat, lspci | grep Mell, lspci -Qvvs 07:00.0, ifconfig -a, hca_self_test.ofed, and mstflint -d 07:00.0 q, the best I can tell is that my ports are Down/Initializing and I have a subnet manager issue. I cannot get a port to Active or get an IP address to show. Can you please help me diagnose this? I'll post some of the relevant output below; let me know what else is required.
Thank you very much!
[root@headnode /]# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 2
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7): 3.10.0-693.el7.x86_64
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v2.10.2132
Firmware on CA #1 HCA .................. v2.10.2132
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 0
Port State of Port #1 on CA #0 (HCA)..... DOWN (InfiniBand)
Port State of Port #2 on CA #0 (HCA)..... DOWN (InfiniBand)
Port State of Port #1 on CA #1 (HCA)..... INIT (InfiniBand)
Port State of Port #2 on CA #1 (HCA)..... DOWN (InfiniBand)
Error Counter Check on CA #0 (HCA)...... FAIL
REASON: found errors in the following counters
Errors in /sys/class/infiniband/mlx4_0/ports/1/counters
link_error_recovery: 93
symbol_error: 65535
Error Counter Check on CA #1 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... 00:02:c9:03:00:f9:2e:80
Node GUID on CA #1 (HCA) ............... 00:02:c9:03:00:f9:32:f0
------------------ DONE ---------------------
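The error counters that the self test flags live under /sys/class/infiniband/mlx4_0/ports/1/counters, so they can be re-read after each change to see whether they are still climbing. The following is only a minimal Python sketch of that check. The function names and the saturation table are illustrative assumptions, not part of MLNX_OFED. The table assumes symbol_error is a 16-bit counter, which is why a reading of 65535 is treated as "pegged" rather than as an exact count.

```python
from pathlib import Path

# Assumption: symbol_error is a 16-bit counter, so 65535 means it is pegged
# and the true number of symbol errors may be higher.
ASSUMED_SATURATION = {"symbol_error": 0xFFFF}


def read_port_counters(counters_dir):
    """Return {counter_name: int} for each readable file in a counters directory."""
    counters = {}
    for entry in sorted(Path(counters_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files that cannot be read or do not hold a plain integer.
            continue
    return counters


def flag_counters(counters, saturation=None):
    """Return {counter_name: note} for every counter that is nonzero."""
    limits = ASSUMED_SATURATION if saturation is None else saturation
    flagged = {}
    for name, value in counters.items():
        if value == 0:
            continue
        limit = limits.get(name)
        if limit is not None and value >= limit:
            flagged[name] = "saturated at %d, true count may be higher" % limit
        else:
            flagged[name] = "nonzero (%d)" % value
    return flagged
```

On the head node this would be called as flag_counters(read_port_counters("/sys/class/infiniband/mlx4_0/ports/1/counters")). Comparing two readings taken a few minutes apart shows whether the errors are historical or still accumulating.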
[root@headnode /]# ibhosts
Ca : 0x0002c90300f92e80 ports 2 "headnode HCA-1"
[root@headnode /]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.10.2132
Hardware version: 0
Node GUID: 0x0002c90300f92e80
System image GUID: 0x0002c90300f92e83
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f92e81
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f92e82
Link layer: InfiniBand
CA 'mlx4_1'
CA type: MT4099
Number of ports: 2
Firmware version: 2.10.2132
Hardware version: 0
Node GUID: 0x0002c90300f932f0
System image GUID: 0x0002c90300f932f3
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f932f1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f932f2
Link layer: InfiniBand
[root@headnode /]#
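One way to summarize the ibstat output above is sketched below in Python. It pairs each port's State with its Physical state. A port that is Initializing with LinkUp has a physical link but has not yet been configured by a subnet manager, while Down with Polling means no physical link was detected on that port. The function names and the wording of the diagnoses are illustrative assumptions, and the parser only handles the field layout shown in this post.

```python
import re


def parse_ibstat(text):
    """Parse ibstat text into a list of (ca, port, state, physical_state) tuples."""
    ports = []
    ca = None
    port = None
    state = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = int(m.group(1))
            state = None
            continue
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Physical state:") and port is not None:
            phys = line.split(":", 1)[1].strip()
            ports.append((ca, port, state, phys))
    return ports


def diagnose(state, phys):
    """Map a (State, Physical state) pair to a short, hedged reading."""
    if state == "Active":
        return "port is active"
    if state == "Initializing" and phys == "LinkUp":
        return "link is up, waiting for a subnet manager"
    if phys == "Polling":
        return "no physical link detected"
    return "unrecognized combination"
```

Feeding it the output of ibstat (for example via subprocess.run(["ibstat"], capture_output=True, text=True).stdout) and calling diagnose on each tuple would show mlx4_1 port 1 as the only port with a physical link in the output above.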
[root@headnode /]# mstflint -d 05:00.0 q
Image type: FS2
FW Version: 2.10.2132
Device ID: 4099
Description: Node Port1 Port2 Sys image
GUIDs: 0002c90300f92e80 0002c90300f92e81 0002c90300f92e82 0002c90300f92e83
MACs: 000000000000 000000000000
VSD:
PSID: DEL0A10210018
[root@headnode /]# lspci -Qvvs 05:00.0
05:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies ConnectX-3 IB QDR Dual Port Mezzanine Card
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 34
Region 0: Memory at fb100000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f4800000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: DELL ConnectX-3 Mezz
Read-only fields:
[PN] Part number: 0J05YT
[EC] Engineering changes: A00
[SN] Serial number: IL0J05YT7403125S000Q
[V0] Vendor specific: DDR/QDR SFF mezz
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 107 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 116.000W
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number 00-02-c9-03-00-f9-2e-80
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [18c v1] #19
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
[root@headnode ~]# sminfo -p 1
ibwarn: [8670] _do_madrpc: recv failed: Connection timed out
ibwarn: [8670] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)
sminfo: iberror: failed: query
[root@headnode ~]# sminfo -p 2
ibwarn: [8684] _do_madrpc: recv failed: Connection timed out
ibwarn: [8684] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)
sminfo: iberror: failed: query
[root@headnode ~]#
OpenSM output:
******************************************************************
****************** ERRORS DURING INITIALIZATION ******************
******************************************************************
Sep 12 11:18:51 735239 [A3E15700] 0x01 -> state_mgr_check_tbl_consistency: ERR 3322: lid 1 is wrongly assigned to port 0x0002c90300f92e81 ('headnode HCA-1' port 1) in port_lid_tbl
Sep 12 11:18:51 735367 [A3E15700] 0x02 -> state_mgr_check_tbl_consistency: Clearing Lid for port 0x0002c90300f92e81
Sep 12 11:18:51 735375 [A3E15700] 0x01 -> state_mgr_check_tbl_consistency: ERR 3322: lid 3 is wrongly assigned to port 0x0002c90300f932f1 ('headnode HCA-2' port 1) in port_lid_tbl
Sep 12 11:18:51 735392 [A3E15700] 0x02 -> state_mgr_check_tbl_consistency: Clearing Lid for port 0x0002c90300f932f1
Sep 12 11:18:51 735430 [A3E15700] 0x01 -> osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f932f1 (headnode HCA-2 port 1) has LID 0. An initialization error occurred. Ignoring port
Sep 12 11:18:51 735449 [A3E15700] 0x01 -> osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f92e81 (headnode HCA-1 port 1) has LID 0. An initialization error occurred. Ignoring port
Sep 12 11:18:51 735462 [A3E15700] 0x01 -> osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f92e81 (headnode HCA-1 port 1) has LID 0. An initialization error occurred. Ignoring port
Sep 12 11:18:51 735468 [A3E15700] 0x01 -> osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f932f1 (headnode HCA-2 port 1) has LID 0. An initialization error occurred. Ignoring port
Sep 12 11:18:51 735480 [A3E15700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Sep 12 11:18:51 740351 [A3E15700] 0x80 -> Errors during initialization
Sep 12 11:18:51 740385 [A3E15700] 0x01 -> do_sweep:
[root@headnode ~]# nmcli connection show ib0
connection.id: ib0
connection.uuid: 65aec7ac-2335-44aa-b9c2-0945379d8111
connection.stable-id: --
connection.interface-name: ib0
connection.type: infiniband
connection.autoconnect: yes
connection.autoconnect-priority: 0
connection.autoconnect-retries: -1 (default)
connection.timestamp: 0
connection.read-only: no
connection.permissions: --
connection.zone: --
connection.master: --
connection.slave-type: --
connection.autoconnect-slaves: -1 (default)
connection.secondaries: --
connection.gateway-ping-timeout: 0
connection.metered: unknown
connection.lldp: -1 (default)
ipv4.method: auto
ipv4.dns: --
ipv4.dns-search: --
ipv4.dns-options: (default)
ipv4.dns-priority: 0
ipv4.addresses: --
ipv4.gateway: --
ipv4.routes: --
ipv4.route-metric: -1
ipv4.ignore-auto-routes: no
ipv4.ignore-auto-dns: no
ipv4.dhcp-client-id: --
ipv4.dhcp-timeout: 0
ipv4.dhcp-send-hostname: yes
ipv4.dhcp-hostname: --
ipv4.dhcp-fqdn: --
ipv4.never-default: yes
ipv4.may-fail: yes
ipv4.dad-timeout: -1 (default)
ipv6.method: link-local
ipv6.dns: --
ipv6.dns-search: --
ipv6.dns-options: (default)
ipv6.dns-priority: 0
ipv6.addresses: --
ipv6.gateway: --
ipv6.routes: --
ipv6.route-metric: -1
ipv6.ignore-auto-routes: no
ipv6.ignore-auto-dns: no
ipv6.never-default: no
ipv6.may-fail: yes
ipv6.ip6-privacy: 0 (disabled)
ipv6.addr-gen-mode: stable-privacy
ipv6.dhcp-send-hostname: yes
ipv6.dhcp-hostname: --
ipv6.token: --
infiniband.mac-address: 80:00:02:08:FE:80:00:00:00:00:00:00:00:02:C9:03:00:F9:32:F1
infiniband.mtu: auto
infiniband.transport-mode: connected
infiniband.p-key: default
infiniband.parent: --
proxy.method: none
proxy.browser-only: no
proxy.pac-url: --
proxy.pac-script: --