I've got a cluster with a mix of IB cards (MT27500, MT26428, MT27700). Part of the use is MPI + IB verbs for high-bandwidth, low-latency messaging, but part of it is NFS over IPoIB. Before the newest cards I used connected mode, which worked well. But apparently there's a new "enhanced IPoIB" that uses datagram mode, so connected mode is disabled by default.
Is that a purely software update? Can it be used on the last few generations of Mellanox cards, or does it depend on the hardware? If it depends on the hardware, how do I get connected mode enabled with the newest (100Gbit) cards?
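For reference, a minimal sketch of checking which mode an IPoIB interface is currently in, assuming the standard sysfs `mode` attribute and an interface named `ib0`. The `ipoib_mode` helper and the `SYSFS_NET` override are hypothetical names, added only so the function can be tried without hardware.

```shell
#!/bin/sh
# Print the IPoIB mode ("datagram" or "connected") of an interface.
# SYSFS_NET is a hypothetical override; it defaults to /sys/class/net.
ipoib_mode() {
    iface="${1:-ib0}"
    mode_file="${SYSFS_NET:-/sys/class/net}/$iface/mode"
    if [ -r "$mode_file" ]; then
        cat "$mode_file"
    else
        echo "no-ipoib-interface"
        return 1
    fi
}

# Report the mode of ib0, if the interface exists.
ipoib_mode ib0 || true

# Switching is normally done as root with:
#   echo connected > /sys/class/net/ib0/mode
# The driver may reject this write when connected mode is not available.
```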
I installed the newest driver on Ubuntu 18.04 LTS and the install went well. It printed this:
Device #1:
----------
Device Type: ConnectX4
Part Number: MCX455A-ECA_Ax
Description: ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6
PSID: MT_2180110032
PCI Device Name: 42:00.0
Base GUID: 506b4b0300f36e34
Base MAC: 506b4bf36e34
Versions:      Current        Available
   FW          12.23.1020     12.23.1020
   PXE         3.5.0504       3.5.0504
   UEFI        14.16.0017     14.16.0017
Status: Up to date
Configuring /etc/security/limits.conf.
Device (42:00.0):
42:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Link Width: x16
PCI Link Speed: 8GT/s
Installation passed successfully
To load the new driver, run:
/etc/init.d/mlnx-en.d restart
This same node was working with the inbox drivers: IPoIB worked (in datagram mode) and ibstat was happy. Now ibstat doesn't work and ifconfig doesn't find an ib0.
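A quick way to narrow this down is to see which of the relevant kernel modules are actually loaded. The sketch below assumes the usual module names for a ConnectX-4 setup (`mlx5_core`, `mlx5_ib`, `ib_ipoib`, and `ib_umad`, which ibstat relies on); `loaded_modules` is a hypothetical helper that reads a file in `/proc/modules` format.

```shell
#!/bin/sh
# For each module name, report whether it appears in a /proc/modules-style file.
# $1 is the file to read; the remaining arguments are module names.
loaded_modules() {
    modules_file="$1"
    shift
    for m in "$@"; do
        # Each line of /proc/modules starts with the module name and a space.
        if grep -q "^$m " "$modules_file" 2>/dev/null; then
            echo "$m loaded"
        else
            echo "$m missing"
        fi
    done
}

loaded_modules /proc/modules mlx5_core mlx5_ib ib_ipoib ib_umad
```

If `mlx5_core` is loaded but `mlx5_ib` or `ib_ipoib` is reported missing, the InfiniBand side of the stack is not loaded, which would be consistent with ibstat failing and ib0 not appearing.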
On boot the device is found:
# dmesg | grep mlx
[ 3.035024] mlxfw: loading out-of-tree module taints kernel.
[ 3.092878] mlxfw: module verification failed: signature and/or required key missing - tainting kernel
[ 3.245384] mlx5_core 0000:42:00.0: firmware version: 12.23.1020
[ 3.245413] mlx5_core 0000:42:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 3.245414] mlx5_core 0000:42:00.0: PCIe link width is x16, device supports x16
[ 5.832251] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged
[ 6.813408] mlx5_core 0000:42:00.0: FW Tracer Owner
If I restart the driver as the installer suggests, dmesg shows:
[ 3852.805318] PKCS#7 signature not signed with a trusted key
[ 3852.806388] Compat-mlnx-ofed backport release: ee7aa0e
[ 3852.806389] Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git ee7aa0e
[ 3852.806390] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
[ 3852.812135] PKCS#7 signature not signed with a trusted key
[ 3852.826716] PKCS#7 signature not signed with a trusted key
[ 3852.837116] PKCS#7 signature not signed with a trusted key
[ 3852.875073] PKCS#7 signature not signed with a trusted key
[ 3852.882801] mlx5_core 0000:42:00.0: firmware version: 12.23.1020
[ 3852.882833] mlx5_core 0000:42:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 3852.882836] mlx5_core 0000:42:00.0: PCIe link width is x16, device supports x16
[ 3855.470717] mlx5_port_module_event: 5 callbacks suppressed
[ 3855.470724] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged
[ 3856.536266] mlx5_core 0000:42:00.0: FW Tracer Owner
[ 3856.537406] PKCS#7 signature not signed with a trusted key
Any ideas?
Oh, the forums mentioned:
# cat ib_ipoib.conf
options ib_ipoib ipoib_enhanced=0
Which resulted in:
[ 4128.198929] ib_ipoib: unknown parameter 'ipoib_enhanced' ignored
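To confirm whether the installed `ib_ipoib` module knows the parameter at all, its parameter list can be inspected with `modinfo -p`. This is a minimal sketch; `has_param` is a hypothetical helper that reads `modinfo -p` style output (`name:description (type)` per line) on stdin.

```shell
#!/bin/sh
# Succeed if the parameter named $1 appears in `modinfo -p` style input on stdin.
has_param() {
    grep -q "^$1:"
}

# Check the ib_ipoib module that modinfo resolves for the running kernel.
if modinfo -p ib_ipoib 2>/dev/null | has_param ipoib_enhanced; then
    echo "ib_ipoib supports ipoib_enhanced"
else
    echo "ib_ipoib has no ipoib_enhanced parameter (or modinfo could not find the module)"
fi
```

If the parameter is not listed, the `ib_ipoib` that gets loaded does not implement it, which would match the "unknown parameter" message.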