Well, it was a busy weekend of troubleshooting and a lot of work. I may have solved a few issues, but it is not perfect yet!
The OEM updates (I tried a few) would not work because of a PSID mismatch; if there is a workaround, please let me know. I'm not able to find any firmware online for the PSID of the M3601Q switch.
[root@headnode Infini Switch firmware]# ls
fw-sx-9_2_8000-0269NG_B1.bin
[root@headnode Infini Switch firmware]# lspci | grep Mellanox
07:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]
[root@headnode Infini Switch firmware]# mstflint -d 07:00.0 -i fw-sx-9_2_8000-0269NG_B1.bin b
Current FW version on flash: 2.10.2132
New FW version: 9.2.8000
-E- PSID mismatch. The PSID on flash (DEL0A10210018) differs from the PSID in the given image (DEL09E0210003).
[root@headnode Infini Switch firmware]#
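A small sketch for reading that error, in case it helps anyone following along. It only parses the message text pasted above; the function name psids_from_error is made up for illustration. One thing worth noting: the PSID on flash (DEL0A10210018) and the current version (2.10.2132) are the same values ibv_devinfo reports further down as board_id and fw_ver for the ConnectX-3, so it looks like mstflint -d 07:00.0 was comparing the switch image against the HCA's flash rather than the switch's. That is an inference from the output, not something confirmed.

```shell
#!/bin/sh
# Extract both PSIDs from an mstflint "PSID mismatch" line.
# Prints "<psid on flash> <psid in image>".
psids_from_error() {
    printf '%s\n' "$1" |
        sed -n 's/.*PSID on flash (\([^)]*\)).*given image (\([^)]*\)).*/\1 \2/p'
}

# Error line copied from the session above.
err='-E- PSID mismatch. The PSID on flash (DEL0A10210018) differs from the PSID in the given image (DEL09E0210003).'

set -- $(psids_from_error "$err")
echo "flash=$1 image=$2"

# To see what is on the card itself (needs the real hardware):
#   mstflint -d 07:00.0 query
```

If the two PSIDs belong to different devices, as seems to be the case here, forcing the burn would be the wrong fix.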
I tried forcing the GUID on the command line as suggested, since I don't have an opensm.conf file anywhere.
Then I went ahead and uninstalled Mellanox OFED and started over with OpenFabrics OFED. There were a few missing-dependency errors (cmake, libnl3-devel, numactl-devel, devel-grind); after getting those RPMs and their dependencies sorted out, it did install. The port GUID was recognized and InfiniBand is active. DHCP didn't work, so I set the address up manually; it may not be perfect yet. The lingering issues are OFED related: I can't seem to get opensm to start automatically, so it has to be started by hand with # /etc/init.d/opensmd start. After starting it, ibv_devinfo and nmcli connection show give:
[root@headnode ~]# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.10.2132
node_guid: 0002:c903:00f9:32f0
sys_image_guid: 0002:c903:00f9:32f3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: DEL0A10210018
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
[root@headnode ~]# nmcli connection show
NAME UUID TYPE DEVICE
Wired connection 2 a40b3b41-66e7-3d87-a77c-e79ccd002698 802-3-ethernet em1
Wired connection 3 7b5a96ce-3df4-3534-8a35-b430f3f1e3e5 802-3-ethernet em2
ib0 b4fdfa83-45ba-4904-a8ec-377234b898ee infiniband ib0
virbr0 d36acaba-3663-4199-ae03-0b2a39aa75df bridge virbr0
Bridge em1 1dad842d-1912-ef5a-a43a-bc238fb267e7 bridge --
Bridge em2 0578038a-64e9-a2fd-0a28-e4cd0b553930 bridge --
System ib0 2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89 infiniband --
System pem1 c19149d5-4e53-4636-b52a-81d213a8a3cb 802-3-ethernet --
System pem2 7379072d-ea75-335e-2486-0afa3cd10c77 802-3-ethernet --
Wired connection 1 d4070b38-e850-4a48-83a7-223ecca993f7 802-3-ethernet --
ib0 4e22b1f1-3e0c-4b84-b0d9-85b0755728ac infiniband --
ib0 152321c5-8ba1-4865-9eca-5a18a889ffb7 infiniband --
ib1 9fd439a6-da5e-4928-9265-47a636b3aaea infiniband --
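The listing above has several infiniband profiles named ib0 (plus "System ib0"), but only one is bound to the device. Whether the leftovers cause any trouble is a guess, but a rough sketch for picking them out follows. It assumes nmcli's terse output (-t) with the fields NAME,UUID,TYPE,DEVICE, where an unbound profile has an empty DEVICE field; the function name unused_ib_profiles is made up.

```shell
#!/bin/sh
# Print the UUIDs of infiniband profiles that are not bound to a device.
# Input: "NAME:UUID:TYPE:DEVICE" lines, the shape produced by
#   nmcli -t -f NAME,UUID,TYPE,DEVICE connection show
unused_ib_profiles() {
    awk -F: '$3 == "infiniband" && ($4 == "" || $4 == "--") { print $2 }'
}

# Sample rows taken from the listing above, rewritten in terse form.
unused_ib_profiles <<'EOF'
ib0:b4fdfa83-45ba-4904-a8ec-377234b898ee:infiniband:ib0
System ib0:2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89:infiniband:
ib1:9fd439a6-da5e-4928-9265-47a636b3aaea:infiniband:
EOF
```

On the real machine the nmcli command would be piped into the function, and any profile that is truly unwanted could then be removed with nmcli connection delete uuid followed by its UUID.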
[root@headnode ~]# ifconfig -a ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
inet 10.1.27.7 netmask 255.0.0.0 broadcast 10.1.77.77
inet6 fe80::202:c903:f9:32f1 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 289 bytes 19652 (19.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
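One detail in that output: with address 10.1.27.7 and netmask 255.0.0.0, the broadcast shown (10.1.77.77) does not match what the netmask implies, so it is probably a leftover from the manual setup. A minimal sketch that computes the expected value; the function name broadcast is just for illustration.

```shell
#!/bin/sh
# Compute the IPv4 broadcast address from an address and a dotted netmask.
broadcast() {
    ip=$1 mask=$2
    old_ifs=$IFS
    IFS=.
    set -- $ip $mask        # $1..$4 = address octets, $5..$8 = mask octets
    IFS=$old_ifs
    # Broadcast octet = address octet OR inverted mask octet.
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
}

broadcast 10.1.27.7 255.0.0.0    # prints 10.255.255.255
```

If a /8 was not intended for the IPoIB network, the netmask is the thing to revisit rather than the broadcast.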
Next steps: I'm waiting on two Dell flash SD cards for the CMC so I can get all the drivers updated on the chassis and nodes. Updating through UEFI is a lot slower, and some drivers are too big for it anyway. Hopefully the I/O update will help! After that, I may do a fresh install of Rocks Cluster 7 (Manzanita) and try earlier versions of Mellanox OFED such as 4.1 or 3.xx. I can also come back to OFED later.
Issues persisting: The OFED commands ibstat, ibhosts, etc. do not work, perhaps a failure on the OFED side. ib0 still shows the hardware address warning in ifconfig, perhaps a firmware issue. The HCA test commands do not work either, though the card seems fine since the port is active. I also have a separate issue with the Rocks Cluster command "insert-ethers" not responding when I try to add the switch and compute nodes, hence the reinstall.
Sorry, this seems like a mess; thank you for your time! I know I'll get around it one way or another, and I may even have to buy a newer M4001 switch that has current drivers. I wonder if Mellanox would share archived M3601Q firmware?
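On the opensm autostart issue mentioned earlier: since the service is started through /etc/init.d/opensmd, enabling it at boot would normally go through chkconfig (or systemctl, which hands SysV scripts over to chkconfig on CentOS 7 based systems such as Rocks 7). The sketch below only prints the command it would suggest rather than running it; the function name boot_enable_cmd is made up, and it assumes the init script carries the usual chkconfig header.

```shell
#!/bin/sh
# Print (do not run) a command that should enable a SysV-style service at boot.
boot_enable_cmd() {
    svc=$1
    if command -v chkconfig >/dev/null 2>&1; then
        echo "chkconfig $svc on"
    else
        echo "systemctl enable $svc"
    fi
}

boot_enable_cmd opensmd
```

Running the printed command as root and rebooting would show whether opensm then comes up on its own.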