Channel: Mellanox Interconnect Community: Message List

Best IPoIB settings for a mixed environment of ConnectX-2,3,4 (MT27500, MT26428, MT27700)?


I've got a cluster with a mix of IB cards (MT27500, MT26428, MT27700).  Part of the use is MPI + IB verbs for high-bandwidth/low-latency messages, but part of the use is NFS + IPoIB.  Before the newest cards I used connected mode, which worked well.  But apparently there's a new "enhanced IPoIB" that uses datagram mode, so connected mode is disabled by default.


Is that a purely software update?  Can it be used on the last few generations of Mellanox cards, or does it depend on the hardware?  If it depends on the hardware, how do I get connected mode enabled on the newest (100 Gbit) cards?

 

I installed the newest driver on Ubuntu 18.04 LTS and the install went well; it printed this:

Device #1:

----------

  Device Type:      ConnectX4

  Part Number:      MCX455A-ECA_Ax

  Description:      ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6

  PSID:             MT_2180110032

  PCI Device Name:  42:00.0

  Base GUID:        506b4b0300f36e34

  Base MAC:         506b4bf36e34

  Versions:         Current        Available

     FW             12.23.1020     12.23.1020

     PXE            3.5.0504       3.5.0504

     UEFI           14.16.0017     14.16.0017

  Status:           Up to date

Configuring /etc/security/limits.conf.

Device (42:00.0):

        42:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

        Link Width: x16

        PCI Link Speed: 8GT/s

Installation passed successfully

To load the new driver, run:

/etc/init.d/mlnx-en.d restart    

This same node was working with the inbox drivers: IPoIB worked (in datagram mode) and ibstat was happy.  Now ibstat doesn't work and ifconfig doesn't find an ib0.

On boot the device is found:

# dmesg | grep mlx

[    3.035024] mlxfw: loading out-of-tree module taints kernel.

[    3.092878] mlxfw: module verification failed: signature and/or required key missing - tainting kernel

[    3.245384] mlx5_core 0000:42:00.0: firmware version: 12.23.1020

[    3.245413] mlx5_core 0000:42:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s

[    3.245414] mlx5_core 0000:42:00.0: PCIe link width is x16, device supports x16

[    5.832251] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged

[    6.813408] mlx5_core 0000:42:00.0: FW Tracer Owner

If I restart the driver as the install mentions:

[ 3852.805318] PKCS#7 signature not signed with a trusted key

[ 3852.806388] Compat-mlnx-ofed backport release: ee7aa0e

[ 3852.806389] Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git ee7aa0e

[ 3852.806390] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git

[ 3852.812135] PKCS#7 signature not signed with a trusted key

[ 3852.826716] PKCS#7 signature not signed with a trusted key

[ 3852.837116] PKCS#7 signature not signed with a trusted key

[ 3852.875073] PKCS#7 signature not signed with a trusted key

[ 3852.882801] mlx5_core 0000:42:00.0: firmware version: 12.23.1020

[ 3852.882833] mlx5_core 0000:42:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s

[ 3852.882836] mlx5_core 0000:42:00.0: PCIe link width is x16, device supports x16

[ 3855.470717] mlx5_port_module_event: 5 callbacks suppressed

[ 3855.470724] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged

[ 3856.536266] mlx5_core 0000:42:00.0: FW Tracer Owner

[ 3856.537406] PKCS#7 signature not signed with a trusted key

Any ideas?

Oh, the forums mentioned:

# cat ib_ipoib.conf

options ib_ipoib ipoib_enhanced=0

 

Which resulted in

[ 4128.198929] ib_ipoib: unknown parameter 'ipoib_enhanced' ignored
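Two hedged observations, not a confirmed fix: the installer output above says to restart mlnx-en.d, which suggests the Ethernet-only MLNX_EN package was installed rather than MLNX_OFED; that alone would explain the missing ib0 and the unrecognized ipoib_enhanced parameter. On a node where MLNX_OFED's ib_ipoib module is loaded, the usual sketch for getting connected mode back (assuming the IPoIB interface is ib0) is:

```shell
# Assumes MLNX_OFED (not MLNX_EN) and that the IPoIB interface is ib0.
# /etc/modprobe.d/ib_ipoib.conf:
#     options ib_ipoib ipoib_enhanced=0

modprobe -r ib_ipoib && modprobe ib_ipoib   # reload with the new option
echo connected > /sys/class/net/ib0/mode    # per-interface switch
cat /sys/class/net/ib0/mode                 # expect: connected
```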


Oracle vs Teradata vs Hadoop for Big Data


Let's examine which data volumes and needs fit Oracle, Hadoop, NoSQL, and Teradata.

1) For small volumes it's beneficial to use NoSQL, if you begrudge spending $900 on Oracle SE One.

The main advantage is price: NoSQL databases are usually free.

A small volume of data also means the model and the development are not very complex.

2) For medium and large volumes, Oracle has a lot of advantages over Teradata and Hadoop.

The main advantages are:

1. Great maturity of the technology and product, and the number of implementations, compared to Teradata and Hadoop. Hadoop is young; Teradata has been on the market for 30 years but has a small number of implementations.

2. A very large set of tools and facilities that makes development easier and faster.

3. Price, compared to Teradata.

4. Good scalability and speed, but with one bottleneck: the storage subsystem, which is shared by all the compute servers.

So, up to a certain limit, Oracle has some of the best data-processing speed.

Oracle's shortcoming:

The fastest self-built storage array I have seen does 18 GB/s.

An Exadata Full Rack provides 25 GB/s (without compression, storage-cell and cache optimization) and costs around $5M.

Teradata provides 34 GB/s (without compression optimization) and costs around $5M.

A cluster of 100 commodity servers provides 40 GB/s and costs around $250,000.

Real-life example:

It is common that a full scan in Oracle is not enough.

Consider Beeline, which in 2007 had 170 million rows per day arriving in one table: every phone call in Russia.

Reading that whole table for analysis is impossible; you will never have enough hard-disk speed.

In this case the optimization is to build a few aggregates over the big table, around 4 million rows per day each, and build the reports from those aggregates.

Such optimization can be done in Oracle, Teradata, or Hadoop.

This aggregate-based optimization has 3 shortcomings:

1. If business users need a new field in a report, you have to carry that field from the fact table through the aggregates to the report. That takes a long time to develop.

2. If business users need an answer right now, it's not possible; by the time the development is done, the question may have gone stale.

3. Complicated ETL.

To address these 3 shortcomings, Beeline decided to move some tasks (especially ad-hoc ones) onto Hadoop.

3) For extremely large data you can use Hadoop.

Its advantages:

1. Almost endless scalability. You can get 25, 125, or 1000 GB/s.

2. Price: everything is free, besides the hardware of course.

Shortcomings:

1. Writing MapReduce is harder, especially if you already have trained SQL people, and the more complex the logic, the harder the MapReduce gets. So ad-hoc queries will not be as simple as in SQL.

2. Commodity servers consume more power and take up more space.

3. Hadoop on commodity servers needs the data stored redundantly (multiple copies), more than other systems require.

4) Extremely large data on Teradata.

Teradata handles brute-force work on data, such as full scans, much better than Oracle does.

Teradata's ideology is shared-nothing, and in that it resembles Hadoop: data is spread across the servers, and each server computes over its own part.

Manual sharding in Teradata is possible and even necessary.

But Teradata has one significant drawback:

1. Quite poor tools. It is not as comfortable to work with as Oracle; Oracle is a mature product, while Teradata still has some growing pains.

As for price, a Teradata Full Rack and an Exadata are comparable, around $5M.

A common point about Teradata and Hadoop (HBase) is also worth mentioning: the data has to be sharded across the nodes, and it needs to shard evenly across all of them.

For example, region is not a good sharding key for Beeline: the Moscow region alone holds 20% of the data.

A benefit of Teradata is that it effectively has four-level partitioning where Oracle has two. With 170 million rows per day, triple or quadruple partitioning is very good to have.

Teradata's limits:

With its shared-nothing architecture and BYNET V5 network, Teradata scales up to 2048 nodes and 76 TB (10K-rpm disks) per node, 324 PB in total.

One Exadata Full Rack has a more modest limit: at most 672 TB (7.2K-rpm) or 200 TB (15K-rpm). Scaling Exadata out is not very profitable, because the disk array is shared by all the servers!

If you join two Exadata machines (is that possible? theoretically, yes), everything is held back by the 40 Gbit network: each rack has fast access to its own HDDs but slower access to the other rack's HDDs, and vice versa.

We should also consider that Teradata and Exadata have columnar and hybrid compression, up to 4-6x on average. NoSQL databases also have compression, though perhaps not as advanced as in these monsters, which have spent a lot of money on it.

For a complete picture, it's worth mentioning:

1) Oracle has 3 caches: memory on the Storage Cells, SSD PCIe flash cards, and the main server's memory.

Teradata has 1 cache: the nodes' memory. But Teradata has its temperature-based (hot/cold) storage know-how!

Given this, and considering MVCC, Oracle is much more appropriate for OLTP loads.

2) Exadata has smart Storage Cells, which can filter data before sending it to the main server, filtering at the OS and HDD-firmware level.

Conclusion:

Oracle has the richest tools, and development speed will be the fastest.

Teradata also has a powerful SQL dialect, with a few shortcomings, but its procedural language and tools are poor.

Hadoop is a set of a very large number of tools, and I won't describe them all. But one fact says a lot: Hadoop was designed for batch processing with long latency and no joins.

Roughly speaking, the full-scan throughput ranking looks like this:

1. Hadoop

2. Teradata

3. Oracle

So consider what you need: fast full scans, or flexibility of the product.

Teradata and Oracle can be compared to a train and an icebreaker. Teradata is a train: it rides fast, but only on rails; you cannot turn it left or right.

Oracle is slower on big data, but it can go everywhere, you can turn it any way, it will break any ice; it has thousands of features and optimizations.

Hadoop is like a rocket: it can be very fast, but only on very specific tasks, along a very specific course.

If you don't have ad-hoc queries, all queries are known in advance, and your data is under 600 TB, buy Oracle or Exadata.

If you have more data, it makes sense to think about Teradata or Hadoop, depending on whether you need joins and SQL.

Re: rx-out-of-buffer


Thanks for the answer! But I don't see what you refer to as "the following Mellanox Community document", so I still don't know what it is. If you're referring to the line "Number of times receive queue had no software buffers allocated for the adapter's incoming traffic.", then the tuning you mention will not change the problem, because the rings are never full and the CPU is not busy. So which buffer does "rx-out-of-buffer" count, if not the ring buffers?

Re: Line rate using Connect_X5 100G EN in Ubuntu; PCIe speed difference;


Hello,

 

Q1. Have you tried a different benchmark, like iperf?

Q2. PCIe 3.0 speed is 8 GT/s and PCIe 4.0 speed is 16 GT/s; you have to get a CPU/motherboard that supports PCIe 4.0, and AFAIK there is no such thing for Intel CPUs yet.


ConnectX-4 works at FDR but not FDR10?


I have 2 sets of servers, with two tower workstations as the head nodes, one tower for each server set.  One set is 6 servers and the second is 8 servers.  The two towers are identical hardware-wise.  All machines use the same dual-port ConnectX-4 cards.  The server sets are connected to two Mellanox SX6018 switches with QSFP+ cables (one switch per server set), and the head-node towers are connected to an SX6036G with QSFP+ cables.  One port is used on each ConnectX-4 card.  This configuration has been in use for over a year, and we've switched between high-speed Ethernet with RDMA/RoCE and FDR10 InfiniBand fabric multiple times with no issues.

We recently switched to FDR InfiniBand for testing and everything worked fine, but when we switched back to FDR10 the head-node towers would no longer pass data (MPI) to the servers.  We can ping from tower to server over the InfiniBand, and ib_send_bw runs successfully between them at 38 Gb/s, but MPI can't establish a connection from tower to server.  The MPI software works fine from server to server.  The MPI software has not changed from when it previously worked at FDR10, and this configuration works flawlessly when set to FDR, but it does not work at the slower FDR10 setting.  The switches are set to auto-negotiate the fabric speed, and switch reboots have not helped.  Our customer dictates that we use FDR10, so we need to get this back up and working at FDR10.  Any suggestions?
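Not a confirmed diagnosis, but since the switches are auto-negotiating, one thing worth trying from a node with infiniband-diags installed is to query and pin the FDR10 setting on the specific switch ports facing the towers. The <lid> and <port> placeholders below are whatever iblinkinfo reports for those ports:

```shell
# Sketch: inspect and retrain the switch ports facing the head nodes
# instead of relying on auto-negotiation (infiniband-diags tools).
iblinkinfo | grep -i fdr            # see what each link actually negotiated
ibportstate <lid> <port> query      # current speed/width of one port
ibportstate <lid> <port> fdr10 on   # re-enable FDR10 on that port
ibportstate <lid> <port> reset      # retrain the link
```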

How can I create segregated IPoIB networks on my SX6036?


I have two InfiniBand networks running over multiple SX6005 (unmanaged) switches. The two networks are physically separate and require a subnet manager to be running on each network. The subnet managers are running on two Dell OptiPlex boxes outfitted with InfiniBand cards (it was a test system that made its way into production). I'd like to move both subnets onto the SX6036 switch and maintain the segregation between the InfiniBand networks. All we run over the InfiniBand networks is Gluster and iSCSI.

 

Using the web GUI, under IB SM Mgmt I see an option for "partitions".  Is this what I should be looking into, or is there a better way to achieve what I'm wanting to do?
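Partitions (PKeys) are indeed the usual way to segregate IPoIB traffic on a shared fabric. As an illustration of the concept only, using the standalone opensm partitions.conf syntax rather than the SX6036 GUI, and with made-up partition names and port GUIDs, it might look like:

```shell
# opensm partitions.conf sketch (names, keys and GUIDs are placeholders);
# the SX6036's embedded SM exposes the same concept under "partitions".
Default=0x7fff, ipoib : ALL=full ;
# gluster hosts only:
Gluster=0x8001, ipoib : 0x0002c90300a1b2c3=full, 0x0002c90300a1b2c4=full ;
# iSCSI hosts only:
iSCSI=0x8002, ipoib : 0x0002c90300d4e5f6=full, 0x0002c90300d4e5f7=full ;
```

Hosts not listed as members of a partition cannot exchange IPoIB traffic on it, which gives the segregation you had with physically separate fabrics.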

Re: Remote VTEP mac learning is not working


Re: Line rate using Connect_X5 100G EN in Ubuntu; PCIe speed difference;


Before posting, I had already followed the Performance Tuning for Mellanox Adapters guide, BIOS tuning, and Linux sysctl tuning.

What I can't understand is why I'm not able to reach 16 GT/s on the PCIe link, only 8 GT/s. Any help/pointers would be highly appreciated.

I feel the queues on the receiving machine are getting full or something; I'm not sure how I can tune them. I say this because ethtool suggests there is no packet loss.

Re: Line rate using Connect_X5 100G EN in Ubuntu; PCIe speed difference;


You have to get a motherboard that supports PCIe 4.0 to get 16 GT/s (which is about 250 Gbps at x16).
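As a sanity check on those numbers (an approximation, not a spec quotation), the raw encoded payload rate of a x16 link is per-lane transfer rate times 16 lanes times the 128b/130b line-coding efficiency:

```shell
# Raw x16 PCIe payload bandwidth in Gbit/s, before protocol overhead:
# per-lane GT/s x 16 lanes x 128/130 (PCIe 3.0/4.0 line coding).
for rate in 8 16; do
  awk -v r="$rate" 'BEGIN { printf "PCIe %s GT/s x16: %.0f Gbit/s\n", r, r*16*128/130 }'
done
```

This prints about 126 Gbit/s for PCIe 3.0 and 252 Gbit/s for PCIe 4.0, which matches the "about 250 Gbps" figure above and shows that a 100G NIC fits on an 8 GT/s x16 slot in theory, though real TLP overhead cuts into that margin.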

VLANs port Virtualization


Hello guys.

 

We acquired a ConnectX-3 to test port virtualization, but we ran into the following question...

We want to create a trunk on one of the virtualized ports. How do we do this on the Mellanox switch, if it sees only the physical port?

Another question: how can we separate each of these virtualized ports into VLANs?

Our idea is to virtualize the port, use one virtual port for iSCSI, one for backup, and another as a trunk, and deliver all of these virtualized ports to the VMs on top of a single physical port.

 

Thank you.

Re: Add iPXE support for Connectx-3-Pro MT27520

NVMeoF with ESX 6.5


Hi, has anyone successfully created an NVMeoF target using a block device in ESXi 6.5? Is it even supported? If so, can you point me to setup instructions?

 

Thanks

Anil

Re: How to enable the debuginfo for the libraries of OFED?


Instead of recompiling RPMs, I would suggest keeping a folder with the library sources compiled with debug symbols, and using LD_LIBRARY_PATH to load the files from that folder. That gives you much more flexibility when debugging an issue, since you can make changes on the fly, recompile in a second, and use the result. It also doesn't require root access to install an RPM. You can even use newer versions of libibverbs, libmlx5, and other libraries this way.

The general procedure is to open the RPM, extract the archive, configure with a custom prefix, and run make install. At run time, use LD_LIBRARY_PATH. For example:

$ rpm2cpio <RPM> | cpio -id

$ tar xf <TAR>

$ cd <CREATED FOLDER>

$ ./autogen.sh (if necessary)

$ ./configure --prefix=<PREFIX> --enable-debug

$ make -j <N> install

$ export LD_LIBRARY_PATH=<PREFIX>/lib (or lib64)
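The recipe above, filled in with an illustrative package (the libibverbs name and version are hypothetical; substitute whatever source RPM you actually have):

```shell
# Hypothetical worked example -- package name/version are placeholders.
rpm2cpio libibverbs-41mlnx1-1.41101.src.rpm | cpio -id
tar xf libibverbs-41mlnx1.tar.gz && cd libibverbs-41mlnx1
./autogen.sh                               # if the tree ships one
./configure --prefix="$HOME/dbg" --enable-debug
make -j"$(nproc)" install
export LD_LIBRARY_PATH="$HOME/dbg/lib"     # or lib64
ldd "$(which ibv_devinfo)"                 # verify the debug library is picked up
```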

Re: Mac address on routed ports belonging to separate VRFs are same


Is there any way to check the ARP table and the MAC address table inside VRF instances? Is this possible? When I tried to see the MAC addresses and details through a 3560 switch, I was unable to see the other details.


ConnectX-3 VFs communication


Hello.

 

We created VFs on a ConnectX-3.

The physical interface communicates, but the virtual interfaces do not, with or without VLANs.

Is any additional configuration needed to use VFs?

 

Thank you.

Re: How can I enable "packet pacing" on connectX-5 ?


Hi Patrick,

 

Thank you for posting your question on the Mellanox Community.

 

Packet Pacing is supported on the ConnectX-5 from firmware version 16.20.1010 and higher. You can use the following Mellanox Community document to configure Packet Pacing -> https://community.mellanox.com/docs/DOC-2479

Even though the document only mentions the ConnectX-4, the ConnectX-5 uses the same driver.

If you experience any new issues after enabling Packet Pacing, please open a Mellanox support case by sending an email to support@mellanox.com

 

Thanks and regards,

~Mellanox Technical Support

Re: How to enable the MT4103(connectx-3 pro) physical port


Hello Liang,

Thank you for posting your question on the Mellanox Community.

Based on the information provided, we see that you have correctly configured RoCEv2 on the adapter. Based on the 'ibstat' output, there is no physical link to the switch.

Please check the cable used; make sure it is a validated and supported cable according to the latest release notes of the ConnectX-3 Pro firmware in use.

Also check whether the port on the switch is disabled; if needed, move the link to another switch port. On the adapter side, please also connect the two adapters back-to-back to see if the link comes up.

If after this the issue is not resolved, please open a Mellanox Support case by sending an email to support@mellanox.com.

Thanks and regards,
~Mellanox Technical Support

Ways to pick the best Cisco SFP+ modules


If you need to shop for Cisco SFP+ modules, it makes sense to take their maximum transmission distance and their compatibility with other Cisco gear into consideration.

As for transmission distance, the ranges from 100 m to 400 m and from 10 km to 80 km are the most common. For distances from 100 m to 400 m, a Cisco 10G multimode SFP+ transceiver is the usual choice. For instance, if you need a Cisco SFP+ module for transmission within 300 m, the Cisco SFP-10G-SR module would be the best choice. For more details on the maximum transmission distances of Cisco SFP+ modules, refer to the table above.

[table image: optimal transmission distances of Cisco SFP+ modules, not reproduced]

 

Apart from distance, another vital point to be clear about is the SFP+ module's compatibility with other Cisco gear. You may wonder whether a Cisco SFP+ module can interoperate with other modules, such as 1G SFP modules. The answer is no. For example, if you connect an SFP-10G-SR to a Cisco GLC-SX-MMD SFP transceiver (1 Gbps only), they will not work together: the SFP-10G-SR only runs at a 10 Gbps link rate, so you would be forcing it to use a 1 Gbps speed, and you can never interconnect them. For more information about the compatibility of Cisco modules, search their online Compatibility Matrix.

By the way, the price of Cisco SFP+ modules is also a hurdle for many shoppers. If you search for Cisco SFP+ modules, you will find that the price from original-brand shops isn't cheap. Therefore, in recent years, using non-original-brand optical transceivers in fiber optic networks has become a trend. More and more users prefer third-party pluggable optics, such as 10Gtek's, since they are claimed to be fully compatible with the original-brand hardware while costing less.

Conclusion

As a leading player in fiber optic networking, the Cisco SFP+ module has enjoyed a glorious period. But with the unremitting efforts of other manufacturers, non-original-brand vendors will certainly be on the rise, and 10Gtek may well be one such standout.

Weekly sale on 10Gtek:

Clearance, up to 40% off 40G QSFP+ AOC:

https://www.sfpcables.com/40gbase-qsfp-aoc-cable-1-5-meter $59

https://www.sfpcables.com/40gbase-qsfp-aoc-cable-1-meter $59

Valid until 14th Oct.

sending order of 'segmented' UDP packets


Hi,

When creating a UDP packet, I need to use two mbufs: one containing the UDP header (hdr) and another holding the payload (pay):

 

struct rte_mbuf* hdr = rte_pktmbuf_alloc(hdrmp);  /* header segment  */

struct rte_mbuf* pay = rte_pktmbuf_alloc(paymp);  /* payload segment */

/* filling ether, IP, UDP headers */

...

ip_hdr->version_ihl = 0x40 | 0x05; /* (*) without the 0x05 (IHL) the order is OK */

...

/* setting segment sizes and linkage */

hdr->data_len = sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr) + sizeof(struct udp_hdr);

pay->data_len = payloadSize;

hdr->pkt_len  = hdr->data_len + pay->data_len;  /* total packet length on the head segment */

pay->pkt_len  = hdr->pkt_len;

hdr->next     = pay;

pay->next     = NULL;  /* rte_pktmbuf_alloc() already guarantees this */

hdr->nb_segs  = 2;
 

When sending plenty of such UDP packets using rte_eth_tx_burst(), all of them are sent correctly, but the sending order seems to be random. When using just a single mbuf per UDP packet, the sending order is always the order of the packets in the tx array, which is what I expect. With the header/payload-separation approach, if I omit the IP header length in the version_ihl field (producing an invalid IP packet), the sending order becomes correct again.

I'm using the mlx5 PMD; the NIC is a ConnectX-5. Could some offload mechanism be influencing the sending order? Maybe someone can help.

 

Thanks and best regards

Sofia Baran
