
The purpose of this blog is to provide a reference guide covering the basics of configuring and troubleshooting OVS hardware offload with OpenStack and Mellanox SmartNICs.
Let's begin.
Building blocks:
The drawing below depicts the building blocks involved when OVS hardware offload is deployed with OpenStack. This feature has been available as a Tech Preview since Red Hat OSP 13 (community version Queens).

Let's understand each of them.
Nova
The Nova scheduler requires no changes for the offload feature and behaves the same as SR-IOV passthrough for instance spawning. The Nova scheduler should be configured to use the PciPassthroughFilter, and Nova compute should be configured with a passthrough_whitelist. os-vif figures out the name of the representor port mapped to the VF port plugged into the VM, and plugs the representor port into br-int.
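A minimal sketch of the corresponding nova.conf fragments; the filter list, PCI address, and physical network name below are illustrative assumptions, not values taken from a real deployment:

```ini
# controller: append PciPassthroughFilter to your existing scheduler filters
[filter_scheduler]
enabled_filters = ...,PciPassthroughFilter

# compute: whitelist the VFs of the offload-capable PF (example address)
[pci]
passthrough_whitelist = {"address": "*:03:00.*", "physical_network": "physnet2"}
```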
Neutron
When offload is enabled, the NIC e-switch mode is changed to "switchdev". This creates representor ports on the NIC, each mapped to a created VF. Representor ports are plugged into the host and carry packets between the VFs and the kernel switching layer.
In order to allocate a port from a switchdev-enabled NIC, the Neutron port needs to be created with a binding profile specifying the capabilities as below.
$ openstack port create --network private --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' direct_port1
The admin passes this port while creating the instance. The representor port gets associated with the instance's VF interface and is plugged into the OVS bridge br-int for one-time OVS datapath processing.
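The port is then handed to Nova at boot time; a hypothetical example, where the flavor and image names are placeholders:

```shell
$ openstack server create --flavor m1.small --image rhel-7.6 \
    --nic port-id=direct_port1 offload-vm1
```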
OVS
In an offload environment, the first packet of a flow traverses the OVS kernel datapath. Please refer to the "Packet journey" section for a graphical representation. This first traversal lets ML2/OVS figure out the rules for the instance's incoming and outgoing traffic. Once all flows pertaining to a particular traffic stream are formed, OVS uses the TC flower utility to push and program these flows onto the NIC hardware.
We are required to apply the following configuration on ovs in order to enable hardware offload.
$ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
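The flag only takes effect once the switch daemon rereads it; restarting Open vSwitch and reading the value back is a reasonable sanity check (the service name may vary by distribution):

```shell
$ sudo systemctl restart openvswitch
$ sudo ovs-vsctl get Open_vSwitch . other_config:hw-offload
```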
TC subsystems
OVS uses the TC datapath when the hw-offload flag is enabled. TC flower is an iproute2 utility used to write datapath flows to hardware.
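To get a feel for what OVS programs on our behalf, here is a hypothetical hand-written TC flower rule of the same shape; the device names and MAC address are placeholders, not values OVS would necessarily generate:

```shell
# attach an ingress qdisc, then a flower filter that redirects matching
# traffic from a representor port to the uplink, hardware-only (skip_sw)
$ sudo tc qdisc add dev eth3 ingress
$ sudo tc filter add dev eth3 ingress protocol ip prio 1 flower \
    dst_mac f8:f2:1e:03:bf:e4 skip_sw \
    action mirred egress redirect dev bond-pf
```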
We are required to apply the config below on OVS. This is also the default behavior if tc-policy is not explicitly configured.
$ sudo ovs-vsctl set Open_vSwitch . other_config:tc-policy=both
This makes sure that when a flow is programmed, it is programmed on both the hardware and software datapaths. If hardware programming fails, traffic can still be steered through the TC software datapath, albeit at a performance cost.
NIC drivers (PF and VF)
Mellanox ConnectX-5 uses mlx5_core as its PF and VF driver. This driver generally takes care of table creation on hardware, flow handling, switchdev configuration, block device creation, etc. The devlink tool is used to set the mode of the PCI device, as below.
$ sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
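We can confirm the mode change took effect; the PCI address is an example:

```shell
$ sudo devlink dev eswitch show pci/0000:03:00.0
# the output should report: mode switchdev
```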
NIC firmware
NIC firmware is software stored in non-volatile memory and persists its configuration across reboots; it is, however, runtime modifiable. It generally takes care of maintaining tables/rules, fixing the pipeline of tables, hardware resource management, and VF creation. Along with the driver, we need firmware support for any feature to work smoothly.
We need to apply the following configuration on the interface to enable TC flower flow programming at the port level.
$ sudo ethtool -K enp3s0f0 hw-tc-offload on
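The setting can then be verified on the interface:

```shell
$ ethtool -k enp3s0f0 | grep hw-tc-offload
hw-tc-offload: on
```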
Packet journey:

Note:
- The flow illustrated here is just an example. Actual flows would contain more match and action fields, such as VXLAN tunnel or VLAN encap/decap.
- The packet flow remains the same for VLAN and VXLAN once the flow is programmed on the NIC. In the case of VXLAN, br-tun processes the encap/decap of the VXLAN transport layer header during first-packet processing.
- A flow is programmed for each direction of traffic, i.e. instances sending and receiving traffic will have 2 flows programmed, one per direction.
- If the action is not supported, the rule cannot be offloaded. Similarly, when the fields cannot be identified, the rule cannot be offloaded. For a list of supported classifiers and actions, see the "Currently supported classifiers" and "Currently supported actions" tables below.
A sample capture of ping traffic from an instance is shown in the "Flow dissection" section below.
Basic troubleshooting:
Feature supported versions
Linux Kernel >= 4.13
Open vSwitch >= 2.8
OpenStack >= Pike
iproute >= 4.12
Mellanox NICs FW
FW ConnectX-5: >= 16.21.0338
FW ConnectX-4 Lx: >= 14.21.0338
Note: Mellanox ConnectX-4 NIC supports only VLAN Offload. Mellanox ConnectX-4 Lx/ConnectX-5 NICs support both VLAN/VXLAN Offload.
Supported bond types:
Bonding configurations validated are
- active-backup – mode=1
- active-active or balance-xor – mode=2
- 802.3ad (LACP) – mode=4
Currently supported classifiers (OVS 2.11):
- in_port
- Eth_type=any: L2 src MAC, L2 dst MAC
- Eth_type=VLAN: VLAN TPID, VLAN ID, VLAN PCP
- Eth_type=MPLS: MPLS label stack
- IP proto=IPv4/IPv6: IPv4 src/dst, IPv6 src/dst, IP ToS, TTL, fragmentation
- IP proto=TCP/UDP: TCP src port, TCP dst port, TCP flags, UDP src port, UDP dst port
- IP proto=SCTP: src port, dst port
- Tunnel: IPv4 src/dst, IPv6 src/dst, tunnel ToS, tunnel TTL, tunnel dst port
Currently supported actions (OVS 2.11):
- VLAN pop
- VLAN push (TPID, TCID, PCP, CFI)
- Encapsulation: tunnel id, IPv4 src/dst, IPv6 src/dst, ToS, TTL, L4 dst
- Output
- Drop
Flow dissection
Offloaded flow functionality is very similar to what hardware switches/routers do with ASIC chipsets. Once a routing/forwarding decision is made and written to the ASIC, the ASIC takes care of matching and acting on incoming packets, providing wire-speed processing. Routers/switches provide access to an ASIC shell to inspect table entries when that sort of debugging is required.
Take a look at the Broadcom chipset output below, taken from a switch running Cumulus Linux.
root@dni-7448-26:~# cl-bcmcmd l2 show
mac=00:02:00:00:00:08 vlan=2000 GPORT=0x2 modid=0 port=2/xe1
mac=00:02:00:00:00:09 vlan=2000 GPORT=0x2 modid=0 port=2/xe1 Hit
However, NIC vendors do not provide such access, leaving troubleshooters to guess what might be wrong in the NIC hardware when a flow is not offloaded.
With this limitation, operators are left to trust OVS and TC to ensure that flows were formulated in accordance with forwarding policies and pushed to hardware.
Let’s dissect a sample flow taken from ovs datapath table and try to understand it.
Below is a transactional pair of flows. In the first flow, a packet from the outside network hits the bond interface (PFs bonded) and is pushed to the representor port (eventually ingressing to the VM); in the second, a packet from the representor port (egress from the VM) goes out via the bond.
ufid:6ef8818a-17df-4714-9010-1203cbc163c9, skb_priority(0/0),skb_mark(0/0),in_port(bond-pf),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:51:c3:7c,dst=f8:f2:1e:03:bf:e4),eth_type(0x8100),vlan(vid=398,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:11, bytes:1078, used:0.080s, offloaded:yes, dp:tc, actions:pop_vlan,eth3
ufid:724ad8e3-2f49-4aa9-8ea7-b5331fba2a1f, skb_priority(0/0),skb_mark(0/0),in_port(eth3),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:e4,dst=fa:16:3e:51:c3:7c),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:11, bytes:1122, used:0.080s, offloaded:yes, dp:tc, actions:push_vlan(vid=398,pcp=0),bond-pf
Ufid: unique flow identifier
Skb_priority: this qdisc field is not written on hardware. It has a mask value /0 which makes it ignorable when processing packets.
Skb_mark: this qdisc field is not written on hardware. It has a mask value /0 which makes it ignorable when processing packets.
In_port: port of incoming packet
Packet_type: There are many packet types supported by ovs. However, while offloading, this is not written to hardware, and its mask value makes it ignorable when processing packets.
Eth_type: offload supports all eth_type values, and the eth_type is written to hardware. However, except for VLAN and MPLS, the inner fields of other eth_types are not used for matching against the packet; forwarding is done only by eth_type. I.e. if eth_type=0x0806, the ARP headers are not used for matching actions.
Eth (src,dst): Layer 2 Ethernet source and destination MAC addresses
Vlan: vlan id and pcp (priority code points)
ipv4(src/dst): since the mask here is /0, these are ignored and any ipv4 src, dst packet processed by this flow.
ipv4(proto,tos,ttl,frag): protocol of the next header, ToS byte, time to live and fragment offset.
Statistics: Numbers here are pulled from interfaces.
Used: time since offloaded
Offloaded: ovs side confirmation
dp:tc – the datapath used here is TC
Actions:
Pop_vlan: remove vlan header
Output: send to egress interface
Let’s take a look at flow with arp.
ufid:31ec10fa-ffdc-454b-867f-f673dae77ea6, skb_priority(0/0),skb_mark(0/0),in_port(bond-pf),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:51:c3:7c,dst=f8:f2:1e:03:bf:e4),eth_type(0x8100),vlan(vid=398,pcp=0),encap(eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00)), packets:1, bytes:56, used:4.950s, offloaded:yes, dp:tc, actions:pop_vlan,eth3
ufid:7329344d-b5f9-4b03-b26a-a9c7318b24b7, skb_priority(0/0),skb_mark(0/0),in_port(eth3),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:e4,dst=fa:16:3e:51:c3:7c),eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), packets:0, bytes:0, used:5.150s, offloaded:yes, dp:tc, actions:push_vlan(vid=398,pcp=0),bond-pf
Here, the entire ARP header (sender IP/hardware address, target IP/hardware address) is masked with 0, so only eth_type=0x0806 is used to make decisions and offloaded to hardware.
Below is offloaded flow with VxLAN.
ufid:d56153f3-f8c7-4e8a-a6ce-3ddeb73cf27a, skb_priority(0/0),tunnel(tun_id=0x3d,src=10.10.147.101,dst=10.10.147.104,ttl=0/0,tp_dst=4789,flags(+key)),skb_mark(0/0),in_port(vxlan_sys_4789),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:45:a9:e2,dst=f8:f2:1e:03:bf:e6),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:19, bytes:1862, used:0.080s, offloaded:yes, dp:tc, actions:eth0
ufid:3bcda9c2-2d36-4692-82c2-aa631f015560, skb_priority(0/0),skb_mark(0/0),in_port(eth0),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:e6,dst=fa:16:3e:45:a9:e2),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0x3,ttl=0/0,frag=no), packets:18, bytes:2736, used:0.090s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x3d,src=10.10.147.104,dst=10.10.147.101,ttl=64,tp_dst=4789,flags(key))),vxlan_sys_4789
The tunnel type depends on the interface's encapsulation type; VXLAN, NVGRE, RAW and GENEVE are supported.
flags(key) is ignored and not offloaded to hardware.
System validations
- Ensure SR-IOV and VT-d are enabled on the system.
- Enable IOMMU in Linux by adding intel_iommu=on to kernel parameters, for example, using GRUB.
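On a RHEL-based system, the GRUB step typically looks like the following; file paths and the grub command vary by distribution and boot mode:

```shell
# append intel_iommu=on (and commonly iommu=pt) to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then regenerate the grub config and reboot
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
```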
Debugs:
- OVS debug output is currently not very helpful; the debuggability of flow programming has, however, been improved in the latest version. We can still enable debug logging on OVS as below.
$ ovs-appctl vlog/set dpif_netlink:file:DBG;
Logs are printed at /var/log/openvswitch/ovs-vswitchd.log.
For example, the following is a capture of a failed flow programming attempt.
# less /var/log/openvswitch/ovs-vswitchd.log | grep -Ei "ERR|syndrome"
2019-02-19T12:48:46.126Z|00001|dpif_netlink(handler1)|ERR|failed to offload flow: Operation not supported
Checking the ovsdb logs in parallel can also be helpful; they are available at /var/log/openvswitch/ovsdb-server.log.
Mellanox provides a system information script which captures all the required information from the system. It is available at
https://github.com/Mellanox/linux-sysinfo-snapshot/blob/master/sysinfo-snapshot.py
And can be run as
# ./sysinfo-snapshot.py --asap --asap_tc --ibdiagnet --openstack
CLIs:
$ ovs-dpctl dump-flows -m type=offloaded
$ tc filter show dev ens1_0 ingress
$ tc -s filter show dev ens1_0 ingress
$ tc monitor
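On recent kernels, the tc filter output marks each rule with an in_hw (or not_in_hw) flag, which is a quick way to confirm whether a rule actually made it to hardware; the interface name below is an example:

```shell
$ tc -s filter show dev ens1_0 ingress | grep -E "in_hw|not_in_hw"
```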
Ethtool provides more insights:
- To view the number of channels: ethtool -l <uplink rep>
- To check statistics: ethtool -S <uplink/VF rep>
- To see driver information: ethtool -i <uplink rep>
- To check ring sizes: ethtool -g <uplink rep>
- To list enabled features: ethtool -k <uplink/VF rep>
In parallel, we can check the traffic flow using tcpdump on the representor and PF ports.
Notes
- Admin operations on a VF's link state are controlled via the VF representor's link state. That means any change to the link state of the representor port affects the link state of the VF.
- Representor port statistics also present the statistics of the VF.
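For example, with a representor port named eth3 (the name is illustrative), bringing it down also brings the VF link down inside the guest, and its counters include the VF's traffic:

```shell
$ sudo ip link set dev eth3 down    # VF inside the VM loses link
$ sudo ip link set dev eth3 up      # VF link comes back up
$ ip -s link show dev eth3          # counters reflect VF traffic as well
```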
Limitations:
- The "openvswitch" firewall driver cannot be used with offload, because the connection-tracking properties of flows are not supported in the offload path. The default "iptables_hybrid" driver, or "none", is supported. Patches to support connection tracking have been merged into OVS 2.13.
Comment (reader question):
Hi, I am Jack Chen.
I have created VFs and created a VM using OpenStack. I created a VF LAG by following the instructions in the Mellanox official guide.
1. Create the VFs:
echo 12 > /sys/class/net/enp4s0f0/device/sriov_numvfs
echo 12 > /sys/class/net/enp4s0f1/device/sriov_numvfs
2. Unbind the VFs:
echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
…
3. Switch the PFs to switchdev mode:
devlink dev eswitch set pci/0000:3b:00.0 mode switchdev
devlink dev eswitch set pci/0000:3b:00.1 mode switchdev
4. Run the Open vSwitch service:
systemctl start openvswitch
5. ovs-vsctl add-br ovs-sriov
6. ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
7. systemctl restart openvswitch
8. Configure the SR-IOV VF LAG:
modprobe bonding mode=802.3ad
ifup bond0 (make sure the ifcfg file is present with the desired bond configuration)
ip link set enp4s0f0 master bond0
ip link set enp4s0f1 master bond0
ovs-vsctl add-port ovs-sriov bond0
ip link set dev bond0 up
The following information is displayed in the kernel log when the bond is created:
mlx5_cmd_check:795:(pid 40329): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x2c5b14)
mlx5_create_lag:321:(pid 40329): Failed to create LAG (-22)
mlx5_activate_lag:360:(pid 40329): Failed to activate VF LAG
Make sure all VFs are unbound prior to VF LAG activation or deactivation
In addition, data packets sent from VMs cannot be distributed to different network ports.