Let's understand the Open vSwitch hardware offload!


The purpose of this blog is to provide a reference guide explaining the basics of configuring and troubleshooting OVS HW offload with OpenStack and Mellanox smart NICs.

Let's begin.

Building blocks:

The drawing below depicts the building blocks involved when OVS HW offload is deployed with OpenStack. This feature has been available as a Tech Preview since Red Hat OSP 13 (community version: Queens).

Let's understand each one of them.

Nova

The Nova scheduler doesn't require any changes for the offload feature; instance scheduling behaves the same as with SR-IOV passthrough. The Nova scheduler should be configured to use the PciPassthroughFilter, and Nova compute should be configured with a passthrough_whitelist. os-vif figures out the name of the representor port mapped to the VF that is plugged into the VM, and plugs that representor port into br-int.
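For reference, a minimal sketch of the relevant nova.conf pieces (the filter list is abbreviated, and the device name and physical network below are placeholders, not mandatory values):

[filter_scheduler]
enabled_filters = <existing filters>,PciPassthroughFilter

[pci]
passthrough_whitelist = {"devname": "enp3s0f0", "physical_network": "physnet2"}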

Neutron

When offload is enabled, the NIC e-switch mode is changed to "switchdev". This creates representor ports on the NIC, each mapped to a corresponding created VF. The representor ports are plugged into the host and carry packets between the VFs and the kernel switching layer.

In order to allocate a port from a switchdev-enabled NIC, the neutron port needs to be created with a binding profile declaring the capability, as below.

$ openstack port create --network private --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' direct_port1

The admin passes this port information while creating the instance. The representor port gets associated with the instance's VF interface and is plugged into the OVS bridge br-int for the one-time OVS datapath processing.
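For illustration, booting an instance with that port could look like the following (the flavor, image and server names are placeholders):

$ openstack server create --flavor m1.small --image rhel8 --nic port-id=direct_port1 offload_vm1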

OVS

In an offload environment, the first packet of a stream traverses the OVS kernel datapath; please refer to the "Packet journey" section for a graphical representation. This first traversal lets the ML2/OVS agent figure out the rules for the instance's incoming/outgoing traffic. Once all flows pertaining to a particular traffic stream are formed, OVS uses the TC flower utility to push and program these flows onto the NIC hardware.

We are required to apply the following configuration on ovs in order to enable hardware offload.   

$ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
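A restart of Open vSwitch is typically needed for the flag to take effect, and the value can be read back to verify it:

$ sudo systemctl restart openvswitch
$ sudo ovs-vsctl get Open_vSwitch . other_config:hw-offload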

TC subsystems

OVS uses the TC datapath when the hw-offload flag is enabled. TC flower is an iproute2 utility used to write datapath flows to hardware.

We also need to configure the tc-policy on OVS. Valid values are "none" (program the flow into both the hardware and the software datapath), "skip_sw" (hardware only) and "skip_hw" (software only); "none" is the default when tc-policy is not explicitly configured.

$ sudo ovs-vsctl set Open_vSwitch . other_config:tc-policy=none

With the default policy, a flow is programmed into both the hardware and the software datapath. If hardware programming fails, traffic can still be steered through the TC software datapath, albeit at a performance cost.

NIC drivers (PF and VF)

Mellanox ConnectX-5 uses mlx5_core as its PF and VF driver. The driver takes care of table creation on hardware, flow handling, switchdev configuration, block device creation, etc. The devlink tool is used to set the mode of the PCI device, as below.

$ sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
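The mode change can be verified by reading the e-switch mode back (the PCI address is the one used in this example):

$ sudo devlink dev eswitch show pci/0000:03:00.0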

NIC firmware

NIC firmware is software stored in non-volatile memory; its configuration persists across reboots, yet it is modifiable at runtime. It takes care of maintaining tables/rules, fixing the pipeline of tables, hardware resource management, and VF creation. Along with the driver, firmware support is also needed for any feature to work smoothly.

We need to apply the following configuration on the interface to enable TC-flower-pushed flow programming at the port level.

$ sudo ethtool -K enp3s0f0 hw-tc-offload on
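To confirm the setting took effect, the feature can be read back from the interface's feature list:

$ ethtool -k enp3s0f0 | grep hw-tc-offload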

Packet journey:

Note:

  • The flow illustrated here is just an example. Actual flows contain more match and action fields, such as VXLAN tunnel and VLAN encap/decap.
  • The packet flow remains the same for VLAN and VxLAN once the flow is programmed on the NIC. In the case of VxLAN, br-tun processes the encap/decap of the VxLAN transport-layer header during first-packet processing.
  • A flow is programmed for each direction of traffic, i.e. a pair of sending/receiving instances will have 2 flows programmed, one per direction.
  • If the action is not supported, the rule cannot be offloaded. Similarly, when the fields cannot be identified, the rule cannot be offloaded. See the lists of supported classifiers and actions below.

A sample capture of ping traffic from an instance is shown in the Flow dissection section below.

Basic troubleshooting:

Minimum supported versions

Linux Kernel >= 4.13

Open vSwitch >= 2.8

OpenStack >= Pike

iproute >= 4.12

Mellanox NICs FW

FW ConnectX-5: >= 16.21.0338

FW ConnectX-4 Lx: >= 14.21.0338

Note: Mellanox ConnectX-4 NIC supports only VLAN Offload. Mellanox ConnectX-4 Lx/ConnectX-5 NICs support both VLAN/VXLAN Offload.
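A quick way to check these versions on a compute node (the interface name is an example):

$ uname -r
$ ovs-vsctl --version
$ ip -V
$ ethtool -i enp3s0f0 | grep firmware-version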

Supported bond types:

The validated bonding configurations are as follows (a minimal setup sketch follows the list):

  • active-backup – mode=1
  • active-active or balance-xor – mode=2
  • 802.3ad (LACP) – mode=4
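As a rough sketch, an 802.3ad bond of the two PFs can be assembled with iproute2 as below (interface names are examples; the PFs must be down while being enslaved, and a persistent ifcfg/NetworkManager profile is preferable in production):

$ sudo ip link add bond-pf type bond mode 802.3ad
$ sudo ip link set enp3s0f0 down
$ sudo ip link set enp3s0f1 down
$ sudo ip link set enp3s0f0 master bond-pf
$ sudo ip link set enp3s0f1 master bond-pf
$ sudo ip link set bond-pf up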

Currently supported classifiers (OVS 2.11):

  • in_port
  • Eth_type=any: L2 src MAC, L2 dst MAC
  • Eth_type=VLAN: VLAN TPID, VLAN ID, VLAN PCP
  • Eth_type=MPLS: MPLS label stack
  • IP Proto=IPv4/IPv6: IP ToS, TTL, fragmentation, IPv4 src, IPv4 dst, IPv6 src, IPv6 dst
  • IP Proto=TCP/UDP: TCP src port, TCP dst port, TCP flags, UDP src port
  • IP Proto=SCTP: src port, dst port
  • Tunnel: IPv4 src, IPv4 dst, IPv6 src, IPv6 dst, ToS, TTL, dst port

Currently supported actions (OVS 2.11):

  • VLAN pop
  • VLAN push (TPID, TCID, PCP, CFI)
  • Encapsulation: tunnel ID, IPv4 src, IPv4 dst, IPv6 src, IPv6 dst, ToS, TTL, L4 dst
  • Output
  • Drop

Flow dissection

Offloaded flow functionality is very similar to what hardware switches/routers do with ASIC chipsets. Once a routing/forwarding decision is made and written to the ASIC, the ASIC takes care of matching and acting on incoming packets and provides wire-speed processing. Routers/switches provide access to an ASIC shell for inspecting table entries when that sort of debugging is required.

Take a look at the Broadcom chipset output below, taken from a switch running Cumulus Linux.

root@dni-7448-26:~# cl-bcmcmd l2 show

mac=00:02:00:00:00:08 vlan=2000 GPORT=0x2 modid=0 port=2/xe1

mac=00:02:00:00:00:09 vlan=2000 GPORT=0x2 modid=0 port=2/xe1 Hit

However, NIC vendors do not provide such access, leaving troubleshooters to guess what might be wrong in the NIC hardware when a flow is not offloaded.

With this limitation, operators are left to trust OVS and TC to ensure that flows were formulated in accordance with the forwarding policies and pushed to hardware.

Let's dissect a sample flow taken from the OVS datapath table and try to understand it.
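One way to obtain such a dump on a recent OVS is to ask for the offloaded datapath flows with port names resolved:

$ sudo ovs-appctl dpctl/dump-flows --names -m type=offloaded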

Below is a transactional pair of flows: in the 1st flow, a packet from the outside network hits the bond interface (the bonded PFs) and is pushed to the representor port (eventually ingressing to the VM), while in the 2nd flow, a packet from the representor port (egressing from the VM) goes out via the bond.

ufid:6ef8818a-17df-4714-9010-1203cbc163c9, skb_priority(0/0),skb_mark(0/0),in_port(bond-pf),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:51:c3:7c,dst=f8:f2:1e:03:bf:e4),eth_type(0x8100),vlan(vid=398,pcp=0),encap(eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no)), packets:11, bytes:1078, used:0.080s, offloaded:yes, dp:tc, actions:pop_vlan,eth3
ufid:724ad8e3-2f49-4aa9-8ea7-b5331fba2a1f, skb_priority(0/0),skb_mark(0/0),in_port(eth3),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:e4,dst=fa:16:3e:51:c3:7c),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:11, bytes:1122, used:0.080s, offloaded:yes, dp:tc, actions:push_vlan(vid=398,pcp=0),bond-pf

Ufid: unique flow identifier 

Skb_priority: this qdisc field is not written to hardware. It has a mask value of /0, which makes it ignorable when processing packets.

Skb_mark: this qdisc field is not written to hardware. It has a mask value of /0, which makes it ignorable when processing packets.

In_port: port of incoming packet

Packet_type: there are many packet types supported by OVS. However, while offloading, this field is not written to hardware, and its /0 mask makes it irrelevant when processing packets.

Eth_type: offload supports all eth_types, and this field is written to hardware. However, for eth_types other than VLAN and MPLS, the inner fields are not used for matching against the packet; forwarding is decided by eth_type alone. I.e. if eth_type=0x0806, the ARP headers are not used for matching.

Eth (src,dst): Layer 2 Ethernet source and destination MAC addresses.

Vlan: VLAN ID and PCP (priority code point).

ipv4(src/dst): since the mask here is /0, these fields are ignored, and packets with any IPv4 src/dst are processed by this flow.

ipv4(proto,tos,ttl,frag): protocol of the next header, ToS byte, time to live, and fragmentation state.

Statistics: the packet and byte counters here are pulled from the interfaces.

Used: time since the flow was last hit by a packet.

Offloaded: OVS-side confirmation that the flow is offloaded.

dp: tc – the datapath used here is TC.

Actions:

Pop_vlan: remove the VLAN header

Output: send to the egress interface

Let's take a look at a flow with ARP.

ufid:31ec10fa-ffdc-454b-867f-f673dae77ea6, skb_priority(0/0),skb_mark(0/0),in_port(bond-pf),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:51:c3:7c,dst=f8:f2:1e:03:bf:e4),eth_type(0x8100),vlan(vid=398,pcp=0),encap(eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00)), packets:1, bytes:56, used:4.950s, offloaded:yes, dp:tc, actions:pop_vlan,eth3
ufid:7329344d-b5f9-4b03-b26a-a9c7318b24b7, skb_priority(0/0),skb_mark(0/0),in_port(eth3),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:e4,dst=fa:16:3e:51:c3:7c),eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), packets:0, bytes:0, used:5.150s, offloaded:yes, dp:tc, actions:push_vlan(vid=398,pcp=0),bond-pf

Here, the entire ARP header (sender IP/hardware address, target IP/hardware address) is masked with 0, so only eth_type=0x0806 is used to make decisions, and that is what is offloaded to hardware.

Below is an offloaded flow with VxLAN.

ufid:d56153f3-f8c7-4e8a-a6ce-3ddeb73cf27a, skb_priority(0/0),tunnel(tun_id=0x3d,src=10.10.147.101,dst=10.10.147.104,ttl=0/0,tp_dst=4789,flags(+key)),skb_mark(0/0),in_port(vxlan_sys_4789),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:45:a9:e2,dst=f8:f2:1e:03:bf:e6),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:19, bytes:1862, used:0.080s, offloaded:yes, dp:tc, actions:eth0
ufid:3bcda9c2-2d36-4692-82c2-aa631f015560, skb_priority(0/0),skb_mark(0/0),in_port(eth0),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:e6,dst=fa:16:3e:45:a9:e2),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0x3,ttl=0/0,frag=no), packets:18, bytes:2736, used:0.090s, offloaded:yes, dp:tc, actions:set(tunnel(tun_id=0x3d,src=10.10.147.104,dst=10.10.147.101,ttl=64,tp_dst=4789,flags(key))),vxlan_sys_4789

The tunnel type depends on the encapsulation type of the interface; VXLAN, NVGRE, RAW and GENEVE are supported.

flags(key) are ignored and not offloaded to hardware.  

System validations

  • Ensure SR-IOV and VT-d are enabled on the system. 
  • Enable IOMMU in Linux by adding intel_iommu=on to kernel parameters, for example, using GRUB.
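On a RHEL-family host, that could look like the following (iommu=pt is a common optional addition for passthrough setups; paths may differ per distro):

$ sudo vi /etc/default/grub          # append intel_iommu=on iommu=pt to GRUB_CMDLINE_LINUX
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot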

Debugs:

  • OVS debugs are currently not very helpful; debuggability of flow programming has, however, been improved in more recent versions. We can still enable debug on OVS as below.

$ ovs-appctl vlog/set dpif_netlink:file:DBG

Logs are printed to /var/log/openvswitch/ovs-vswitchd.log.

For example, the following is a capture of a failed flow-programming attempt.

# less /var/log/openvswitch/ovs-vswitchd.log | grep -Ei "ERR|syndrome"

2019-02-19T12:48:46.126Z|00001|dpif_netlink(handler1)|ERR|failed to offload flow: Operation not supported

Checking the ovsdb logs in parallel can also be helpful. They are available at /var/log/openvswitch/ovsdb-server.log.

Mellanox provides a system information script which captures all the required information from the system. It is available at

https://github.com/Mellanox/linux-sysinfo-snapshot/blob/master/sysinfo-snapshot.py

And can be run as 

# ./sysinfo-snapshot.py --asap --asap_tc --ibdiagnet --openstack


CLIs:

$ ovs-dpctl dump-flows -m type=offloaded

$ tc filter show dev ens1_0 ingress

$ tc -s filter show dev ens1_0 ingress

$ tc monitor

Ethtool provides more insights. 

To view the number of channels: ethtool -l <uplink representor>

To check statistics: ethtool -S <uplink/VFs>

To see driver information: ethtool -i <uplink rep>

To check ring sizes: ethtool -g <uplink rep>

To list enabled features: ethtool -k <uplink/VFs>

In parallel, we can check the traffic flow using tcpdump on the representor and PF ports.
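For example, to watch ICMP traffic with link-level headers on a representor port (eth3 is an example name):

$ sudo tcpdump -nnei eth3 icmp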

Notes

  • Admin operations on a VF's link state are controlled via the VF representor's link state; that is, any change to the link state of the representor port affects the link state of the VF.
  • Representor port statistics also present the statistics of the corresponding VF.
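For instance, taking a representor port down also takes the corresponding VF's link down (eth3 here is an example representor name):

$ sudo ip link set eth3 down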

Limitations:

  • The "openvswitch" firewall driver cannot be used with offload, because the connection-tracking properties of the flows are not supported in the offload path. The default "iptables_hybrid" or "none" is supported. Patches to support connection tracking have been merged into OVS 2.13.
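For reference, the firewall driver is selected in the OVS agent's ML2 configuration (the file path shown is a typical location, e.g. /etc/neutron/plugins/ml2/openvswitch_agent.ini):

[securitygroup]
firewall_driver = iptables_hybrid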

