Measuring TCP Latency Using eBPF: Part 2 - Kernel Space

Sun Sep 08 2024

This is Part 2 in a series of posts describing the implementation of tcplat, a Go program which incorporates eBPF to passively measure the latency of TCP flows.

In this post, we’ll review the method by which we can passively measure TCP latency, and then implement the kernel space component of our program.

All the code for this part is located in this directory.

Measuring Latency

During a TCP connection, the TCP header for each packet supplies information we can use to measure the latency of a TCP connection: within this header, it’s possible to set eight different control bits which are used to indicate the state that a particular connection is in.

These control bits are used in the three-way handshake TCP uses for connection establishment:

  1. The client first sends a packet which has the SYN control bit set, indicating it wants to establish a connection
  2. The server replies with a packet which has both the SYN and ACK bits set, accepting the client’s request
  3. Subsequent TCP packets between the client and server only have the ACK bit set

SYN SYN/ACK Connection

We can utilise this handshake process to calculate the latency of a connection: we measure the time taken to receive a SYN-ACK packet in response to sending out a SYN packet. Performing this measurement requires observability into all packets leaving and entering our machine, (known as egressing and ingressing respectively), so that we can identify SYN/SYN-ACK pairs — this is exactly what eBPF enables us to do.

Additionally, we’ve got one more question to answer — how do we identify TCP packets related to a particular connection?

Well, given a TCP packet, we can identify its related TCP connection using its four-tuple, which is defined as (source IP address, source port, destination IP address, destination port).

An example SYN/SYN-ACK pair is shown below. Notice how the pair has swapped source and destination IP addresses and ports.

Packet    Src. IP        Src. Port   Dst. IP        Dst. Port
SYN       192.168.0.1    49152       10.41.0.1      443
SYN-ACK   10.41.0.1      443         192.168.0.1    49152
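
To make the pairing concrete, here is a minimal sketch of the matching logic, (this is not part of tcplat; the four_tuple struct and is_reply_to helper are hypothetical names used purely for illustration): a SYN-ACK belongs to a stored SYN if its four-tuple is the SYN's four-tuple with source and destination swapped.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical four-tuple, for illustration only.
struct four_tuple {
    uint8_t  src_ip[16];
    uint8_t  dst_ip[16];
    uint16_t src_port;
    uint16_t dst_port;
};

// A SYN-ACK replies to a SYN if its source matches the SYN's
// destination, and vice versa.
static bool is_reply_to(const struct four_tuple *syn_ack,
                        const struct four_tuple *syn) {
    return memcmp(syn_ack->src_ip, syn->dst_ip, 16) == 0 &&
           memcmp(syn_ack->dst_ip, syn->src_ip, 16) == 0 &&
           syn_ack->src_port == syn->dst_port &&
           syn_ack->dst_port == syn->src_port;
}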

System Overview

An example eBPF program could be structured as follows:

  1. Intercept all egressing and ingressing packets
  2. Note the time seen for each TCP packet which is a SYN or SYN-ACK, including its four-tuple data
  3. Store this information in an eBPF map so that data can persist between successive calls to the program
  4. If a SYN-ACK packet is intercepted, check if it has a corresponding SYN entry in the map. If it does, then calculate the latency and print it to the trace_pipe file

eBPF-only Program Structure Diagram

This is completely valid, but we’re going to take a slightly different approach — instead of reading from the eBPF map and performing the SYN/SYN-ACK matching in the kernel, we’ll read from the map and perform the matching in a Go program which runs in user space. This is for illustration purposes, so we can take a look at how user space programs can interact with maps.

We’re going to adopt the following program structure:

  1. A kernel space eBPF program intercepts all egressing and ingressing packets
  2. Note the time seen for each TCP packet which is a SYN or SYN-ACK, including its four-tuple data
  3. Store this information in an eBPF map so that data can persist between successive calls to the program
  4. A user space Go program reads from the map, parses the metadata and calculates the latency if it finds a SYN/SYN-ACK pair

Hybrid Program Structure Diagram

Now that we’ve decided on how to structure our program, let’s take a quick look at the Linux kernel’s networking stack, and decide where to attach our eBPF program. There are two specific subsystems which we may want to consider hooking into.

  1. eXpress Data Path (XDP) — XDP is the lowest layer of the Linux kernel networking stack. It allows packets to be processed before the OS allocates memory for them, avoiding the overhead imposed by processing further up the stack, which makes it the subsystem capable of the highest packet-processing throughput. This makes it especially suitable for tasks such as packet filtering, load balancing, or DDoS mitigation.
  2. Traffic Control (TC) — TC is a kernel subsystem which is classically used for throttling or prioritising certain flows on specific interfaces; it is the kernel space counterpart of the user space utility tc. By the time a packet has reached TC, it has been allocated a socket buffer, (known as an sk_buff or skb), which imposes a memory allocation overhead.

Critically, both subsystems can be used for flow monitoring, meaning they are both valid for our use case. At first glance, XDP might appear to be the better choice given its efficiency; however, XDP is only able to process ingressing packets, unlike TC, which can intercept both ingressing and egressing packets. Because of this, we're going to hook into the TC subsystem instead.

Traffic Control (TC)

In Linux, each interface has a queue for both incoming (ingress) and outgoing (egress) packets.

Ingress Egress Queue

Using Traffic Control, we can control the behaviour of these queues in three different ways in order to manage traffic:

  1. Shaping — delaying egressing packets so that a flow conforms to a configured rate
  2. Scheduling — deciding the order in which queued packets are transmitted
  3. Policing — measuring ingressing traffic and dropping packets which exceed a configured rate

It is important to realise that we can only shape and schedule data that we send, not the data that we receive. This is because the ingress queue is unbuffered, in contrast to the egress queue, which is buffered. Since we want incoming packets to be processed as quickly as possible on arrival, we don't buffer on ingress: buffering every incoming packet would introduce both memory and processing overhead, which would become problematic in high-throughput environments where large volumes of traffic are ingressing.

Ingress Egress TC Actions

Traffic Control is split into the following three components:

  1. Queueing Discipline (qdisc) — an algorithm which manages the queue of an interface. One of the simplest qdiscs is pfifo, which implements a First-In-First-Out queue. Each qdisc may contain multiple classes
  2. Class — a class is a logical container for a specific type of traffic which contains its own traffic shaping policies. Classes are used to implement Quality of Service (QoS) policies, e.g. you can create one class for HTTP traffic, and another for VoIP, each with its own bandwidth allocation and priority
  3. Filters — filters dictate how packets which are enqueued onto the qdisc they’re attached to are handled. To do this, they’re composed of two sub-components:
    • Classifier — a mechanism which classifies a packet into one of the traffic classes attached to the qdisc
    • Action — an action to perform on a packet, based on a successful classification match. Common actions include (1) pass the packet to the matched class, (2) redirect the packet to a different interface, and (3) drop the packet

Qdisc and Filter
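
As an illustration of how these three components fit together, the following commands, (an example unrelated to tcplat, using an assumed interface eth0 and arbitrary rates), attach an HTB qdisc, create two classes with different bandwidth allocations, and add a u32 filter which classifies HTTPS traffic into the slower class:

> sudo tc qdisc add dev eth0 root handle 1: htb default 10
> sudo tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit
> sudo tc class add dev eth0 parent 1: classid 1:20 htb rate 10mbit
> sudo tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 443 0xffff flowid 1:20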

TC interprets the return codes of classifiers and actions which dictate how a packet should be processed.

For classifiers:

Code    Meaning
0       Mismatch; if more classifiers exist, they are run on the packet
-1      Packet classed using the qdisc's default class
Other   Packet sent to the class with the same identifier as the return code

For actions, (this list is non-exhaustive):

Code                  Meaning
TC_ACT_UNSPEC (-1)    Use the default action
TC_ACT_OK (0)         Allow the packet to proceed
TC_ACT_SHOT (2)       Drop the packet
TC_ACT_PIPE (3)       Iterate to the next action, if available

In terms of hooking into TC, eBPF programs can be attached as classifiers or actions.

For many use cases, (e.g. packet filtering), eBPF classifiers alone are enough to process packets and decide what action should be taken, (e.g. dropped or accepted). However, to actually drop the packet, classifiers require an additional action to perform the drop. This incurs additional overhead, due to the context-switching required, where the eBPF program returns control back to the kernel.

To fix this problem, TC was expanded so that eBPF TC classifiers could be attached in direct-action mode. This flag tells the system that the return value from an eBPF classifier should instead be interpreted as the return value of an action, meaning that additional actions do not need to be defined for a classifier to process a packet, since the classifier itself defines how the packet should be actioned.

Since the TC subsystem no longer needs to call into an additional action module external to the kernel, it avoids introducing additional latency and processing overhead into the classifier logic, making direct-action classifiers much more performant than their non-direct-action counterparts.

This makes direct-action the preferred way to attach eBPF TC classifiers, and this is what we will do to hook into TC.
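
For example, a minimal direct-action classifier which drops every packet could look like the sketch below, (drop_all is a hypothetical program, not part of tcplat; attaching it would break connectivity on the interface, so it's for illustration only). Because the program is attached in direct-action mode, its return value is interpreted directly as a TC action code.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int drop_all(struct __sk_buff *skb) {
    // In direct-action mode the classifier's return value is treated as
    // the action: TC_ACT_SHOT drops the packet, TC_ACT_OK lets it pass.
    return TC_ACT_SHOT;
}

char LICENSE[] SEC("license") = "GPL v2";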

Intercepting TCP Packets

Ok, now that we’ve got that out of the way, let’s begin writing the kernel space component of our system — an eBPF program which intercepts TCP packets, scrapes data relevant for calculating latency, and then pushes said data to user space for further processing.

In terms of the kernel, this program is going to be a TC direct-action classifier, meaning that it is executed per packet and is responsible for returning TC action return codes which dictate how Traffic Control should process the packet.

When executed on each packet, the packet’s information is presented to the classifier via the __sk_buff structure, otherwise known as an skb. These refer to Linux kernel constructs called socket buffers.

Defining a Storage Format

Before we begin writing a program to intercept TCP packets, let's decide what the structure we'll use to store the per-packet information scraped from each skb will look like.

First, we need to store the timestamp the packet was seen so that we can calculate the duration between two packets — this can be stored in a uint64_t.

uint64_t timestamp;

Next, we need a way to determine whether the packet contained a SYN or a SYN-ACK so that we can correlate it to an outgoing or incoming request/response for TCP connection establishment. This can be done using booleans.

bool syn;
bool ack;

Finally, we need data to uniquely identify a TCP connection, i.e. its four-tuple.

struct in6_addr src_ip;
struct in6_addr dst_ip;
__be16          src_port;
__be16          dst_port;

For the TCP connection data, the source and destination ports are specified as big-endian numbers; this is because network protocols use big-endian (network) byte order.

Additionally, instead of using separate fields for either IPv4 or IPv6 addresses, only IPv6 addresses are used. This is possible due to a specific feature of the IPv6 specification, called “IPv4-Mapped IPv6 Address”, which enables us to embed IPv4 addresses into IPv6 ones.

When embedding, the IPv4 address is specified in the last 32 bits of the 128-bit IPv6 address.

|                80 bits               | 16 |      32 bits        |
+--------------------------------------+--------------------------+
|0000..............................0000|FFFF|    IPv4 address     |
+--------------------------------------+----+---------------------+

Conveniently, IPv6 addresses are defined using C unions, which gives us three different ways to manipulate the underlying backing array allocated for an in6_addr:

struct in6_addr {
    union {
        __u8   u6_addr8[16];
        __be16 u6_addr16[8];
        __be32 u6_addr32[4];
    } in6_u;
};

This makes it trivial to embed an IPv4 address into an IPv6 one, since we can directly refer to the 16-bit and 32-bit fields used to denote that the address is embedded.

struct in6_addr ip6 = {0};
ip6.in6_u.u6_addr32[3] = some_ip4_addr;
ip6.in6_u.u6_addr16[5] = 0xffff;

Since we use {0} syntax, our in6_addr is initialised to zero instead of containing garbage values, meaning we don’t have to set the first 80 bits of the address to zero.
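
For completeness, here is a small user space sketch of the reverse operation, (this isn't part of the kernel program, and the helper names are purely illustrative): checking whether an in6_addr holds an IPv4-mapped address, and extracting the embedded IPv4 address.

#include <stdbool.h>
#include <stdint.h>
#include <linux/in6.h>

// Returns true if the address uses the ::ffff:a.b.c.d mapped form.
static bool is_ipv4_mapped(const struct in6_addr *addr) {
    return addr->in6_u.u6_addr32[0] == 0 &&
           addr->in6_u.u6_addr32[1] == 0 &&
           addr->in6_u.u6_addr16[4] == 0 &&
           addr->in6_u.u6_addr16[5] == 0xffff;
}

// Extracts the embedded IPv4 address, still in network byte order.
static uint32_t embedded_ipv4(const struct in6_addr *addr) {
    return addr->in6_u.u6_addr32[3];
}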

All of this is combined to form the packet_t structure.

struct packet_t {
    struct in6_addr src_ip;
    struct in6_addr dst_ip;
    __be16          src_port;
    __be16          dst_port;
    bool            syn;
    bool            ack;
    uint64_t        timestamp;
};

Socket Buffer Non-Linearity

There’s one more thing we need to bear in mind before we begin scraping skb data.

skbs contain a single linear buffer, and a set of zero or more page buffers. The memory location of the linear buffer is specified by the skb.data and skb.data_end pointers — this buffer is important because it's the only one which eBPF programs may read from and write to. The total size of the linear buffer plus the page buffers (if applicable) is given by the skb->len field.

Unfortunately, the packet data, (i.e. the headers plus the payload), is not guaranteed to be stored exclusively in the linear buffer. This means that in some cases our desired header data may be stored in inaccessible page buffers; this is known as a non-linear skb.

Consider the following example for a TCP packet where the TCP header and the payload are inaccessible.

    skb.data --> +-----------------+
                 | Ethernet Header |
                 +-----------------+
                 |    IP Header    |
skb.data_end --> +-----------------+
                 |   TCP Header    |
                 +-----------------+
                 |     Payload     |
                 +-----------------+

eBPF provides a handy helper function to resolve this issue — the bpf_skb_pull_data helper.

Pull in non-linear data in case the skb is non-linear and not all of len are part of the linear section. Make len bytes from skb readable and writable.

— bpf-helpers, Linux manual page

So, using this function, we can ask the kernel to make len bytes from the skb readable and writable; the kernel takes care of pulling in additional data from the non-linear parts of the skb if needed. It returns a negative number on failure, for example if it's unable to pull in the required number of bytes, in which case we abort processing of the packet.

We can actually write a quick program to visualise the expansion of our skb, in which we always pull in all of the non-linear data. This program will sit in a file called skb_pull_vis.bpf.c, and we use SEC("tc") to specify that it's a TC classifier program.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int skb_pull_vis(struct __sk_buff *skb) {
    bpf_printk("(Before) Linear: %d, Total: %d", \
            skb->data_end - skb->data, skb->len);

    if (bpf_skb_pull_data(skb, skb->len) < 0)
        return TC_ACT_OK;

    bpf_printk("(After) Linear: %d, Total: %d", \
            skb->data_end - skb->data, skb->len);

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL v2";

Compiling this program follows the exact same steps as before.

> clang -Wall -O2 -target bpf -c skb_pull_vis.bpf.c -o skb_pull_vis.bpf.o

However, loading and attaching it is a more involved process.

  1. We must attach a qdisc to the interface which we want to intercept traffic for
  2. We then attach our eBPF program as a classifier for the qdisc in direct-action mode

In our case, we are going to use the clsact qdisc.

clsact is the recommended qdisc for direct-action eBPF classifiers because:

  1. It exposes both an ingress and an egress hook point, so a single qdisc covers traffic in both directions
  2. It performs no actual queueing; it exists purely to provide attach points for classifiers, so it adds no scheduling overhead

Let’s first list the interfaces on our machine, so we can decide on which one we should attach to, (the following command requires the iproute2 package).

> ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 7a:93:35:e0:e1:85 brd ff:ff:ff:ff:ff:ff

We want to attach the clsact qdisc to any interface which has traffic flowing over it, so that our classifier is triggered and the traces we defined are produced. For the machine which I’m currently developing on, this would be the enp0s1 interface, so we will attach to it.

> sudo tc qdisc add dev enp0s1 clsact

Next, clsact offers two hook points — ingress and egress — which we can attach our classifier to. Note that we must also specify the section in the object file that our classifier is stored in, so that the tc utility knows where to load the classifier from.

> sudo tc filter add dev enp0s1 ingress bpf direct-action obj skb_pull_vis.bpf.o sec tc
> sudo tc filter add dev enp0s1 egress bpf direct-action obj skb_pull_vis.bpf.o sec tc
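
If you want to confirm that the classifiers were attached, (optional, this step isn't required for anything that follows), you can list the filters on each hook:

> sudo tc filter show dev enp0s1 ingress
> sudo tc filter show dev enp0s1 egress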

Just like before, we can view the generated trace.

> sudo cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0       [000] ..s2. 46791.594477: bpf_trace_printk: (Before) Linear: 126, Total: 126
<idle>-0       [000] ..s2. 46791.594523: bpf_trace_printk: (After) Linear: 126, Total: 126
  sshd-1026    [001] b..1. 46791.594566: bpf_trace_printk: (Before) Linear: 54, Total: 554
  sshd-1026    [001] b..1. 46791.594567: bpf_trace_printk: (After) Linear: 554, Total: 554

The trace shows us two things about the behaviour of bpf_skb_pull_data with len set to skb->len:

  1. If all our data is stored in the linear buffer, then bpf_skb_pull_data is a no-op
  2. If some data is stored in non-linear page buffers, then it’s all pulled in to the linear buffer, and the kernel takes care of updating the skb->data_end pointer to point to the end of the expanded buffer

Finally, to unload our eBPF classifier we simply delete the qdisc it’s attached to.

> sudo tc qdisc del dev enp0s1 clsact

Reading the Socket Buffer’s Data

Now, let’s populate our packet_t structure for a given skb.

We're going to start with a simple skeleton program called tcplat, and iteratively develop its functionality. (Don't worry too much about which headers need to be included; at the end of this section I will display our final eBPF program, which will include all relevant headers.)

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tcplat(struct __sk_buff *skb) {
    // New code goes here...
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL v2";

Before we begin reading the socket buffer’s data, there’s one thing we should bear in mind.

The earlier we can abort processing of invalid packets, the more efficient our program is. Since we're intercepting TCP/IP packets, we'd like to ignore all packets which don't use IPv4 or IPv6 as their Layer 3 protocol. The skb tells us this information via the skb->protocol field, which identifies the L3 protocol of the packet as one of the ETH_P_* values defined in the uapi/linux/if_ether.h file.

So, the first thing we’re going to do is abort processing all packets which don’t use IPv4/IPv6 as their L3 protocol, and hand control back to the kernel.

However, we can't immediately compare the ETH_P_* values defined in our host system's header file with the skb->protocol field because they have different endianness, (host byte-order and network byte-order respectively). To fix this, we can use the bpf_ntohs helper function, which converts data from network byte-order (big-endian) to host byte-order (little-endian on most common architectures).

uint32_t host_protocol = bpf_ntohs(skb->protocol);
if (host_protocol != ETH_P_IP && host_protocol != ETH_P_IPV6)
    return TC_ACT_OK;

Now that we’re guaranteed to be processing IPv4/IPv6 packets, we want to ensure that both the IP and TCP headers are present in the linear part of the skb, and pull them in if not — we populate our packet_t with information from these headers.

On failure, we hand control back to the kernel.

(Also, just remember that at this stage, we’re not guaranteed that our skb will actually contain a TCP header, even if its linear buffer contains the minimum amount of data required to parse one.)

uint32_t ip_header_len = (host_protocol == ETH_P_IP) ? \
    sizeof(struct iphdr) : sizeof(struct ipv6hdr);
uint32_t total_header_len = \
    sizeof(struct ethhdr) + ip_header_len + sizeof(struct tcphdr);
if (bpf_skb_pull_data(skb, total_header_len) < 0)
    return TC_ACT_OK;

Next, in order to begin reading the headers, we must appease the eBPF verifier by performing a bounds check to ensure that the skb's linear buffer is large enough to accommodate our headers. Otherwise, the eBPF verifier will reject our program due to potential out-of-bounds memory access, which could lead to reading from incorrect memory areas or cause the program to crash.

Computing the buffer boundary pointers is a two-step process:

  1. We cast skb->data, which is defined as a __u32, to uint64_t to make it match the pointer size of the architecture we are currently developing on (64-bit). The extra 32 bits are initialised to zero
  2. We then cast the uint64_t to a uint8_t pointer. We specifically use a uint8_t instead of a void pointer because we are going to perform pointer arithmetic on it, which is invalid on void pointers in standard C unless extensions are used

uint8_t *head = (uint8_t*)(uint64_t)skb->data;
uint8_t *tail = (uint8_t*)(uint64_t)skb->data_end;
if (head + total_header_len > tail)
    return TC_ACT_OK;

Before we continue further processing of the headers, let’s take a moment to allocate our packet onto the stack.

struct packet_t pkt = {0};

At this stage, we can populate pkt with the source and destination IP addresses. We also take this chance to abort processing the packet if the protocol that the IP packet is encapsulating is not TCP.

struct iphdr *ip;
struct ipv6hdr *ip6;
switch (host_protocol) {
    case ETH_P_IP:
        ip = (struct iphdr*) (head + sizeof(struct ethhdr));
        if (ip->protocol != IPPROTO_TCP)
            return TC_ACT_OK;

        pkt.src_ip.in6_u.u6_addr32[3] = ip->saddr;
        pkt.dst_ip.in6_u.u6_addr32[3] = ip->daddr;
        pkt.src_ip.in6_u.u6_addr16[5] = 0xffff;
        pkt.dst_ip.in6_u.u6_addr16[5] = 0xffff;

        break;
    case ETH_P_IPV6:
        ip6 = (struct ipv6hdr*) (head + sizeof(struct ethhdr));
        if (ip6->nexthdr != IPPROTO_TCP)
            return TC_ACT_OK;

        pkt.src_ip = ip6->saddr;
        pkt.dst_ip = ip6->daddr;

        break;
};

We’ve guaranteed that we’re processing a TCP packet, so let’s scrape the rest of the information. We make sure to only scrape data for packets which are SYN or SYN-ACK.

struct tcphdr *tcp = \
    (struct tcphdr*) (head + sizeof(struct ethhdr) + ip_header_len);
if (tcp->syn) {
    pkt.src_port = tcp->source;
    pkt.dst_port = tcp->dest;
    pkt.syn = tcp->syn;
    pkt.ack = tcp->ack;
    pkt.timestamp = bpf_ktime_get_ns();
}

So, the final step in our processing pipeline is to send the pkt structure to a user space program; to do this, we will push the packet data to an eBPF map.

The BPF Ring Buffer map is a Multi-Producer Single-Consumer (MPSC) queue that fits our use case pretty well:

  1. Instances of our classifier running on different CPUs can all produce events into a single shared buffer
  2. Events are delivered to the consumer in the order they were submitted
  3. A user space program can consume events asynchronously, being notified of new data via epoll

We are going to use BTF style to define our map, which we call pipe.

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024);
} pipe SEC(".maps");

The map definition syntax is structured as follows:

  1. type — the kind of eBPF map being declared, in this case BPF_MAP_TYPE_RINGBUF
  2. max_entries — for a ring buffer, the size of the buffer in bytes, which must be a power of two and a multiple of the page size

Note that __uint(name, val) is a macro provided by the bpf/bpf_helpers.h file.


Let’s take a really quick aside for a second — if we expand the macros used to define our map, it evaluates to the following.

struct {
    int (*type)[BPF_MAP_TYPE_RINGBUF];
    int (*max_entries)[512 * 1024];
} pipe SEC(".maps");

This seems slightly nonsensical: why does the type field point to an integer array of length BPF_MAP_TYPE_RINGBUF?

So, if you pause for a second and take a further look on the Internet at how BTF-style maps are defined, you might be further confused. This is because many examples define them using a different syntax, which seems more sensible.

struct {
    int type;
    int max_entries;
    // Some fields omitted...
} pipe SEC(".maps") = {
    .type = BPF_MAP_TYPE_RINGBUF,
    .max_entries = 512 * 1024,
};

As it turns out, the map parameters are encoded in the type information which the compiler embeds into the ELF file, and libbpf decodes this at load time, conceptually converting the first structure into the second one. This is what makes our evaluated macro valid, nice.

Ok, aside done.


Well, now that we've defined our map, we can push data to it using the bpf_ringbuf_output helper function, where some user space program is expected to consume the data asynchronously.

bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0);
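
As an aside, bpf_ringbuf_output returns a negative value if the reservation fails, (e.g. when the buffer is full), and the ring buffer also offers a reserve/submit API which writes the data in place instead of copying a stack-allocated structure. A minimal sketch of that alternative, (not what we use here), looks like this:

// Alternative: reserve space directly in the ring buffer, fill it in
// place, then submit it. bpf_ringbuf_reserve returns NULL on failure.
struct packet_t *slot = bpf_ringbuf_reserve(&pipe, sizeof(*slot), 0);
if (slot) {
    *slot = pkt;
    bpf_ringbuf_submit(slot, 0);
}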

And this concludes our kernel space implementation — you can view the full code for our eBPF program here.

What’s Next?

Thank you for reading so far! Please let me know if you have any feedback or corrections.

If you’re looking for more TC reading, you should check out:

Now you should be ready to move on to Part 3, where we implement the Go program which reads from the pipe map and performs the latency calculation.
