Sun Sep 08 2024
This is Part 2 in a series of posts describing the implementation of tcplat, a Go program which incorporates eBPF to passively measure the latency of TCP flows.
In this post, we’ll review the method by which we can passively measure TCP latency, and then implement the kernel space component of our program.
All the code for this part is located in this directory.
During a TCP connection, the TCP header for each packet supplies information we can use to measure the latency of a TCP connection: within this header, it’s possible to set eight different control bits which are used to indicate the state that a particular connection is in.
These control bits are used in the three-way handshake TCP uses for connection establishment:
1. The client sends a packet with the SYN control bit set, indicating it wants to establish a connection
2. The server replies with a packet with the SYN and ACK bits set, accepting the client’s request
3. The client responds with a packet with the ACK bit set

We can utilise this handshake process to calculate the latency of a connection: we measure the time taken to receive a SYN-ACK packet in response to sending out a SYN packet.
Performing this measurement requires observability into all packets leaving and entering our machine, (known as egressing and ingressing respectively), so that we can identify SYN/SYN-ACK pairs — this is exactly what eBPF enables us to do.
Additionally, we’ve got one more question to answer — how do we identify TCP packets related to a particular connection? Well, given a TCP packet, we can identify its related TCP connection using its four-tuple, which is defined as (source IP address, source port, destination IP address, destination port).
An example SYN/SYN-ACK pair is shown below — notice how the pair has inverse source and destination IP addresses and ports.
| Packet  | Src. IP     | Src. Port | Dst. IP     | Dst. Port |
|---------|-------------|-----------|-------------|-----------|
| SYN     | 192.168.0.1 | 49152     | 10.41.0.1   | 443       |
| SYN-ACK | 10.41.0.1   | 443       | 192.168.0.1 | 49152     |
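To make the pairing and latency calculation concrete, here’s a small illustrative sketch in C — the struct and helper names are hypothetical and not part of tcplat’s code, (in tcplat itself this matching happens in the user space Go program covered in Part 3). A SYN-ACK belongs to a SYN when its four-tuple is the SYN’s reversed, and the latency is simply the gap between the two timestamps.

#include <stdbool.h>
#include <stdint.h>

// Hypothetical record for an observed packet: its IPv4 four-tuple plus the
// time (in nanoseconds) at which it was seen.
struct observed_packet {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint64_t timestamp_ns;
};

// A SYN-ACK answers a SYN when its four-tuple is the SYN's, reversed.
static bool is_reply_to(const struct observed_packet *syn,
                        const struct observed_packet *syn_ack) {
    return syn->src_ip == syn_ack->dst_ip && syn->dst_ip == syn_ack->src_ip &&
           syn->src_port == syn_ack->dst_port && syn->dst_port == syn_ack->src_port;
}

// The connection's latency is the time between seeing the SYN and its SYN-ACK.
static uint64_t latency_ns(const struct observed_packet *syn,
                           const struct observed_packet *syn_ack) {
    return syn_ack->timestamp_ns - syn->timestamp_ns;
}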
An example eBPF program could be structured as the following:

1. Each time a packet is intercepted, store an entry in an eBPF map if the packet is a SYN or SYN-ACK, including its four-tuple data
2. When a SYN-ACK packet is intercepted, check if it has a corresponding SYN entry in the map. If it does, then calculate the latency and print it to the trace_pipe file

This is completely valid, but we’re going to take a slightly different
approach — instead of reading from the eBPF map and performing the
SYN
/SYN-ACK
matching in the kernel, we’ll read from the map and
perform the matching in a Go program which runs in user space. This is
for illustration purposes, so we can take a look at how user space programs
can interact with maps.
We’re going to adopt the following program structure:

1. Kernel space: an eBPF program which pushes data to an eBPF map for every intercepted packet that is a SYN or SYN-ACK, including its four-tuple data
2. User space: a Go program which reads from the map and calculates the latency for each SYN/SYN-ACK pair

Now that we’ve decided on how to structure our program, let’s take a quick look at the Linux kernel’s networking stack, and decide where to attach our eBPF program. There are two specific subsystems which we may want to consider hooking into.
- XDP (eXpress Data Path) — hooks in at the earliest possible point, before the kernel has allocated a socket buffer for the packet, which makes it very efficient.
- Traffic Control (tc) — by the time a packet has reached TC, it has been allocated a socket buffer, (known as an sk_buff or skb), leading to a memory allocation overhead.

Critically, both subsystems can be used for flow monitoring, meaning they are both valid for our use case. At first glance, XDP might appear to be the better choice given its efficiency, however, XDP is only able to process packets which are ingressing, unlike TC which can also intercept egressing packets. Because of this, we’re going to hook into the TC subsystem instead.
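For a sense of what the alternative would have looked like, here’s a minimal XDP skeleton, (a sketch for contrast only — we won’t be using XDP): it receives an xdp_md context rather than a socket buffer, returns XDP_* verdicts, and only ever sees ingressing packets.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// XDP programs run before the kernel allocates an sk_buff for the packet,
// and their return value is an XDP_* verdict; XDP_PASS hands the packet
// on to the normal networking stack.
SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx) {
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL v2";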
In Linux, each interface has a queue for both incoming (ingress) and outgoing (egress) packets.
Using Traffic Control, we can control the behaviours of these queues in three different ways in order to manage traffic:
It is important to realise that we can only shape and schedule data that we send, and not the data that we receive. This is because the ingress queue is unbuffered, (in contrast to the egress queue which is buffered). Since we want packets to be processed as quickly as possible on arrival, we don’t buffer on ingress, because buffering every incoming packet would introduce overhead in terms of both memory and processing, which would become problematic in high-throughput environments where large volumes of traffic are ingressing.
Traffic Control is split into the following three components:
- Queueing discipline (qdisc) — an algorithm which manages the queue of an interface. One of the simplest qdisc’s is pfifo, which implements a First-In-First-Out queue. Each qdisc may contain multiple classes
- Classes — allow different types of traffic to be treated differently within a qdisc
- Filters — determine how packets entering the qdisc they’re attached to are handled. To do this, they’re composed of two sub-components: classifiers, which match packets against a set of rules, and actions, which are applied to matched packets, (e.g. dropping them or sending them to a class within the qdisc)
TC interprets the return codes of classifiers and actions which dictate how a packet should be processed.
For classifiers:
| Code  | Meaning                                                         |
|-------|-----------------------------------------------------------------|
| 0     | Mismatch, if more classifiers exist they are run on the packet   |
| -1    | Packet classed using the qdisc’s default class                   |
| Other | Packet sent to class with same identifier as the return code     |
For actions, (this list is non-exhaustive):
| Code               | Meaning                                  |
|--------------------|------------------------------------------|
| TC_ACT_UNSPEC (-1) | Use the default action                   |
| TC_ACT_OK (0)      | Allow the packet to proceed              |
| TC_ACT_SHOT (2)    | Drop the packet                          |
| TC_ACT_PIPE (3)    | Iterate to the next action if available  |
In terms of hooking into TC, eBPF programs can be attached as classifiers or actions.
For many use cases, (e.g. packet filtering), eBPF classifiers alone are enough to process packets and decide what action should be taken, (e.g. dropped or accepted). However, to actually drop the packet, classifiers require an additional action to perform the drop. This incurs additional overhead, due to the context-switching required, where the eBPF program returns control back to the kernel.
To fix this problem, TC was expanded so that eBPF TC classifiers could be attached in
direct-action
mode. This flag tells the system that the return value from an eBPF
classifier should instead be interpreted as the return value for an action, meaning that
additional actions do not need to be defined for a classifier to process a packet, since
the classifier itself defines how the packet should be actioned.
Since the TC subsystem no longer needs to call into an additional action module external
to the kernel, it avoids introducing additional latency and processing overhead into the
classifier logic, making direct-action
classifiers much more performant than their
non-direct-action
counterparts.
This makes direct-action
the preferred way to attach eBPF TC classifiers, and this is what
we will do to hook into TC.
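As a tiny sketch, (illustrative only, not part of tcplat), a direct-action classifier that drops every packet it sees would look like the following — because of the direct-action flag, its return value is interpreted directly as the action verdict, with no separate action module needed.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

// With direct-action, the TC_ACT_* value we return is the final verdict
// for the packet — here, every packet is dropped.
SEC("tc")
int drop_all(struct __sk_buff *skb) {
    return TC_ACT_SHOT;
}

char LICENSE[] SEC("license") = "GPL v2";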
Ok, now that we’ve got that out of the way, let’s begin writing the kernel space component of our system — an eBPF program which intercepts TCP packets, scrapes data relevant for calculating latency, and then pushes said data to user space for further processing.
In terms of the kernel, this program is going to be a TC direct-action
classifier,
meaning that it is executed per-packet, and it is responsible for returning TC action
return codes which dictate how Traffic Control should process the packet.
When executed on each packet, the packet’s information is presented to the classifier
via the __sk_buff
structure,
otherwise known as an skb
. These refer to Linux kernel constructs called
socket buffers.
Before we begin writing a program to intercept TCP packets, let’s decide what the structure we’ll use to store the per-packet information scraped from each skb should look like.
First, we need to store the timestamp the packet was seen so that we
can calculate the duration between two packets — this can be stored in
a uint64_t
.
uint64_t timestamp;
Next, we need a way to determine whether the packet contained a SYN
or a SYN-ACK
so that we can correlate it to an outgoing or incoming
request/response for TCP connection establishment. This can be done
using booleans.
bool syn;
bool ack;
Finally, we need data to uniquely identify a TCP connection, i.e. its four-tuple.
struct in6_addr src_ip;
struct in6_addr dst_ip;
__be16 src_port;
__be16 dst_port;
For the TCP connection data, the source and destination ports are specified as big-endian numbers; this is because network protocols transmit data in big-endian (network) byte order.
Additionally, instead of using separate fields for either IPv4 or IPv6 addresses, only IPv6 addresses are used. This is possible due to a specific feature of the IPv6 specification, called “IPv4-Mapped IPv6 Address”, which enables us to embed IPv4 addresses into IPv6 ones.
When embedding, the IPv4 address is specified in the last 32 bits of the 128-bit IPv6 address.
|                80 bits               | 16 |      32 bits        |
+--------------------------------------+--------------------------+
|0000..............................0000|FFFF|    IPv4 address     |
+--------------------------------------+----+---------------------+
Conveniently, IPv6 addresses are defined using C unions,
which gives us three different ways to manipulate the underlying backing array
allocated for an in6_addr
:
struct in6_addr {
union {
__u8 u6_addr8[16];
__be16 u6_addr16[8];
__be32 u6_addr32[4];
} in6_u;
};
This makes it trivial to embed an IPv4 address into an IPv6 one, since we can directly refer to the 16-bit and 32-bit fields used to denote that the address is embedded.
struct in6_addr ip6 = {0};
ip6.in6_u.u6_addr32[3] = some_ip4_addr;
ip6.in6_u.u6_addr16[5] = 0xffff;
Since we use {0}
syntax, our in6_addr
is initialised to zero instead
of containing garbage values, meaning we don’t have to set the first 80
bits of the address to zero.
All of this is combined to form the packet_t
structure.
struct packet_t {
struct in6_addr src_ip;
struct in6_addr dst_ip;
__be16 src_port;
__be16 dst_port;
bool syn;
bool ack;
uint64_t timestamp;
};
There’s one more thing we need to bear in mind before we begin scraping skb
data.
skb
’s contain a single linear buffer, and a set of zero or more
page buffers. The memory location of the linear buffer is specified by the
skb.data
and skb.data_end
pointers — it’s important because it’s the only
buffer which eBPF programs may read from and write to. The total size of the
linear buffer plus the page buffers (if applicable) is specified by the skb->len
field.
Unfortunately, the packet data, (i.e. the headers plus the payload), is not
guaranteed to be exclusively stored in the linear buffer. This means that in
some cases, our desired header data may be stored in inaccessible page buffers —
this is known as a non-linear skb.
Consider the following example for a TCP packet where the TCP header and the payload are inaccessible.
    skb.data --> +-----------------+
                 | Ethernet Header |
                 +-----------------+
                 |    IP Header    |
skb.data_end --> +-----------------+
                 |   TCP Header    |
                 +-----------------+
                 |     Payload     |
                 +-----------------+
eBPF provides a handy helper function to resolve this issue — the bpf_skb_pull_data
helper.
Pull in non-linear data in case the skb is non-linear and not all of len are part of the linear section. Make len bytes from skb readable and writable.
— bpf-helpers, Linux manual page
So, using this function, we can ask the kernel to make len bytes from the skb readable and writable; the kernel takes care of pulling in additional data from the non-linear parts of the skb if needed.
It returns a negative number on failure, for example if it’s unable to pull in the required number of bytes, in which case we abort processing of the packet.
We can actually write a quick program to visualise the expansion of our
skb
, where we’re going to always pull in all non-linear data. This program
will sit in a file called skb_pull_vis.bpf.c
, and we use SEC("tc")
to
specify it’s a TC classifier program.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
SEC("tc")
int skb_pull_vis(struct __sk_buff *skb) {
bpf_printk("(Before) Linear: %d, Total: %d", \
skb->data_end - skb->data, skb->len);
if (bpf_skb_pull_data(skb, skb->len) < 0)
return TC_ACT_OK;
bpf_printk("(After) Linear: %d, Total: %d", \
skb->data_end - skb->data, skb->len);
return TC_ACT_OK;
}
char LICENSE[] SEC("license") = "GPL v2";
Compiling this program follows the exact same steps as before.
> clang -Wall -O2 -target bpf -c skb_pull_vis.bpf.c -o skb_pull_vis.bpf.o
However, loading and attaching it is a more involved process:

1. First, attach a qdisc to the network interface we want to monitor
2. Then, attach our eBPF classifier to that qdisc in direct-action mode

In our case, we are going to use the clsact qdisc. clsact is the recommended qdisc for direct-action eBPF classifiers because:

- it exposes both an ingress and an egress hook point, letting us intercept traffic in both directions through a single qdisc
- it performs no queueing of its own, so it adds no scheduling overhead — it exists purely to run classifiers and actions
Let’s first list the interfaces on our machine, so we can decide on which
one we should attach to, (the following command requires the iproute2
package).
> ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 7a:93:35:e0:e1:85 brd ff:ff:ff:ff:ff:ff
We want to attach the clsact
qdisc to any interface which has traffic
flowing over it, so that our classifier is triggered and the traces we
defined are produced. For the machine which I’m currently developing on,
this would be the enp0s1
interface, so we will attach to it.
> sudo tc qdisc add dev enp0s1 clsact
Next, clsact
offers two hook points — ingress
and egress
— which we can
attach our classifier to. Note that we must also specify the section in the object file
that our classifier is stored in, so that the tc
utility knows where to load
the classifier from.
> sudo tc filter add dev enp0s1 ingress bpf direct-action obj skb_pull_vis.bpf.o sec tc
> sudo tc filter add dev enp0s1 egress bpf direct-action obj skb_pull_vis.bpf.o sec tc
Just like before, we can view the generated trace.
> sudo cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [000] ..s2. 46791.594477: bpf_trace_printk: (Before) Linear: 126, Total: 126
<idle>-0 [000] ..s2. 46791.594523: bpf_trace_printk: (After) Linear: 126, Total: 126
sshd-1026 [001] b..1. 46791.594566: bpf_trace_printk: (Before) Linear: 54, Total: 554
sshd-1026 [001] b..1. 46791.594567: bpf_trace_printk: (After) Linear: 554, Total: 554
The trace shows us two things about the behaviour of bpf_skb_pull_data with len set to skb->len:

1. If the skb is already entirely linear, (as in the first pair of traces), bpf_skb_pull_data is a no-op
2. If the skb is non-linear, (as in the second pair), the pull also updates the skb->data_end pointer to point to the end of the expanded buffer

Finally, to unload our eBPF classifier we simply delete the qdisc it’s attached to.
> sudo tc qdisc del dev enp0s1 clsact
Now, let’s populate our packet_t
structure for a given skb
.
We’re going to start with a simple skeleton program called tcplat
,
and iteratively develop its functionality. (Don’t worry too much about which headers need to be included; at the end of this section I will display our final eBPF program, which will include all relevant headers.)
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
SEC("tc")
int tcplat(struct __sk_buff *skb) {
// New code goes here...
return TC_ACT_OK;
}
char LICENSE[] SEC("license") = "GPL v2";
Before we begin reading the socket buffer’s data, there’s one thing we should bear in mind.
The earlier that we can abort processing of invalid packets, the more
efficient our program is, and, since we’re intercepting TCP/IP packets,
we’d like to ignore all packets which don’t use IPv4 or IPv6 as their
Layer 3 protocol. The skb
actually tells us this information via the
skb->protocol
field, which identifies the L3 protocol of the packet as
one of the ETH_P_*
values defined in the uapi/linux/if_ether.h
file.
So, the first thing we’re going to do is abort processing all packets which don’t use IPv4/IPv6 as their L3 protocol, and hand control back to the kernel.
However, we can’t immediately compare the ETH_P_* values defined in our host system’s header file with the skb->protocol field, because they have different endianness, (little-endian and big-endian respectively).
To fix this, we can use the bpf_ntohs helper function, which converts data from network byte-order (big-endian) to host byte-order (little-endian).
uint32_t host_protocol = bpf_ntohs(skb->protocol);
if (host_protocol != ETH_P_IP && host_protocol != ETH_P_IPV6)
return TC_ACT_OK;
Now that we’re guaranteed to be processing IPv4/IPv6 packets, we want to ensure
that both the IP and TCP headers are present in the linear part of the skb
,
and pull them in if not — we populate our packet_t
with information from these
headers.
On failure, we hand control back to the kernel.
(Also, just remember that at this stage, we’re not guaranteed that our skb
will
actually contain a TCP header, even if its linear buffer contains the minimum amount
of data required to parse one.)
uint32_t ip_header_len = (host_protocol == ETH_P_IP) ? \
sizeof(struct iphdr) : sizeof(struct ipv6hdr);
uint32_t total_header_len = \
sizeof(struct ethhdr) + ip_header_len + sizeof(struct tcphdr);
if (bpf_skb_pull_data(skb, total_header_len) < 0)
return TC_ACT_OK;
Next, in order to begin reading the headers, we must appease the eBPF verifier by
performing a bounds check to ensure that the skb
’s linear buffer is large enough
to accommodate our headers. Otherwise, the eBPF verifier will reject our program due
to potential out-of-bounds memory access, which could lead to reading from incorrect
memory areas or cause the program to crash.
Parsing the memory locations is a two-step process:

1. Cast skb->data, which is defined as a __u32, to a uint64_t to make it equivalent to the pointer size of the architecture we are currently developing on (64-bit). The extra 32 bits are initialised to zero
2. Cast the uint64_t to a uint8_t pointer. We specifically use a uint8_t instead of a void pointer because we are going to perform pointer arithmetic on it, which is invalid on void pointers in standard C unless extensions are used

uint8_t *head = (uint8_t*)(uint64_t)skb->data;
uint8_t *tail = (uint8_t*)(uint64_t)skb->data_end;
if (head + total_header_len > tail)
return TC_ACT_OK;
Before we continue further processing of the headers, let’s take a moment to allocate our packet onto the stack.
struct packet_t pkt = {0};
At this stage, we can populate pkt with the source and destination IP addresses. We also take this chance to abort processing the packet if the protocol that the IP packet is encapsulating is not TCP.
struct iphdr *ip;
struct ipv6hdr *ip6;
switch (host_protocol) {
case ETH_P_IP:
ip = (struct iphdr*) (head + sizeof(struct ethhdr));
if (ip->protocol != IPPROTO_TCP)
return TC_ACT_OK;
pkt.src_ip.in6_u.u6_addr32[3] = ip->saddr;
pkt.dst_ip.in6_u.u6_addr32[3] = ip->daddr;
pkt.src_ip.in6_u.u6_addr16[5] = 0xffff;
pkt.dst_ip.in6_u.u6_addr16[5] = 0xffff;
break;
case ETH_P_IPV6:
ip6 = (struct ipv6hdr*) (head + sizeof(struct ethhdr));
if (ip6->nexthdr != IPPROTO_TCP)
return TC_ACT_OK;
pkt.src_ip = ip6->saddr;
pkt.dst_ip = ip6->daddr;
break;
};
We’ve guaranteed that we’re processing a TCP packet, so let’s scrape the
rest of the information. We make sure to only scrape data for packets which
are SYN
or SYN-ACK
.
struct tcphdr *tcp = \
(struct tcphdr*) (head + sizeof(struct ethhdr) + ip_header_len);
if (tcp->syn) {
pkt.src_port = tcp->source;
pkt.dst_port = tcp->dest;
pkt.syn = tcp->syn;
pkt.ack = tcp->ack;
pkt.timestamp = bpf_ktime_get_ns();
}
So, the final step in our processing pipeline is to send the pkt structure to some user space program; to do this, we will push the packet data to an eBPF map.
The BPF Ring Buffer map is a Multi-Producer Single-Consumer (MPSC) queue that fits our use case pretty well:

- our classifier may run concurrently on multiple CPUs, (multiple producers), while a single user space program consumes the data, (single consumer)
- data is consumed in the order it was submitted, (e.g. our user space program receives data about SYN packets before it receives SYN-ACK ones)

We are going to use BTF style to define our map, which we call pipe.
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 512 * 1024);
} pipe SEC(".maps");
The map definition syntax is structured as follows:

- struct { ... } pipe SEC(".maps") — defines a map called pipe and places it in the ".maps" section of the compiled ELF file, which is dedicated to BTF-style map definitions
- __uint(type, BPF_MAP_TYPE_RINGBUF) — sets the map to be a BPF ring buffer. More map types can be seen by viewing the uapi/linux/bpf.h file
- __uint(max_entries, 512 * 1024) — sets the maximum size of the ring buffer to be 512 KB

Note that __uint(name, val) is a macro provided by the bpf/bpf_helpers.h file.
Let’s take a really quick aside for a second — if we expand the macros used to define our map, it evaluates to the following.
struct {
int (*type)[BPF_MAP_TYPE_RINGBUF];
int (*max_entries)[512 * 1024];
} pipe SEC(".maps");
This seems slightly nonsensical — why does the type field point to an integer array of length BPF_MAP_TYPE_RINGBUF?
So, if you pause for a sec and take a further look on the Internet at how BTF-style maps are defined, you might be further confused. This is because many examples define them using a different syntax, which seems more sensible.
struct {
int type;
int max_entries;
// Some fields omitted...
} pipe SEC(".maps") = {
.type = BPF_MAP_TYPE_RINGBUF,
.max_entries = 512 * 1024,
};
As it turns out, libbpf
is responsible for the conversion from the first
structure into the second one, which is then embedded into the ELF file.
This is what makes our evaluated macro valid, nice.
Ok, aside done.
Well, now that we’ve defined our map, we can push data to it using the bpf_ringbuf_output helper function, where some user space program is expected to consume the data asynchronously.
bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0);
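(As an aside, the ring buffer API also offers a reserve/submit pattern via the bpf_ringbuf_reserve and bpf_ringbuf_submit helpers, which reserves space in the buffer up front and lets us detect a full buffer before writing. A rough sketch of what our push could look like with that pattern is below — we’ll stick with bpf_ringbuf_output for tcplat.)

// Reserve space for one packet_t directly inside the ring buffer; if the
// buffer is full, bpf_ringbuf_reserve returns NULL and we drop the event.
struct packet_t *slot = bpf_ringbuf_reserve(&pipe, sizeof(struct packet_t), 0);
if (slot) {
    *slot = pkt;                  // copy the stack-allocated packet into the reservation
    bpf_ringbuf_submit(slot, 0);  // make it visible to the user space consumer
}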
And this concludes our kernel space implementation — you can view the full code for our eBPF program here.
Thank you for reading so far! Please let me know if you have any feedback or corrections.
If you’re looking for more TC reading, you should check out:

- the tc manual page
- write-ups on the direct-action flag for eBPF classifiers
Now you should be ready to move on to Part 3, where we implement the Go program which
reads from the pipe
map and performs the latency calculation.