Research: Securing Linux with a Faster and Scalable IPtables
If you haven’t found the trade-offs, you haven’t looked hard enough.
A perfect illustration is the research paper under review, Securing Linux with a Faster and Scalable Iptables. Before diving into the paper, however, some background might be helpful. Consider the situation where you want to filter traffic being transmitted to and from a virtual workload of some kind, as shown below.
To move a packet from user space into the kernel, the packet itself must be copied into some form of memory that code on “both sides of the divide” can read; then the state of the process (memory, stack, program execution point, etc.) must be saved, and control transferred to the kernel. All of this takes time and power, of course.
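As a rough illustration of that copy cost (my own example, not anything from the paper), the small C program below opens a raw packet socket: every frame it sees has to be copied out of the kernel into the process’s buffer by recvfrom(), and every call is a user/kernel transition. It needs root to run, and it listens on all interfaces.

```c
/* Illustration only: each recvfrom() forces the kernel to copy a frame
 * into this process's buffer, crossing the user/kernel boundary. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */

int main(void) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket (needs root)"); return 1; }

    unsigned char frame[2048];
    for (int i = 0; i < 5; i++) {
        ssize_t len = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
        if (len < 0) { perror("recvfrom"); break; }
        printf("copied %zd bytes from kernel to user space\n", len);
    }
    close(fd);
    return 0;
}
```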
In the current implementation of packet filtering, netfilter performs the majority of filtering within the kernel, while iptables acts as a user frontend as well as performing some filtering actions in the user space. Packets being pushed from one interface to another must make the transition between the user space and the kernel twice. Interfaces like XDP aim to make the processing of packets faster by shortening the path from the virtual workload to the PHY chipset.
What if, instead of putting the functionality of iptables in the user space, you could put it in the kernel space? This would make the process of switching packets through the device faster, because you would not need to pull packets out of the kernel into a user space process to perform filtering.
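To make this concrete, here is a minimal sketch (mine, not the paper’s code) of what in-kernel filtering looks like with an XDP program: the verdict is rendered at the driver hook, so the packet never has to be copied up to a user space process. The “drop ICMP” policy is purely illustrative.

```c
/* Minimal XDP filter sketch: runs in the kernel at the driver hook. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                      /* too short, let the stack decide */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;                      /* not IPv4 */

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;                      /* filtered entirely in the kernel */

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

A program like this would typically be compiled with clang -O2 -target bpf and attached to an interface with something like ip link set dev eth0 xdp obj filter.bpf.o sec xdp.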
But there are trade-offs. According to the authors of this paper, there are three specific challenges that need to be addressed. First, users expect iptables filtering to take place in the user process; if a packet is transmitted between virtual workloads, the user expects any filtering to take place before the packet is pushed into the kernel to be carried across the bridge and back out into user space to the second process. Second, a separate component, conntrack, tracks the existence of TCP connections, which iptables then uses to determine whether a packet should be dropped because there is no existing connection it belongs to. This gives iptables the ability to do stateful filtering. Third, classification of packets is very expensive; classifying packets could take too much processing power or memory to be done efficiently in the kernel.
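On the second challenge, the rough sketch below shows how connection state might be consulted from an eBPF program; the names (conn_table, flow_key) are mine, not the paper’s, and a BPF hash map keyed by the flow 5-tuple simply stands in for conntrack’s table. Something else, user space or another program, is assumed to populate the map as connections are established; this sketch only reads it.

```c
/* Sketch of stateful filtering from eBPF: pass packets that belong to a
 * known connection (or open a new one with SYN), drop the rest. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u8  proto;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, __u8);              /* connection state, filled in elsewhere */
} conn_table SEC(".maps");

SEC("xdp")
int stateful_filter(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return XDP_PASS;
    if (ip->ihl != 5)                  /* skip IP options for simplicity */
        return XDP_PASS;

    struct tcphdr *tcp = (void *)(ip + 1);
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    struct flow_key key;
    __builtin_memset(&key, 0, sizeof(key));  /* zero padding for a stable key */
    key.saddr = ip->saddr;
    key.daddr = ip->daddr;
    key.sport = tcp->source;
    key.dport = tcp->dest;
    key.proto = ip->protocol;

    /* Known connection, or a new SYN: let it through; otherwise drop. */
    if (bpf_map_lookup_elem(&conn_table, &key) || tcp->syn)
        return XDP_PASS;

    return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
```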
To resolve these issues, the authors of this paper propose using an in-kernel virtual machine, eBPF. They design an architecture which splits iptables into two pipelines, one for ingress and one for egress, as shown in the illustration taken from the paper below.
As you can see, the result is… complex. Not only are there more components, with many more interaction surfaces, there is also the complexity of creating in-kernel virtual machines—remembering that virtual machines are designed to separate out processing and memory spaces to prevent cross-application data leakage and potential single points of failure.
That these problems are solvable is not in question—the authors describe how they solved each of the challenges they laid out. The question is: are the trade-offs worth it?
The bottom line: when you move filtering from the network to the host, you are not moving the problem to a place where it is less complex. You may make the network design itself less complex, and you may move filtering closer to the application, so some specific security problems are easier to solve, but the overall complexity of the system is going way up, particularly if you want a high performance solution.
Your points about interaction surfaces are all valid, but there are some factual inaccuracies about iptables and eBPF.
iptables filtering actually happens in kernel space. iptables (the command) is a user-space tool to configure netfilter (the kernel side of things), which does the actual filtering in the kernel. iptables basically has an ingress/egress pipeline today, so it looks like they’re more or less replicating it in eBPF.
My understanding of eBPF is that it exists to optimize the same processing pipeline and provide a way to safely introduce code changes to the pipeline. It’s not a complex VM like the Java Virtual Machine; it’s not even a Turing-complete language, because it does not support loops. Languages and VMs similar to eBPF typically impose a strict separation between code and the data it’s operating on, so it’s hard to inject code into a program stack, for example. I doubt the language even supports heap-based memory allocation. In terms of VM footprint, it’s not going to be nearly as heavy as the JVM.
I’m not arguing with the results of putting filtering into the network hardware; it is likely going to be much faster. But I do think that eBPF filtering is an advance worth giving more credit to. The interaction surfaces might actually be cleaner and easier to work on with eBPF than with a set of compiled C modules that get dynamically loaded into the kernel.
I’m not entirely certain iptables is “just” a user interface for netfilter. The authors of the paper make much of placing their filtering hooks in the same places iptables does today; if filtering were entirely a kernel-side affair, it would seem this would not be required. Further, if you have multiple namespaces, and all filtering takes place in the kernel, then you would need to have a single, shared filter set across all namespaces, yet this is not true.
What I would say is that netfilter acts on the filters set by iptables, but that iptables does a lot of classification and other work in the user space. In other words, the “complete filtering process” is spread across user and kernel space. Otherwise there wouldn’t be any attempt at gaining performance by moving more of the functionality to the kernel.
The article abstract says they’re using eBPF and XDP hooks, not iptables hooks. At least for the eBPF hooks, my understanding is that those are largely there to load a program into the kernel. They may include some facilities to pull packets into user-space, but I don’t think that’s the primary goal.
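To show what I mean by “load a program into the kernel,” here’s a rough libbpf sketch (the object file name, program name, and interface are all made up for illustration): the user-space side just opens a compiled eBPF object, lets the verifier check it at load time, and attaches it to an interface. No packets flow through this process.

```c
/* Sketch of a user-space loader: open, load (verifier runs here), attach. */
#include <stdio.h>
#include <net/if.h>          /* if_nametoindex */
#include <bpf/libbpf.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("filter.bpf.o", NULL);
    if (!obj) { fprintf(stderr, "open failed\n"); return 1; }

    if (bpf_object__load(obj)) {           /* kernel verifies the program */
        fprintf(stderr, "load failed\n");
        return 1;
    }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "drop_icmp");   /* made-up name */
    int ifindex = if_nametoindex("eth0");                      /* made-up iface */

    if (!prog || !ifindex ||
        bpf_xdp_attach(ifindex, bpf_program__fd(prog), 0, NULL)) {
        fprintf(stderr, "attach failed\n");
        return 1;
    }

    printf("program attached; filtering happens in the kernel\n");
    return 0;
}
```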
Specific to your point — “if you have multiple namespaces, and all filtering takes place in the kernel, then you would need to have a single, shared, filter set across all namespaces”
I actually don’t know why this constraint would be necessary. Namespaces allow you to create separate network devices, IP addresses, IP routing tables, etc., which are all in-kernel constructs. Why would that include an IP stack but exclude netfilter?
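A quick way to convince yourself of this (a throwaway sketch, needs root, and it shells out to ip and iptables just for display): after unshare(CLONE_NEWNET) the process has its own network stack, and the iptables ruleset it sees is empty and independent of whatever the host has loaded.

```c
/* Sketch: create a fresh network namespace and look inside it. */
#define _GNU_SOURCE
#include <sched.h>      /* unshare, CLONE_NEWNET */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Give this process its own network namespace: devices, routes,
     * and netfilter tables are all created fresh, in the kernel. */
    if (unshare(CLONE_NEWNET) != 0) {
        perror("unshare(CLONE_NEWNET)");
        return 1;
    }

    /* Inside the new namespace there is only a down loopback interface,
     * and the filter ruleset is the default empty one. */
    if (system("ip link show; iptables -S") == -1)
        perror("system");

    return 0;
}
```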
So, I could be wrong and/or things could have changed, but from my years of sysadmin past it was always an in-kernel thing. Way back when I built my own kernels, netfilter was always an option, and it was a big deal when it changed from ipchains to iptables. I remember selecting this option in the kernel build either as a run-time loadable module (via modprobe) or compiling it directly in. Yeah, I was one of those who recompiled their kernel for “performance reasons” back in the day.
The kernel source tree (https://github.com/torvalds/linux/tree/master/net/netfilter) contains the netfilter code. I don’t know how/where that code makes its way into the packet path and I’m not going to dig it up, but the code has quite a few calls like kvzalloc() that you wouldn’t use in a userland process. The kernel source tree also typically doesn’t include very many userland tools.
The iptables project page (https://netfilter.org/projects/iptables/index.html) says “iptables is the userspace command line program used to *configure* the Linux 2.4.x and later packet filtering ruleset.”