Overview
In this final post of the series we take a look at Network namespaces. As we hinted at in the intro post, a network namespace isolates network-related resources - a process running in a distinct network namespace has its own networking devices, routing tables, firewall rules, etc. We can see this in action immediately by inspecting our current network environment.
The ip Command
Since we will be interacting with network devices in this post, we will reinstate the superuser requirements that we relaxed in the previous posts. From now on, we will assume that both ip and isolate are being run with sudo.
$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:96:2e:3b brd ff:ff:ff:ff:ff:ff
Star of the show here is the ip command - the Swiss Army Knife for networking in Linux - and we will use it extensively in this post.
Right now we have just run the link list subcommand to show us what networking devices are currently available in the system (here we have lo, the loopback interface, and ens33, an ethernet LAN interface).
As with all other namespaces, the system starts with an initial network namespace to which all processes belong unless specified otherwise. Running this ip link list command as-is gives us the networking devices owned by the initial namespace (since our shell and the ip command belong to this namespace).
Named Network Namespaces
Let’s create a new network namespace:
$ ip netns add coke
$ ip netns list
coke
Again, we’ve used the ip command. Its netns subcommand allows us to play with network namespaces - for example we can create new network namespaces using the add subcommand of netns and use list to, well, list them.
You might notice that list only returned our newly created namespace - shouldn’t it return at least two, the other one being the initial namespace that we mentioned earlier?
The reason for this is that ip creates what is called a named network namespace, which is simply a network namespace that is identifiable by a unique name (in our case coke).
Only named network namespaces are shown via list, and the initial network namespace isn’t named.
Named network namespaces are easier to get a hold of. For example, a file is created for each named network namespace under the /var/run/netns folder and can be used by a process that wants to switch to its namespace. Another property of named network namespaces is that they can exist without having any process as a member - unlike anonymous ones, which are deleted once all their member processes exit.
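Incidentally, switching into a named network namespace from code is just a matter of opening that file and handing the descriptor to setns(2) - this is essentially what ip netns exec does before running its command. A minimal sketch (not from the post, with simple error handling):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

// Join the named network namespace "coke" via its file under /var/run/netns.
int join_coke_netns(void)
{
    int fd = open("/var/run/netns/coke", O_RDONLY);
    if (fd < 0) {
        perror("open /var/run/netns/coke");
        return -1;
    }
    // setns() moves the calling process into the namespace behind fd.
    if (setns(fd, CLONE_NEWNET) < 0) {
        perror("setns");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}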
Now that we have a child network namespace, we can see networking from its perspective.
We will be using the command prompt C$ to emphasize a shell running inside a child network namespace.
$ ip netns exec coke bash
C$ ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
The exec $namespace $command subcommand executes $command in the named network namespace $namespace.
Here we ran a shell inside the coke namespace and listed the network devices available. We can see that our ens33 device has disappeared: the only device that shows up is loopback, and even that interface is down.
C$ ping 127.0.0.1
connect: Network is unreachable
We should be used to this by now: the default setup for namespaces is usually very strict. For network namespaces, as we can see, no devices except loopback will be present.
We can bring the loopback interface up without any paperwork though:
C$ ip link set dev lo up
C$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.034 ms
...
Network Isolation
We’re already starting to see that by running a process in a nested network namespace like coke, we can be sure that it is isolated from the rest of the system as far as networking is concerned.
Our shell process running in coke can only communicate via loopback - this means that it can only communicate with processes that are also members of the coke namespace, but currently there are no other member processes (and in the name of isolation, we would like it to stay that way), so it’s a bit lonely.
Let’s try to relax this isolation a bit: we will create a tunnel through which processes in coke can communicate with processes in our initial namespace.
Now, any network communication has to go via some network device, and a device can exist in exactly one network namespace at any given time, so communication between any two processes in different namespaces must go via at least two network devices - one in each network namespace.
Veth Devices
We will use a virtual ethernet device (or veth for short) to fulfill our need.
Veth devices are always created as a pair of devices in a tunnel-like fashion, so that messages written to the device on one end come out of the device on the other end.
You might guess that we could easily have one end in the initial network namespace and the other in our child network namespace and have all inter-network-namespace communication go via the respective veth end device (and you would be correct).
# Create a veth pair (veth0 <=> veth1)
$ ip link add veth0 type veth peer name veth1
# Move the veth1 end to the new namespace
$ ip link set veth1 netns coke
# List the network devices from inside the new namespace
C$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
7: veth1@if8: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ee:16:0c:23:f3:af brd ff:ff:ff:ff:ff:ff link-netnsid 0
Our veth1 device now shows up in the coke namespace. But to make the veth pair functional, we need to give both ends IP addresses and bring the interfaces up. We will do this in their respective network namespaces.
# In the initial namespace
$ ip addr add 10.1.1.1/24 dev veth0
$ ip link set dev veth0 up
# In the coke namespace
C$ ip addr add 10.1.1.2/24 dev veth1
C$ ip link set dev veth1 up
C$ ip addr show veth1
7: veth1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ee:16:0c:23:f3:af brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.1.1.2/24 scope global veth1
valid_lft forever preferred_lft forever
inet6 fe80::ec16:cff:fe23:f3af/64 scope link
valid_lft forever preferred_lft forever
We should see that veth1 is up and has our assigned address 10.1.1.2 - the same should happen for veth0 in the initial namespace.
Now we should be able to do an inter-namespace ping between processes running in the two namespaces.
$ ping -I veth0 10.1.1.2
PING 10.1.1.2 (10.1.1.2) 56(84) bytes of data.
64 bytes from 10.1.1.2: icmp_seq=1 ttl=64 time=0.041 ms
...
C$ ping 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=0.067 ms
...
Implementation
The source code for this post can be found here.
As usual, we will now try to replicate what we’ve seen so far in code. Specifically, we will need to do the following:
1. Execute the command within a new network namespace.
2. Create a veth pair (veth0 <=> veth1).
3. Move the veth1 device to the new namespace.
4. Assign IP addresses to both devices and bring them up.
Step 1 is straightforward: we create our command process in a new network namespace by adding the CLONE_NEWNET flag to clone:
int clone_flags = SIGCHLD | CLONE_NEWUTS | CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWNET;
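For context, here is a rough sketch of how that flag ends up in the clone call - child_main, STACK_SIZE, cmd_stack and run_command are placeholder names, not the actual identifiers used by isolate:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

#define STACK_SIZE (1024 * 1024)
static char cmd_stack[STACK_SIZE];

// Placeholder entry point for the command process.
static int child_main(void *arg)
{
    // ... wait for the parent to finish setup, then exec the command ...
    return 0;
}

static int run_command(void)
{
    int clone_flags = SIGCHLD | CLONE_NEWUTS | CLONE_NEWUSER
                      | CLONE_NEWNS | CLONE_NEWNET;
    // The stack grows downwards, so pass a pointer to its top.
    int cmd_pid = clone(child_main, cmd_stack + STACK_SIZE, clone_flags, NULL);
    if (cmd_pid < 0)
        die("cannot clone command process: %m\n");
    return cmd_pid;
}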
Netlink
For the remaining steps, we will primarily be using the Netlink interface to communicate with Linux.
Netlink is primarily used for communication between regular applications (like isolate) and the Linux kernel.
It exposes an API on top of sockets, based on a protocol that determines message structure and content.
Using this protocol we can send messages that Linux receives and translates to requests - like create a veth pair with names veth0 and veth1.
Let’s start by creating our netlink socket. When creating it, we specify that we want to use the NETLINK_ROUTE protocol - this protocol covers network routing and device management.
int create_socket(int domain, int type, int protocol)
{
int sock_fd = socket(domain, type, protocol);
if (sock_fd < 0)
die("cannot open socket: %m\n");
return sock_fd;
}
int sock_fd = create_socket(
PF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
Netlink Message Format
A Netlink message is a 4-byte aligned block of data containing a header (struct nlmsghdr
) and a payload. The header format is described here. The Network Interface Service (NIS) Module specifies the format (struct ifinfomsg
) that payload related to network interface administration must begin with.
Our request will be represented by the following C struct:
#define MAX_PAYLOAD 1024
struct nl_req {
struct nlmsghdr n; // Netlink message header
struct ifinfomsg i; // Payload starting with NIS module info
char buf[MAX_PAYLOAD]; // Remaining payload
};
Netlink Attributes
The NIS module requires the payload to be encoded as Netlink attributes.
Attributes provide a way to segment the payload into subsections.
An attribute has a type and a length in addition to a payload containing its actual data.
The Netlink message payload will be encoded as a list of attributes (where any such attribute can in turn have nested attributes) and we will have some helper functions to populate it with attributes.
In code, an attribute is represented by the rtattr struct in the linux/rtnetlink.h header file as:
struct rtattr {
unsigned short rta_len;
unsigned short rta_type;
};
rta_len is the total length of the attribute - the rtattr header plus its payload, which immediately follows the rtattr struct in memory.
How the content of this payload is interpreted is dictated by rta_type, and the possible values are entirely dependent on the receiver implementation and the request being sent.
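The addattr_l, addattr_nest and addattr_nest_end helpers used below are not shown in the post; assuming they mirror their iproute2 counterparts, they might look roughly like this:

#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

// Append one attribute (type + data of length alen) to the end of message n,
// refusing to grow past maxlen bytes.
struct rtattr *addattr_l(struct nlmsghdr *n, int maxlen, int type,
                         const void *data, int alen)
{
    int len = RTA_LENGTH(alen);
    if (NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len) > maxlen)
        die("netlink message exceeded max length %d\n", maxlen);
    struct rtattr *rta =
        (struct rtattr *)((char *)n + NLMSG_ALIGN(n->nlmsg_len));
    rta->rta_type = type;
    rta->rta_len = len;
    if (data)
        memcpy(RTA_DATA(rta), data, alen);
    n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len);
    return rta;
}

// Start a nested attribute: append an empty attribute and remember where it
// is so its length can be patched once its children have been added.
struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type)
{
    return addattr_l(n, maxlen, type, NULL, 0);
}

// Close a nested attribute by fixing up its length to cover everything
// appended since addattr_nest.
void addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest)
{
    nest->rta_len = (char *)n + NLMSG_ALIGN(n->nlmsg_len) - (char *)nest;
}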
To put this all together, let’s see how isolate makes a netlink request to create a veth pair, using the following function create_veth which fulfills step 2:
// ip link add ifname type veth ifname name peername
void create_veth(int sock_fd, char *ifname, char *peername)
{
__u16 flags =
NLM_F_REQUEST // This is a request message
| NLM_F_CREATE // Create the device if it doesn't exist
| NLM_F_EXCL // Fail if it already exists
| NLM_F_ACK; // Reply with an acknowledgement or error
// Initialise request message.
struct nl_req req = {
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
.n.nlmsg_flags = flags,
.n.nlmsg_type = RTM_NEWLINK, // We want to create a new network interface
.i.ifi_family = PF_NETLINK,
};
struct nlmsghdr *n = &req.n;
int maxlen = sizeof(req);
/*
* Create an attribute r0 with the veth info. e.g if ifname is veth0
* then the following will be appended to the message
* {
* rta_type: IFLA_IFNAME
* rta_len: 10 (sizeof(struct rtattr) + strlen(veth0) + 1)
* data: veth0\0
* }
*/
addattr_l(n, maxlen, IFLA_IFNAME, ifname, strlen(ifname) + 1);
// Add a nested attribute r1 within r0 containing iface info
struct rtattr *linfo =
addattr_nest(n, maxlen, IFLA_LINKINFO);
// Specify the device type is veth
addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, "veth", 5);
// Add another nested attribute r2
struct rtattr *linfodata =
addattr_nest(n, maxlen, IFLA_INFO_DATA);
// The next nested attribute r3 contains the peer info, e.g. the peer name veth1
struct rtattr *peerinfo =
addattr_nest(n, maxlen, VETH_INFO_PEER);
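// The kernel expects the VETH_INFO_PEER payload to start with its own
// struct ifinfomsg, so reserve room for it before adding the peer's attributes.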
n->nlmsg_len += sizeof(struct ifinfomsg);
addattr_l(n, maxlen, IFLA_IFNAME, peername, strlen(peername) + 1);
addattr_nest_end(n, peerinfo); // end r3 nest
addattr_nest_end(n, linfodata); // end r2 nest
addattr_nest_end(n, linfo); // end r1 nest
// Send the message
send_nlmsg(sock_fd, n);
}
As we can see, we need to be precise about what we send here - we had to encode the message in the exact way it will be interpreted by the kernel implementation, and it took us three levels of nested attributes to do so.
I’m sure this is documented somewhere even though I was unable to find it after some googling - I mostly figured this out via strace and the ip command source code.
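The send_nlmsg call at the end of create_veth is also not shown. Assuming it simply sends the request and checks the acknowledgement we asked for with NLM_F_ACK, a minimal version might look like this:

#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

void send_nlmsg(int sock_fd, struct nlmsghdr *n)
{
    // Requests addressed to nl_pid 0 go to the kernel.
    struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
    if (sendto(sock_fd, n, n->nlmsg_len, 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        die("cannot send netlink message: %m\n");

    // Because we set NLM_F_ACK, the kernel answers with an NLMSG_ERROR
    // message whose error field is 0 on success.
    struct nl_req resp;
    if (recv(sock_fd, &resp, sizeof(resp), 0) < 0)
        die("cannot receive netlink response: %m\n");
    if (resp.n.nlmsg_type == NLMSG_ERROR) {
        struct nlmsgerr *err = NLMSG_DATA(&resp.n);
        if (err->error)
            die("netlink request failed: %s\n", strerror(-err->error));
    }
}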
Next, for step 3, we have a method that, given an interface name ifname and a network namespace file descriptor netns, moves the device associated with that interface into the specified network namespace.
// $ ip link set veth1 netns coke
void move_if_to_pid_netns(int sock_fd, char *ifname, int netns)
{
struct nl_req req = {
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
.n.nlmsg_type = RTM_NEWLINK,
.i.ifi_family = PF_NETLINK,
};
addattr_l(&req.n, sizeof(req), IFLA_NET_NS_FD, &netns, 4);
addattr_l(&req.n, sizeof(req), IFLA_IFNAME,
ifname, strlen(ifname) + 1);
send_nlmsg(sock_fd, &req.n);
}
After creating the veth pair and moving one end to our target network namespace, step 4 has us assigning both end devices IP addresses and bringing their interfaces up.
For that we have a helper function if_up which, given an interface name ifname and an IP address ip, assigns ip to the device ifname and brings it up.
For brevity we do not show those here but they can be found here instead.
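For reference, one possible ioctl-based sketch of such a helper follows - the actual if_up in the isolate sources may well be written differently:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

void if_up(char *ifname, char *ip, char *netmask)
{
    // Any AF_INET socket can carry interface ioctls.
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        die("cannot open socket: %m\n");

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    // Assign the IP address.
    struct sockaddr_in *addr = (struct sockaddr_in *)&ifr.ifr_addr;
    addr->sin_family = AF_INET;
    inet_pton(AF_INET, ip, &addr->sin_addr);
    if (ioctl(sock, SIOCSIFADDR, &ifr) < 0)
        die("cannot set ip address of %s: %m\n", ifname);

    // Assign the netmask (ifr_netmask shares storage with ifr_addr).
    inet_pton(AF_INET, netmask, &addr->sin_addr);
    if (ioctl(sock, SIOCSIFNETMASK, &ifr) < 0)
        die("cannot set netmask of %s: %m\n", ifname);

    // Finally, bring the interface up.
    if (ioctl(sock, SIOCGIFFLAGS, &ifr) < 0)
        die("cannot get flags of %s: %m\n", ifname);
    ifr.ifr_flags |= IFF_UP;
    if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0)
        die("cannot set flags of %s: %m\n", ifname);

    close(sock);
}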
Finally, we bring these methods together to prepare our network namespace for our command process.
static void prepare_netns(int child_pid)
{
char *veth = "veth0";
char *vpeer = "veth1";
char *veth_addr = "10.1.1.1";
char *vpeer_addr = "10.1.1.2";
char *netmask = "255.255.255.0";
// Create our netlink socket
int sock_fd = create_socket(
PF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
// ... and our veth pair veth0 <=> veth1.
create_veth(sock_fd, veth, vpeer);
// veth0 is in our current (initial) namespace
// so we can bring it up immediately.
if_up(veth, veth_addr, netmask);
// ... veth1 will be moved to the command namespace.
// To do that we need to grab a file descriptor to the
// command's namespace and enter it, but first we must
// remember our current namespace so we can get back to it
// when we're done.
int mynetns = get_netns_fd(getpid());
int child_netns = get_netns_fd(child_pid);
// Move veth1 to the command network namespace.
move_if_to_pid_netns(sock_fd, vpeer, child_netns);
// ... then enter it
if (setns(child_netns, CLONE_NEWNET)) {
die("cannot setns for child at pid %d: %m\n", child_pid);
}
// ... and bring veth1 up
if_up(vpeer, vpeer_addr, netmask);
// ... before moving back to our initial network namespace.
if (setns(mynetns, CLONE_NEWNET)) {
die("cannot restore previous netns: %m\n");
}
close(sock_fd);
}
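prepare_netns also relies on a helper get_netns_fd to obtain a namespace file descriptor for a process. Assuming it simply opens the process's network namespace file under /proc, a sketch could be:

#include <fcntl.h>
#include <stdio.h>

// Open /proc/<pid>/ns/net and return the file descriptor, which can be
// passed to setns() or handed to the kernel via IFLA_NET_NS_FD.
int get_netns_fd(int pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/ns/net", pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        die("cannot open netns file %s: %m\n", path);
    return fd;
}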
Then we can call prepare_netns right after we’re done setting up the user namespace.
...
// Get the writable end of the pipe.
int pipe = params.fd[1];
prepare_userns(cmd_pid);
prepare_netns(cmd_pid);
// Signal to the command process we're done with setup.
...
Let’s try it out!
$ sudo ./isolate sh
===========sh============
$ ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
31: veth1@if32: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether 2a:e8:d9:df:b4:3d brd ff:ff:ff:ff:ff:ff
# Verify inter-namespace connectivity
$ ping 10.1.1.1
PING 10.1.1.1 (10.1.1.1): 56 data bytes
64 bytes from 10.1.1.1: seq=0 ttl=64 time=0.145 ms