The challenge of the network

I can’t connect to the instance.

The Ubuntu precise and raring images I had downloaded from the Ubuntu website worked just fine. I was able to boot them cleanly inside of our new OpenStack deployment and ssh into them. But when I tried to launch a CentOS 6.4 instance, something went wrong. It booted up just fine, but the networking wasn’t working properly.

The guest’s console log revealed that the instance was not able to obtain an IP address via DHCP. Doing a tcpdump on the vif of the compute host confirmed that the DHCP reply packet was reaching it.

# tcpdump -i vnet0 -n port 67 or port 68
tcpdump: WARNING: vnet0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:13:30.699937 IP 10.40.0.2.68 > 10.40.0.1.67: BOOTP/DHCP, Request from fa:16:3e:6b:d3:44, length 300
19:13:30.700445 IP 10.40.0.1.67 > 10.40.0.2.68: BOOTP/DHCP, Reply, length 334

I tried doing the equivalent tcpdump command inside of the instance:

# tcpdump -i eth0 -n port 67 or port 68

I saw the DHCP request packets go out, but I didn’t see the DHCP reply packet. Somehow, that packet wasn’t making it across from the host to the guest.

Some Googling revealed that a DHCP failure may occur because of a missing checksum in one of the DHCP packets, and adding a rule to the iptables mangle table can resolve the issue:

iptables -A POSTROUTING -t mangle -p udp --dport bootpc -j CHECKSUM --checksum-fill

I added this rule to the compute host and to the network controller, but the problem remained.
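
For context, "filling in" a checksum means computing the standard Internet checksum over the UDP pseudo-header, header, and payload. The Python below is only a sketch of that calculation, not what iptables runs internally. The function names and the all-zero 300-byte payload are placeholders made up for illustration.

```python
import socket
import struct

def internet_checksum(data):
    """One's-complement sum of 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def pseudo_header(src_ip, dst_ip, udp_length):
    """IPv4 pseudo-header that UDP includes in its checksum."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))

def build_udp_datagram(src_ip, dst_ip, src_port, dst_port, payload):
    """Return UDP header + payload with the checksum field filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, length) + header + payload)
    csum = csum or 0xFFFF        # a computed 0 is sent as 0xFFFF; 0 means "no checksum"
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload

# Placeholder payload: a real DHCP reply body would go here.
datagram = build_udp_datagram("10.40.0.1", "10.40.0.2", 67, 68, b"\x00" * 300)

# A receiver sums the same bytes, checksum included, and expects 0.
print(internet_checksum(pseudo_header("10.40.0.1", "10.40.0.2", len(datagram)) + datagram))
```

A datagram whose checksum field was left empty by an offloading path would fail that final check on a strict receiver, which is the failure mode the iptables rule is meant to paper over.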

If the DHCP reply packet was getting all of the way to the vif, it seemed like the overall OpenStack setup was ok. Brian had created this CentOS image manually; maybe there was something wrong with it? (It’s always the other guy’s fault, right?) I tried downloading another CentOS image from the links in the rackerjoe/oz-image-build GitHub repository: same problem. I tried creating a CentOS 6.4 image from scratch, both manually and using Oz. All had the same issue: they booted just fine, but they weren’t able to get IP addresses via DHCP.

There was a suspicious message in the instance boot log.

eth0: IPv6 duplicate address fe80::f816:3eff:fe72:f86d detected!

I tried disabling IPv6 inside of the image, and turning off hairpin mode on the compute host, but that didn’t do it either. I tried turning virtio on and off, and loading and unloading the vhost_net module in the compute host. No effect.

I started trying to do some layer 2 connectivity testing. First, let’s make sure that arping is working. There’s an Ubuntu instance running at 10.40.0.2. Let’s try to arping it from the network controller, c220-1:

root@c220-1:~# arping 10.40.0.2 -I br100
ARPING 10.40.0.2 from 10.40.0.1 br100
Unicast reply from 10.40.0.2 [FA:16:3E:3F:19:9A] 1.036ms
Unicast reply from 10.40.0.2 [FA:16:3E:3F:19:9A] 0.981ms
Unicast reply from 10.40.0.2 [FA:16:3E:3F:19:9A] 0.970ms

It’s working, I’m getting replies. Next, statically configure the CentOS instance with an IP address of 10.40.0.5 and try to arping it. The compute host is c220-2; we’ll run tcpdump on the vif at the same time.

root@c220-1:~# arping 10.40.0.5 -I br100
ARPING 10.40.0.5 from 10.40.0.1 br100

root@c220-2:~# tcpdump -i vnet1 arp
tcpdump: WARNING: vnet1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:30:08.150634 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46
15:30:09.150705 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46
15:30:10.150790 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46

The tcpdump output confirmed that the packets were reaching the vif, but there was no response.

Now let’s try putting an IP address on br100 of the compute host, c220-2, and try to arping the guest.

root@c220-2:~# tcpdump -i vnet1 arp
tcpdump: WARNING: vnet1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet1, link-type EN10MB (Ethernet), capture size 65535 bytes
16:38:19.313931 ARP, Request who-has 10.40.0.5 tell 10.40.0.100, length 28
16:38:19.314133 ARP, Reply 10.40.0.5 is-at fa:16:3e:69:44:36 (oui Unknown), length 28

It works! This is peculiar. I can arping from the host to the guest, but I can’t arping from the network controller to the guest. I know that the packets are getting across the switch, because I see them when using tcpdump in the compute host.

But, why do the packets have different lengths in those two scenarios? If I look at the packet before it leaves the network node, it has a length of 28 bytes. When it arrives at the vif, it has 46 bytes.

root@c220-1:~# tcpdump -i eth1 arp
tcpdump: WARNING: eth1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:13:11.124579 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.30.0.131, length 28

root@c220-2:~# tcpdump -i vnet1 arp
tcpdump: WARNING: vnet1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:13:11.129685 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.30.0.131, length 46

Is it because of VLAN tagging? I’m running flat mode, there shouldn’t be any VLAN tags added to the packets. Maybe it’s a switch configuration issue?

I check the configuration of the switch, a Cisco Nexus 3000. From what I can tell, it’s not configured to do VLAN tagging. There’s only one VLAN configured, and that’s the native VLAN. I’ve got vlan dot1q tag native disabled. I ask on Server Fault, and it turns out that 46 bytes is the minimum payload permitted in an Ethernet frame, so it’s normal that padding is added before the packet gets sent over the network.
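
The arithmetic is easy to check. Here is a minimal Python sketch that builds the 28-byte ARP request by hand and pads it to the Ethernet minimum. The helper names are hypothetical, the MAC address is the network controller's, and padding with zeros is an assumption made for illustration.

```python
import socket
import struct

ETH_HEADER_LEN = 14      # destination MAC + source MAC + EtherType
MIN_PAYLOAD_LEN = 46     # minimum Ethernet payload

def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build the 28-byte ARP request body (IPv4 over Ethernet)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request
        bytes.fromhex(sender_mac.replace(":", "")),
        socket.inet_aton(sender_ip),
        b"\xff" * 6,  # target MAC, as arping fills it in for a broadcast
        socket.inet_aton(target_ip),
    )

def pad_payload(payload):
    """Pad a payload with zeros up to the Ethernet minimum (assumed padding)."""
    return payload + b"\x00" * max(0, MIN_PAYLOAD_LEN - len(payload))

arp = build_arp_request("54:78:1a:86:50:c9", "10.40.0.1", "10.40.0.5")
padded = pad_payload(arp)

# ARP body, padded payload, and full frame length including the header
print(len(arp), len(padded), ETH_HEADER_LEN + len(padded))
```

The three numbers printed are 28, 46, and 60: the unpadded ARP body, the padded payload, and the padded payload plus the 14-byte Ethernet header.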

But, it seems like the problem only occurs when a packet crosses the switch. What happens if I spoof the ARP request packet, but send it from the compute host instead of the network controller? More Googling reveals a command-line tool called “packit” that can spoof packets like this.

When I use packit to spoof the network controller ARP request, the CentOS guest replies properly.

Real:

root@c220-1:~# arping -c1 -I br100 10.40.0.5

root@c220-2:~# tcpdump -ennvvXSs 1514 arp -i br100
tcpdump: listening on br100, link-type EN10MB (Ethernet), capture size 1514 bytes
23:53:11.398148 54:78:1a:86:50:c9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.40.0.5 (ff:ff:ff:ff:ff:ff) tell 10.40.0.1, length 46
0x0000: 0001 0800 0604 0001 5478 1a86 50c9 0a28 ........Tx..P..(
0x0010: 0001 ffff ffff ffff 0a28 0005 0000 0000 .........(......
0x0020: 0000 0000 0000 0000 0000 dac7 07ed ..............

Spoofed:

root@c220-2:~# packit -t ARP -c1 -i br100 -A 1 -y 10.40.0.5 -Y ff:ff:ff:ff:ff:ff -x 10.40.0.1 -X 54:78:1a:86:50:c9 -p '0x 00 00 00 00 00 00 00 00 00 00 00 00 00 00 da c7 07 ed' -e 54:78:1a:86:50:c9

root@c220-2# tcpdump -ennvvXSs 1514 arp -i br100
tcpdump: listening on br100, link-type EN10MB (Ethernet), capture size 1514 bytes
23:55:55.628147 54:78:1a:86:50:c9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.40.0.5 (ff:ff:ff:ff:ff:ff) tell 10.40.0.1, length 46
0x0000: 0001 0800 0604 0001 5478 1a86 50c9 0a28 ........Tx..P..(
0x0010: 0001 ffff ffff ffff 0a28 0005 0000 0000 .........(......
0x0020: 0000 0000 0000 0000 0000 dac7 07ed ..............

23:55:55.628474 fa:16:3e:69:44:36 > 54:78:1a:86:50:c9, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 10.40.0.5 is-at fa:16:3e:69:44:36, length 28
0x0000: 0001 0800 0604 0002 fa16 3e69 4436 0a28 ..........>iD6.(
0x0010: 0005 5478 1a86 50c9 0a28 0001 ..Tx..P..(..

The real and spoofed packets appear to be identical. The only difference is that one originated from the network controller, and the other originated from the compute host. And, yet, different responses. Maybe the packets are somehow different, but aren’t being revealed by tcpdump?

Brian’s convinced that it’s a VLAN issue, and that tcpdump isn’t telling the whole story. I try loading the 8021q module inside the guest. When I do this, and I configure a static IP address, then networking works! I can ping from the network controller to the guest. But DHCP still isn’t working.

Next, I try scapy for doing the spoofing. I write a quick scapy script that runs on the compute host, listens for the DHCP packets sent by the network controller, and re-transmits them. I know there are two DHCP packets that will be sent by the controller (DHCPOFFER, DHCPACK), so I do this twice:

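from scapy.all import *

# Match DHCP replies (server port 67 -> client port 68) whose Ethernet
# source is the network controller's MAC address.
def lfilter(x):
  return x.haslayer(BOOTP) and x.sport==67 and x.dport==68 and x.src=='54:78:1a:86:50:c9'

conf.iface = 'vnet0'

# Two replies are expected (DHCPOFFER, DHCPACK): capture each one on the
# vif and retransmit it unmodified.
for i in range(2):
  x = sniff(iface='vnet0', lfilter=lfilter, count=1)
  sendp(x, iface="vnet0")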

When the script runs, the instance receives an IP address! I’m just sniffing the packet and transmitting it again, without modifying it. Why would this work?

On the openstack-operators mailing list, Joe suggests doing:

tcpdump -i eth0 -XX -vv -e

on the CentOS guest instance, both with and without the 8021q module loaded.

This is the first time I’ve run tcpdump inside of the instance without restricting the output by, say, port, or protocol. I don’t see any difference between when the module is loaded or not, but I do see this:

14:29:22.906758 54:78:1a:86:50:c9 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 64: vlan 0, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46
0x0000: ffff ffff ffff 5478 1a86 50c9 8100 0000 ......Tx..P.....
0x0010: 0806 0001 0800 0604 0001 5478 1a86 50c9 ..........Tx..P.
0x0020: 0a28 0001 ffff ffff ffff 0a28 0005 0000 .(.........(....
0x0030: 0000 0000 0000 0000 0000 0000 dac7 07ed .................

It says “vlan 0, p 0”. What’s VLAN 0? If I do the same tcpdump on the compute host, it makes no mention of “vlan 0, p 0”.

It turns out that these are 802.1p packets, which have priority information in them: something was adding these priority tags when the packets were moving from the network controller to the compute hosts. When using 802.1p, if there’s no VLAN tag, then the convention is to put 0 there. The Linux kernel running in the host (3.2.0) handles these properly, but apparently the Linux kernel in the guest (2.6.32) doesn’t handle it.
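
To make the tag concrete, here is a small Python sketch that decodes the 802.1Q header from the frame bytes in the guest capture above. The function name is hypothetical; the field layout (3-bit priority, 1-bit DEI, 12-bit VLAN ID) is the standard 802.1Q tag format.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag

# Frame bytes copied from the tcpdump hex dump taken inside the guest.
FRAME = bytes.fromhex(
    "ffffffffffff54781a8650c981000000"
    "0806000108000604000154781a8650c9"
    "0a280001ffffffffffff0a2800050000"
    "000000000000000000000000dac707ed"
)

def parse_vlan_tag(frame):
    """Return the 802.1Q tag fields as a dict, or None if the frame is untagged."""
    # Bytes 12-13 follow the two MAC addresses and hold the EtherType/TPID.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != TPID_8021Q:
        return None
    # The 16-bit tag control field is followed by the real EtherType.
    tci, inner_ethertype = struct.unpack_from("!HH", frame, 14)
    return {
        "priority": tci >> 13,          # 802.1p priority code point
        "dei": (tci >> 12) & 0x1,       # drop eligible indicator
        "vlan_id": tci & 0x0FFF,        # 0 means "priority tag only"
        "inner_ethertype": inner_ethertype,
    }

tag = parse_vlan_tag(FRAME)
print(tag)
```

For this frame the sketch reports priority 0 and VLAN ID 0 with an inner EtherType of 0x0806 (ARP), which is what tcpdump summarizes as “vlan 0, p 0”.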

My suspicion is that it was the network interface cards, Cisco UCS P81E virtual interface cards, that were adding these tags when they received the packets. Apparently, these cards are configured to modify the priority field of received packets by default.

In the end, I switched the OpenStack configuration to the VLAN manager so that the compute host would explicitly remove VLAN tags before passing the packets in to the guest. It resolved the issue.
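
Conceptually, removing the tag just means dropping four bytes from the frame. The Python below is an illustration of that idea only, not how the Linux kernel or OpenStack implements it; the function name is hypothetical, and the frame bytes are the tagged ARP request seen inside the guest.

```python
# Tagged frame as captured inside the guest (64 bytes).
TAGGED = bytes.fromhex(
    "ffffffffffff54781a8650c981000000"
    "0806000108000604000154781a8650c9"
    "0a280001ffffffffffff0a2800050000"
    "000000000000000000000000dac707ed"
)

def strip_vlan_tag(frame):
    """Remove a single 802.1Q tag if present; return other frames unchanged."""
    # A tagged frame has 0x8100 where the EtherType normally sits (bytes 12-13),
    # followed by two bytes of priority/VLAN information.
    if len(frame) >= 18 and frame[12:14] == b"\x81\x00":
        return frame[:12] + frame[16:]
    return frame

untagged = strip_vlan_tag(TAGGED)

# Frame length before and after, plus the EtherType now at bytes 12-13
print(len(TAGGED), len(untagged), untagged[12:14].hex())
```

Stripping the tag takes the frame from 64 bytes back to 60, with the ARP EtherType (0806) directly after the MAC addresses, which is the shape the older guest kernel handled correctly.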

This was a rare case, a problem that arose because of the combination of the host kernel, guest kernel, and the NIC configuration. But it illustrates how difficult it is to track down OpenStack networking problems, and how hard it can be to assist someone who cries out, “Help, I can’t connect to my instance!”

Software Analysis

At McGill University, the computer engineering program evolved out of the electrical engineering program, so it was very EE-oriented. I was required to take four separate courses that involved (analog) circuit analysis: fundamentals of electrical engineering, circuit analysis, electronic circuits I, and electronic circuits II.

I’m struggling to think of what the equivalent of “circuit analysis” would be for software engineering. To keep the problem structure the same as circuit analysis, it would be something like: Given (a simplified model of?) a computer program, for a given input, what will the program output?

It’s hard to imagine even a single course in a software engineering program dedicated entirely to this type of manual “program analysis”. And yet, reading code is so important to what we do, and remains such a difficult task.

No Country for IT

Matt Welsh suggests that systems researchers should work on an escape from configuration hell. I’ve felt some of the pain he describes while managing a small handful of servers.

Coming from a software engineering background, I would have instinctively classified the problem of misconfiguration as a software engineering problem instead of a systems one. But really, it’s an IT problem more than anything else. And therein lies the tragedy: IT is not considered a respectable area of research in the computer science academic community.