Adopt an op

Here’s a modest proposal: a program to pair up individual OpenStack developers with OpenStack operators to encourage better information flow from ops to devs.

Here’s how it might work. Operators with production OpenStack deployments would indicate that they would be willing to occasionally host a developer. The participating OpenStack developer would travel to the operator’s site for, say, a day or two, and shadow the operator. The dev would observe things like the kinds of maintenance tasks the op was doing, the kinds of tools they were using to do so, and so on.

After the visit was complete, the dev would write up and publish a report about what they learned, focusing in particular on observed pain points and anything that surprised them about what the operator did and how they did it. Finally, the dev would file any usability or other bugs against the relevant projects.

You could call it “Adopt an Op”. Although “Adopt a Dev” is probably more accurate, I think that the emphasis should be on the devs coming to the ops.

Up and running on Rackspace

Rackspace is now running a developer discount, so I thought I’d give them a try. Once I signed up for an account and got my credentials, here’s how I got up and running with the command-line tools. I got this info from Rackspace’s Getting Started guide.

First, install the OpenStack Compute client with rackspace extensions.

sudo pip install rackspace-novaclient

Next, create your openrc file, which will contain environment variables that the client will use to authenticate you against the Rackspace cloud. You’ll need the following information:

  1. A valid region. When you’re logged in to your account, you can see the region names. In the U.S., they are:
    • DFW (Dallas)
    • IAD (Northern Virginia)
    • ORD (Chicago)
  2. Your username (you picked this when you created your account)

  3. Your account number (appears in parentheses next to your username when you are logged in to the control panel at http://mycloud.rackspace.com)

  4. Your API key (click on your username in the control panel, then choose “Account Settings”, then “API Key: Show”)

Your openrc file should then look like this (here I’m using IAD as my region):

export OS_AUTH_URL=https://identity.api.rackspacecloud.com/v2.0/
export OS_AUTH_SYSTEM=rackspace
export OS_REGION_NAME=IAD
export OS_USERNAME=<your username>
export OS_TENANT_NAME=<your account number>
export NOVA_RAX_AUTH=1
export OS_PASSWORD=<your API key>
export OS_PROJECT_ID=<your account number>
export OS_NO_CACHE=1

Finally, source your openrc file and start interacting with the cloud. Here’s how I added my public key and booted an Ubuntu 13.04 server:

$ source openrc
$ nova keypair-add lorin --pub-key ~/.ssh/id_rsa.pub
$ nova boot --flavor 2 --image 1bbc5e56-ca2c-40a5-94b8-aa44822c3947 --key_name lorin raring
(wait a while)
$ nova list
+--------------------------------------+--------+--------+-------------------------------------------------------------------------------------+
| ID                                   | Name   | Status | Networks                                                                            |
+--------------------------------------+--------+--------+-------------------------------------------------------------------------------------+
| 7d432f76-491f-4245-b55c-2b15c2878ebb | raring | ACTIVE | public=2001:4802:7800:0001:f3bb:d4fc:ff20:06ab, 162.209.98.198; private=10.176.6.21 |
+--------------------------------------+--------+--------+-------------------------------------------------------------------------------------+

There were a couple of things that caught me by surprise.

First, nova console-log returns an error:

$ nova console-log raring
ERROR: There is no such action: os-getConsoleOutput (HTTP 400) (Request-ID: req-5ad0092b-6ff1-4233-b6aa-fc0920d42671)

Second, I had to ssh to the Ubuntu instance as root, not as the ubuntu user. In fact, the Ubuntu 13.04 image I booted doesn’t seem to have cloud-init installed, which surprised me. I’m not sure how the image is pulling my public key from the metadata service.
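
One quick way to test whether a metadata service is reachable at all is to query the EC2-style endpoint that nova normally serves at 169.254.169.254 from inside the instance. Here’s a minimal Python 2 sketch of that check (the endpoint address and path are the standard ones; everything else is illustrative):

import urllib2

# Ask the EC2-style metadata service for the registered public keys.
# If this times out, no metadata service is reachable from the guest.
URL = "http://169.254.169.254/latest/meta-data/public-keys/"
try:
    keys = urllib2.urlopen(URL, timeout=5).read()
    print("metadata service is reachable; keys: %s" % keys)
except Exception as exc:
    print("metadata service is unreachable: %s" % exc)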

EDIT: I can’t reach the metadata service from the instance, so I assume that there is no metadata service running, and that they are injecting the key directly into the filesystem.

Automated DevStack install inside of VirtualBox with Vagrant

If you’re interested in trying out DevStack, I wrote up some scripts for automatically deploying DevStack inside of a VirtualBox virtual machine using Vagrant: devstack-vm.

Assuming you have the prereqs installed, it’s just:

$ git clone https://github.com/lorin/devstack-vm
$ cd devstack-vm
$ chmod 0600 id_vagrant
$ vagrant up

In a few minutes, you’ll have a running version of DevStack, configured with Neutron. You can even reach your instances with floating IPs without having to ssh to the VirtualBox VM first. If you want to automatically boot a Cirros instance and attach a floating IP, just run the included Python script which uses the OpenStack Python bindings:

$ ./boot-cirros.py
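
For the curious, a script like that doesn’t need to do much. Here’s a rough sketch of the general approach using the novaclient Python bindings; this is an illustration rather than the actual contents of boot-cirros.py, and the image and flavor names are assumptions about the DevStack defaults:

import os
import time
from novaclient.v1_1 import client

# Credentials come from the environment, as set up by DevStack's openrc.
nova = client.Client(os.environ["OS_USERNAME"], os.environ["OS_PASSWORD"],
                     os.environ["OS_TENANT_NAME"], os.environ["OS_AUTH_URL"])

image = nova.images.find(name="cirros-0.3.1-x86_64-uec")  # assumed image name
flavor = nova.flavors.find(name="m1.tiny")
server = nova.servers.create("cirros", image, flavor)

# Wait for the instance to become ACTIVE before attaching a floating IP.
while nova.servers.get(server.id).status != "ACTIVE":
    time.sleep(2)

floating_ip = nova.floating_ips.create()  # allocate from the default pool
server.add_floating_ip(floating_ip)
print("instance reachable at %s" % floating_ip.ip)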

Edit: Added a line to chmod the private key

The challenge of the network

I can’t connect to the instance.

The Ubuntu precise and raring images I had downloaded from the Ubuntu website worked just fine. I was able to boot them cleanly inside of our new OpenStack deployment and ssh into them. But when I tried to launch a CentOS 6.4 instance, something went wrong. It booted up just fine, but the networking wasn’t working properly.

The guest’s console log revealed that the instance was not able to obtain an IP address via DHCP. Doing a tcpdump on the vif of the compute host confirmed that the DHCP reply packet was reaching it.

# tcpdump -i vnet0 -n port 67 or port 68
tcpdump: WARNING: vnet0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:13:30.699937 IP 10.40.0.2.68 > 10.40.0.1.67: BOOTP/DHCP, Request from fa:16:3e:6b:d3:44, length 300
19:13:30.700445 IP 10.40.0.1.67 > 10.40.0.2.68: BOOTP/DHCP, Reply, length 334

I tried doing the equivalent tcpdump command inside of the instance:

# tcpdump -i eth0 -n port 67 or port 68

I saw the DHCP request packets go out, but I didn’t see the DHCP reply packet. Somehow, that packet wasn’t making it across from the host to the guest.

Some Googling revealed that a DHCP failure may occur because of a missing checksum in one of the DHCP packets, and adding a rule to the iptables mangle table can resolve the issue:

iptables -A POSTROUTING -t mangle -p udp --dport bootpc -j CHECKSUM --checksum-fill

I added this rule to the compute host and to the network controller, but the problem remained.

If the DHCP reply packet was getting all of the way to the vif, it seemed like the overall OpenStack setup was ok. Brian had created this CentOS image manually; maybe there was something wrong with it? (It’s always the other guy’s fault, right?) I tried downloading another CentOS image from the links in the rackerjoe/oz-image-build github repository: same problem. I tried creating a CentOS 6.4 image from scratch, both manually and using Oz. All had the same issue: they booted just fine, but they weren’t able to get IP addresses via DHCP.

There was a suspicious message in the instance boot log.

eth0: IPv6 duplicate address fe80::f816:3eff:fe72:f86d detected!

I tried disabling IPv6 inside of the image, and turning off hairpin mode on the compute host, but that didn’t do it either. I tried turning virtio on and off, and loading and unloading the vhost_net module in the compute host. No effect.

I started trying to do some layer 2 connectivity testing. First, let’s make sure that arping is working. There’s an Ubuntu instance running at 10.40.0.2. Let’s try to arping it from the network controller, c220-1:

root@c220-1:~# arping 10.40.0.2 -I br100
ARPING 10.40.0.2 from 10.40.0.1 br100
Unicast reply from 10.40.0.2 [FA:16:3E:3F:19:9A] 1.036ms
Unicast reply from 10.40.0.2 [FA:16:3E:3F:19:9A] 0.981ms
Unicast reply from 10.40.0.2 [FA:16:3E:3F:19:9A] 0.970ms

It’s working; I’m getting replies. Next, I statically configure the CentOS instance with an IP address of 10.40.0.5 and try to arping it. The compute host is c220-2; we’ll do a tcpdump on the vif at the same time.

root@c220-1:~# arping 10.40.0.5 -I br100
ARPING 10.40.0.5 from 10.40.0.1 br100

root@c220-2:~# tcpdump -i vnet1 arp
tcpdump: WARNING: vnet1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:30:08.150634 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46
15:30:09.150705 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46
15:30:10.150790 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46

The tcpdump output confirmed that the packets were reaching the vif, but there was no response.

Now let’s try putting an IP address on br100 of the compute host, c220-2, and try to arping the guest.

root@c220-2:~# tcpdump -i vnet1 arp
tcpdump: WARNING: vnet1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet1, link-type EN10MB (Ethernet), capture size 65535 bytes
16:38:19.313931 ARP, Request who-has 10.40.0.5 tell 10.40.0.100, length 28
16:38:19.314133 ARP, Reply 10.40.0.5 is-at fa:16:3e:69:44:36 (oui Unknown), length 28

It works! This is peculiar. I can arping from the host to the guest, but I can’t arping from the network controller to the guest. I know that the packets are getting across the switch, because I see them when using tcpdump in the compute host.

But, why do the packets have different lengths in those two scenarios? If I look at the packet before it leaves the network node, it has a length of 28 bytes. When it arrives at the vif, it has 46 bytes.

root@c220-1:~# tcpdump -i eth1 arp
tcpdump: WARNING: eth1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:13:11.124579 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.30.0.131, length 28
root@c220-2:~# tcpdump -i vnet1 arp
tcpdump: WARNING: vnet1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:13:11.129685 ARP, Request who-has 10.40.0.5 (Broadcast) tell 10.30.0.131, length 46

Is it because of VLAN tagging? I’m running flat mode, so there shouldn’t be any VLAN tags added to the packets. Maybe it’s a switch configuration issue?

I check the configuration of the switch, a Cisco Nexus 3000. From what I can tell, it’s not configured to do VLAN tagging. There’s only one VLAN configured, and that’s the native VLAN. I’ve got vlan dot1q tag native disabled. I ask on Server Fault, and it turns out that 46 bytes is the minimum user data permitted in an Ethernet packet (a minimal frame is 64 bytes: 14 bytes of header, 46 bytes of payload, and a 4-byte frame check sequence), so it’s normal that padding is added to the 28-byte ARP message before the packet gets sent over the network.

But, it seems like the problem only occurs when that DHCP packet crosses the switch. What happens if I spoof the ARP request packet, but send it from the compute host instead of the network controller? More googling reveals a command-line tool called “packit” that can spoof packets like this.

When I use packit to spoof the network controller ARP request, the CentOS guest replies properly.

Real:

root@c220-1:~# arping -c1 -I br100 10.40.0.5

root@c220-2:~# tcpdump -ennvvXSs 1514 arp -i br100
tcpdump: listening on br100, link-type EN10MB (Ethernet), capture size 1514 bytes
23:53:11.398148 54:78:1a:86:50:c9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.40.0.5 (ff:ff:ff:ff:ff:ff) tell 10.40.0.1, length 46
0x0000: 0001 0800 0604 0001 5478 1a86 50c9 0a28 ........Tx..P..(
0x0010: 0001 ffff ffff ffff 0a28 0005 0000 0000 .........(......
0x0020: 0000 0000 0000 0000 0000 dac7 07ed ..............

Spoofed:

root@c220-2:~# packit -t ARP -c1 -i br100 -A 1 -y 10.40.0.5 -Y ff:ff:ff:ff:ff:ff -x 10.40.0.1 -X 54:78:1a:86:50:c9 -p '0x 00 00 00 00 00 00 00 00 00 00 00 00 00 00 da c7 07 ed' -e 54:78:1a:86:50:c9

root@c220-2# tcpdump -ennvvXSs 1514 arp -i br100
tcpdump: listening on br100, link-type EN10MB (Ethernet), capture size 1514 bytes
23:55:55.628147 54:78:1a:86:50:c9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.40.0.5 (ff:ff:ff:ff:ff:ff) tell 10.40.0.1, length 46
0x0000: 0001 0800 0604 0001 5478 1a86 50c9 0a28 ........Tx..P..(
0x0010: 0001 ffff ffff ffff 0a28 0005 0000 0000 .........(......
0x0020: 0000 0000 0000 0000 0000 dac7 07ed ..............

23:55:55.628474 fa:16:3e:69:44:36 > 54:78:1a:86:50:c9, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 10.40.0.5 is-at fa:16:3e:69:44:36, length 28
0x0000: 0001 0800 0604 0002 fa16 3e69 4436 0a28 ..........>iD6.(
0x0010: 0005 5478 1a86 50c9 0a28 0001 ..Tx..P..(..

The real and spoofed packets appear to be identical. The only difference is that one originated from the network controller, and the other originated from the compute host. And, yet, different responses. Maybe the packets differ in some way that tcpdump isn’t revealing?

Brian’s convinced that it’s a VLAN issue, and that tcpdump isn’t telling the whole story. I try loading the 8021q module inside the guest. When I do this and configure a static IP address, networking works! I can ping from the network controller to the guest. But DHCP still isn’t working.

Next, I try scapy for doing the spoofing. I write a quick scapy script that runs on the compute host, listens for the DHCP packets sent by the network controller, and re-transmits them. I know there are two DHCP packets that will be sent by the controller (DHCPOFFER, DHCPACK), so I do this twice:

from scapy.all import *

# Match DHCP server-to-client traffic (BOOTP, sport 67, dport 68) coming from
# the network controller's MAC address.
def lfilter(x):
  return x.haslayer(BOOTP) and x.sport==67 and x.dport==68 and x.src=='54:78:1a:86:50:c9'

conf.iface = 'vnet0'

# Wait for each of the two server-side DHCP packets (DHCPOFFER, then DHCPACK)
# and re-transmit it on the same interface, unmodified.
for i in range(2):
  x = sniff(iface='vnet0', lfilter=lfilter, count=1)
  sendp(x, iface="vnet0")

When the script runs, the instance receives an IP address! I’m just sniffing the packet and transmitting it again, without modifying it. Why would this work?

On the openstack-operators mailing list, Joe suggests running:

tcpdump -i eth0 -XX -vv -e

on the CentOS guest instance, both with the 8021q module unloaded and with it loaded.

This is the first time I’ve run tcpdump inside of the instance without restricting the output by, say, port or protocol. I don’t see any difference between when the module is loaded or not, but I do see this:

14:29:22.906758 54:78:1a:86:50:c9 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 64: vlan 0, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.40.0.5 (Broadcast) tell 10.40.0.1, length 46
0x0000: ffff ffff ffff 5478 1a86 50c9 8100 0000 ......Tx..P.....
0x0010: 0806 0001 0800 0604 0001 5478 1a86 50c9 ..........Tx..P.
0x0020: 0a28 0001 ffff ffff ffff 0a28 0005 0000 .(.........(....
0x0030: 0000 0000 0000 0000 0000 0000 dac7 07ed .................

It says “vlan 0, p 0”. What’s VLAN 0? If I do the same tcpdump on the compute host, it makes no mention of “vlan 0, p 0”.

It turns out that these are 802.1p packets, which have priority information in them: something was adding these priority tags as the packets moved from the network controller to the compute hosts. When using 802.1p, if there’s no VLAN tag, the convention is to put 0 in the VLAN ID field. The Linux kernel running in the host (3.2.0) handles these properly, but apparently the Linux kernel in the guest (2.6.32) doesn’t handle them.
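
With that knowledge, the behaviour should be straightforward to reproduce directly. Here’s a small scapy sketch (run on the compute host, using the same bridge and addresses as above) that sends one untagged ARP request and one wrapped in a priority tag, i.e. an 802.1Q header with VLAN ID 0:

from scapy.all import Ether, Dot1Q, ARP, sendp

arp = ARP(op="who-has", psrc="10.40.0.1", pdst="10.40.0.5")
bcast = Ether(dst="ff:ff:ff:ff:ff:ff")

# Untagged request: the 2.6.32 guest answers this one.
sendp(bcast / arp, iface="br100")

# Priority-tagged request (VLAN ID 0, priority 0): the guest ignores it.
sendp(bcast / Dot1Q(vlan=0, prio=0) / arp, iface="br100")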

My suspicion is that it was the network interface cards, Cisco UCS P81E virtual interface cards, that were adding these tags when they received the packets. Apparently, these cards are configured to modify the priority field of received packets by default.

In the end, I switched the OpenStack configuration to VlanManager, so that the compute host would explicitly strip the VLAN tags before passing packets in to the guest. That resolved the issue.
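
For reference, switching managers is done via the network_manager flag in nova.conf; something along these lines (the interface name and starting VLAN here are illustrative, not our actual values):

network_manager=nova.network.manager.VlanManager
vlan_interface=eth1
vlan_start=100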

This was a rare case, a problem that arose because of the combination of the host kernel, guest kernel, and the NIC configuration. But it illustrates how difficult it is to track down OpenStack networking problems, and how hard it can be to assist someone who cries out, “Help, I can’t connect to my instance!”

Operator fault tolerance

Because “cloud” has become such a buzzword, it’s tempting to dismiss cloud computing as nothing new. But one genuine change is the rise in software designed to work in an environment where hardware failures are expected. The classic example of this trend is the Netflix Chaos Monkey, which tests a software system by initiating random failures. The IT community calls this sort of system “highly available”, whereas the academic community prefers the term “fault tolerant”.

If you plan to deploy a system like an OpenStack cloud, you need to be aware of the failure modes of the system components (disk failures, power failures, networking issues), and ensure that your system can stay functional when these failures occur. However, when you actually deploy OpenStack on real hardware, you quickly discover that the component that is most likely to generate a fault is you, the operator. Because every installation is different, and because OpenStack has so many options, the probability of forgetting an option or specifying the incorrect value in a config file on the initial deployment is approximately one.

And while developers now design software to minimize the impact due to a hardware failure, there is no such notion of minimizing the impact due to an operator failure. This would require asking questions at development time such as: “What will happen if somebody puts ‘eth1’ instead of ‘eth0’ for public_interface in nova.conf? How would they determine what has gone wrong?”
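
To make that concrete, here’s a hypothetical sketch of what an operator-fault-tolerant check could look like: at startup, verify that the interface named in the config actually exists on the host, and fail with an actionable message instead of silently bringing up broken networking. (The option and interface names are just the ones from the example above.)

import os
import sys

def check_interface(option_name, interface):
    """Fail fast, with a useful message, if a configured interface is missing."""
    if not os.path.isdir("/sys/class/net/%s" % interface):
        available = ", ".join(sorted(os.listdir("/sys/class/net")))
        sys.exit("%s is set to '%s', but that interface does not exist on this "
                 "host. Available interfaces: %s" % (option_name, interface, available))

check_interface("public_interface", "eth1")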

Designing for operator fault tolerance would be a significant shift in thinking, but I would wager that the additional development effort would translate into enormous reductions in operations effort.