An atomic upgrade process for OpenStack compute nodes

I have been working with container technology since September 2014, sorting out how they are useful in the context of OpenStack.  This led to my involvement in the Kolla project, a project to containerize OpenStack as well as Magnum, a project to provide containers as a service.  Containers are super useful as  an upgrade tool for OpenStack, and the main topic of this blog post.

Kolla began life as a project with dependencies on docker and kubernetes.  I wasn’t always certain the kubernetes dependency was necessary to provide container deployments in OpenStack, but I went with it.  Over time, we found kubernetes has a lot to offer OpenStack deployments.  But it lacks a few features which make it unsuitable to deploy “super privileged containers”.

A super privileged container is a container where one or more of the following are true:

  • The container’s processes wants to utilize the host network namespace – specifically –net=host flag.
  • The container’s processes wants to utilize bind mounting – that is mounting a directory from the host fle-system inside the container and share it.
  • The container’s processes wants to utilize the host pid namespace – specifically the –pid=host flag.

Kubernetes could be modified to allow super-privileged containers, but until that day comes, Kubernetes won’t be suitable for  running super-privileged containers.  There is no way to do these things with existing Kubernetes pod files, however, because they have runtime and privilege considerations – essentially they assume the operator trusts the application running in super-privileged mode with the possibility of rooting their entire datacenter.  The kubernetes maintainers have been unwilling to make these options available I suspect because of this concern.

I have spent several weeks researching upgrade of the compute node in nova-networking mode, which consists of a nova-network, nova-compute, and nova-libvirt process.  I started by borrowing the Kolla containers for nova-network and nova-compute and cloned them into a new compute-upgrade repo:

[root@bigiron docker]# ls -l nova-compute
drwxrwxr-x 2 sdake sdake 4096 Jan 28 13:32 nova-compute
drwxrwxr-x 2 sdake sdake 4096 Jan 28 13:27 nova-libvirt
drwxrwxr-x 2 sdake sdake 4096 Jan 21 17:59 nova-network

Each directory contains a container for example nova-compute contains:

[root@bigiron docker]# ls -l nova-compute/nova-compute
total 12
lrwxrwxrwx 1 sdake sdake  33 Jan 21 08:40 build -> ../../../tools/build-docker-image
-rwxrwxr-x 1 sdake sdake 394 Jan 21 08:40
-rw-rw-r-- 1 sdake sdake 365 Jan 28 13:06 Dockerfile
-rwxrwxr-x 1 sdake sdake  83 Jan 28 13:32
[root@bigiron docker]# 

Most of the hard work of this project was building the containers. Half way to victory using the cp command 🙂 Next I sorted out a run command that would run the various containers. I merged the 3 run commands into a script called start-compute.

First, a few directories must be shared for nova-libvirt:

  • /sys: To allow libvirt to communicate with systemd in the host process
  • /sys/fs/cgroup: To allow libvirt to share cgroup changes with the host process
  • /var/lib/libvirt: To allow libvirt and nova to share persistent data
  • /var/lib/nova: To allow libvirt and nova to share persistent data

Second, libvirt must be able to reparent processes to the init (pid=1) systemd process during an upgrade.  If it can’t do that operation, the libvirt qemu processes will have no parent during an upgrade.  Who would be their parent during an upgrade process, where libvirt had been killed? The answer lies in a brand-new docker feature allowing host namespace PID sharing.  In order to gain this super-privilege, the –pid=host flag must be used.

Third, nova-network, nova-libvirt, and nova-compute must share the host network namespace.  To obtain access to this super-privilege, the docker –pid=host operation must be used.

Finally some non-privileged environment variables must be passed to the container using the -e flag. A combination of these flags results in the following launch command:

sudo docker run -d --privileged -e "KEYSTONE_ADMIN_TOKEN=$PASSWORD" -e "NOVA_DB_PASSWORD=$PASSWORD" -e "RABBIT_PASSWORD=$PASSWORD" -e "RABBIT_USERID=stackrabbit" -e NETWORK_MANAGER="nova" -e "GLANCE_API_SERVICE_HOST=$SERVICE_HOST" -e "KEYSTONE_PUBLIC_SERVICE_HOST=$SERVICE_HOST" -e "RABBITMQ_SERVICE_HOST=$SERVICE_HOST" -e "NOVA_KEYSTONE_PASSWORD=$PASSWORD" -v /sys/fs/cgroup:/sys/fs/cgroup -v /var/lib/nova:/var/lib/nova --pid=host --net=host sdake/fedora-rdo-nova-libvirt

My testbed is a two node Fedora 21 cluster. One node runs devstack in nova-network mode. The remaining node simulates a compute node by running the containers produced in this repository with minimal other operating system services running. Note ebtables must be modprobed on the compute node in the host OS and libvirt must be disabled.

I can start the compute node by running start-compute:

[root@minime tools]# ./start-compute
[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         5 seconds ago       Up 3 seconds                            insane_leakey          
1365e60a7971        sdake/fedora-rdo-nova-libvirt:latest   "/"         12 seconds ago      Up 10 seconds                           desperate_bell         
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         14 seconds ago      Up 12 seconds                           desperate_mcclintock   

No QEMU processes are running:

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         

0 machines listed.

After running nova boot on the controller node:

[sdake@bigiron devstack]$ nova boot steaktwo --flavor m1.medium --image Fedora-x86_64-20-20140618-sda

One machine is found via machinectl. I’ll spare you the output of ps, but it is also present.

root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         
qemu-instance-00000001           vm        libvirt-qemu    

1 machines listed.

Now stopping the libvirt container:

[root@minime tools]# docker stop 1365e60a7971
[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         7 minutes ago       Up 7 minutes                            insane_leakey          
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         7

Now starting the ibvirt container:

docker ps[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
c8368083989e        sdake/fedora-rdo-nova-libvirt:latest   "/"         7 seconds ago       Up 5 seconds                            compassionate_fermat   
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         9 minutes ago       Up 9 minutes                            insane_leakey          
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         9 minutes ago       Up 9 minutes                            desperate_mcclintock

Now the compute VM can be terminated via nova after an upgrade:

[sdake@bigiron devstack]$ nova stop steaktwo

And the VM process disappears:

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         

0 machines listed.

Ok, so you just showed stopping and starting a container? where is the atomic part? Any container of OpenStack compute can be atomically upgraded as follows:

  • docker pull (to obtain new image)
  • docker stop
  • docker start

From the compute infrastructure, it looks like an atomic upgrade. No messy upgrades of a hundreds of RPM or DEB packages. Just replace a running image with a new image.

It is highly likely I will re-integrate this work into Kolla, since Kolla is the home for R&D related to launching OpenStack within containers. Unfortunately until kubernetes grows the required features, it is unsuitable for a deployment system for OpenStack compute nodes.