An atomic upgrade process for OpenStack compute nodes

I have been working with container technology since September 2014, sorting out how they are useful in the context of OpenStack.  This led to my involvement in the Kolla project, a project to containerize OpenStack as well as Magnum, a project to provide containers as a service.  Containers are super useful as  an upgrade tool for OpenStack, and the main topic of this blog post.

Kolla began life as a project with dependencies on docker and kubernetes.  I wasn’t always certain the kubernetes dependency was necessary to provide container deployments in OpenStack, but I went with it.  Over time, we found kubernetes has a lot to offer OpenStack deployments.  But it lacks a few features which make it unsuitable to deploy “super privileged containers”.

A super privileged container is a container where one or more of the following are true:

  • The container’s processes wants to utilize the host network namespace – specifically –net=host flag.
  • The container’s processes wants to utilize bind mounting – that is mounting a directory from the host fle-system inside the container and share it.
  • The container’s processes wants to utilize the host pid namespace – specifically the –pid=host flag.

Kubernetes could be modified to allow super-privileged containers, but until that day comes, Kubernetes won’t be suitable for  running super-privileged containers.  There is no way to do these things with existing Kubernetes pod files, however, because they have runtime and privilege considerations – essentially they assume the operator trusts the application running in super-privileged mode with the possibility of rooting their entire datacenter.  The kubernetes maintainers have been unwilling to make these options available I suspect because of this concern.

I have spent several weeks researching upgrade of the compute node in nova-networking mode, which consists of a nova-network, nova-compute, and nova-libvirt process.  I started by borrowing the Kolla containers for nova-network and nova-compute and cloned them into a new compute-upgrade repo:

[root@bigiron docker]# ls -l nova-compute
drwxrwxr-x 2 sdake sdake 4096 Jan 28 13:32 nova-compute
drwxrwxr-x 2 sdake sdake 4096 Jan 28 13:27 nova-libvirt
drwxrwxr-x 2 sdake sdake 4096 Jan 21 17:59 nova-network

Each directory contains a container for example nova-compute contains:

[root@bigiron docker]# ls -l nova-compute/nova-compute
total 12
lrwxrwxrwx 1 sdake sdake  33 Jan 21 08:40 build -> ../../../tools/build-docker-image
-rwxrwxr-x 1 sdake sdake 394 Jan 21 08:40
-rw-rw-r-- 1 sdake sdake 365 Jan 28 13:06 Dockerfile
-rwxrwxr-x 1 sdake sdake  83 Jan 28 13:32
[root@bigiron docker]# 

Most of the hard work of this project was building the containers. Half way to victory using the cp command 🙂 Next I sorted out a run command that would run the various containers. I merged the 3 run commands into a script called start-compute.

First, a few directories must be shared for nova-libvirt:

  • /sys: To allow libvirt to communicate with systemd in the host process
  • /sys/fs/cgroup: To allow libvirt to share cgroup changes with the host process
  • /var/lib/libvirt: To allow libvirt and nova to share persistent data
  • /var/lib/nova: To allow libvirt and nova to share persistent data

Second, libvirt must be able to reparent processes to the init (pid=1) systemd process during an upgrade.  If it can’t do that operation, the libvirt qemu processes will have no parent during an upgrade.  Who would be their parent during an upgrade process, where libvirt had been killed? The answer lies in a brand-new docker feature allowing host namespace PID sharing.  In order to gain this super-privilege, the –pid=host flag must be used.

Third, nova-network, nova-libvirt, and nova-compute must share the host network namespace.  To obtain access to this super-privilege, the docker –pid=host operation must be used.

Finally some non-privileged environment variables must be passed to the container using the -e flag. A combination of these flags results in the following launch command:

sudo docker run -d --privileged -e "KEYSTONE_ADMIN_TOKEN=$PASSWORD" -e "NOVA_DB_PASSWORD=$PASSWORD" -e "RABBIT_PASSWORD=$PASSWORD" -e "RABBIT_USERID=stackrabbit" -e NETWORK_MANAGER="nova" -e "GLANCE_API_SERVICE_HOST=$SERVICE_HOST" -e "KEYSTONE_PUBLIC_SERVICE_HOST=$SERVICE_HOST" -e "RABBITMQ_SERVICE_HOST=$SERVICE_HOST" -e "NOVA_KEYSTONE_PASSWORD=$PASSWORD" -v /sys/fs/cgroup:/sys/fs/cgroup -v /var/lib/nova:/var/lib/nova --pid=host --net=host sdake/fedora-rdo-nova-libvirt

My testbed is a two node Fedora 21 cluster. One node runs devstack in nova-network mode. The remaining node simulates a compute node by running the containers produced in this repository with minimal other operating system services running. Note ebtables must be modprobed on the compute node in the host OS and libvirt must be disabled.

I can start the compute node by running start-compute:

[root@minime tools]# ./start-compute
[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         5 seconds ago       Up 3 seconds                            insane_leakey          
1365e60a7971        sdake/fedora-rdo-nova-libvirt:latest   "/"         12 seconds ago      Up 10 seconds                           desperate_bell         
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         14 seconds ago      Up 12 seconds                           desperate_mcclintock   

No QEMU processes are running:

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         

0 machines listed.

After running nova boot on the controller node:

[sdake@bigiron devstack]$ nova boot steaktwo --flavor m1.medium --image Fedora-x86_64-20-20140618-sda

One machine is found via machinectl. I’ll spare you the output of ps, but it is also present.

root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         
qemu-instance-00000001           vm        libvirt-qemu    

1 machines listed.

Now stopping the libvirt container:

[root@minime tools]# docker stop 1365e60a7971
[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         7 minutes ago       Up 7 minutes                            insane_leakey          
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         7

Now starting the ibvirt container:

docker ps[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
c8368083989e        sdake/fedora-rdo-nova-libvirt:latest   "/"         7 seconds ago       Up 5 seconds                            compassionate_fermat   
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         9 minutes ago       Up 9 minutes                            insane_leakey          
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         9 minutes ago       Up 9 minutes                            desperate_mcclintock

Now the compute VM can be terminated via nova after an upgrade:

[sdake@bigiron devstack]$ nova stop steaktwo

And the VM process disappears:

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         

0 machines listed.

Ok, so you just showed stopping and starting a container? where is the atomic part? Any container of OpenStack compute can be atomically upgraded as follows:

  • docker pull (to obtain new image)
  • docker stop
  • docker start

From the compute infrastructure, it looks like an atomic upgrade. No messy upgrades of a hundreds of RPM or DEB packages. Just replace a running image with a new image.

It is highly likely I will re-integrate this work into Kolla, since Kolla is the home for R&D related to launching OpenStack within containers. Unfortunately until kubernetes grows the required features, it is unsuitable for a deployment system for OpenStack compute nodes.

Isn’t it Atomic on OpenStack Ironic, don’t you think?

OpenStack Ironic is a bare metal as a service deployment tool.  Fedora Atomic is a µOS consisting of a very minimal installation of Linux,, Kubernetes and Docker.  Kubernetes is an endpoint manager and container scheduler, while Docker is a container manager.  The basic premise of Fedora Atomic using Ironic is to present a lightweight launching mechanism for OpenStack.

The first step in launching Atomic is to make Ironic operational.  I used devstack for my deployment.  The Ironic developer documentation is actually quite good for a recently Integrated OpenStack project.  I followed the instructions for devstack.  I used pxe+ssh, rather then the agent+ssh.  The pxe+ssh driver virtualizes bare-metal deployment for testing purposes, so only one machine is needed.  The machine should have 16GB+ of RAM.  I find 16GB a bit tight, however.

I found it necessary to hack devstack a bit to get Ironic to operate.  The root cause of the issue is that libvirt can’t write the console log to the home directory as specified in the localrc. To solve the problem I just hacked devstack to write the log files to /tmp. I am sure there is a more elegant way to solve this problem.

The diff of my devstack hack is:

[sdake@bigiron devstack]$ git diff
diff --git a/tools/ironic/scripts/create-node b/tools/ironic/scripts/create-node
index 25b53d4..5ba88ce 100755
--- a/tools/ironic/scripts/create-node
+++ b/tools/ironic/scripts/create-node
@@ -54,7 +54,7 @@ if [ -f /etc/debian_version ]; then
if [ -n "$LOGDIR" ] ; then
- VM_LOGGING="--console-log $LOGDIR/${NAME}_console.log"
+ VM_LOGGING="--console-log /tmp/${NAME}_console.log"

My devstack localrc contains:


disable_service horizon
disable_service rabbit
disable_service quantum
enable_service qpid
enable_service magnum

# Enable Ironic API and Ironic Conductor
enable_service ironic
enable_service ir-api
enable_service ir-cond

# Enable Neutron which is required by Ironic and disable nova-network.
disable_service n-net
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service neutron

# Create 3 virtual machines to pose as Ironic's baremetal nodes.

# The parameters below represent the minimum possible values to create
# functional nodes.

# Size of the ephemeral partition in GB. Use 0 for no ephemeral partition.

# By default, DevStack creates a network for instances.
# If this overlaps with the hosts network, you may adjust with the
# following.

# Log all output to files

It took me two days to sort out the project in this blog post, and during the process, I learned a whole lot about how Ironic operates by code inspection and debugging.  I couldn’t find much documentation about the deployment process so I thought I’d share a nugget of information about the deployment process:

  • Nova contacts Ironic to allocate an Ironic node providing the image to boot
  • Ironic pulls the image from glance and stores it on the local hard disk
  • Ironic boots a virtual machine via SSH with a PXE-enabled seabios BIOS
  • The seabios code asks Ironic’s tftpserver for a deploy ramdisk and kernel
  • The deployed node starts the deploy kernel and ramdisk
  • The deploy ramdisk does the following:
    1. Starts tgtd to present the root device as an iSCSI disk on the network
    2. Contacts the Ironic ReST API to initiate iSCSI transfer of the image
    3. Waits on port 10000 for a network connection to indicate the iSCSI transfer is complete
    4. Reboots the node once port 10000 has been opened and closed by a process
  • Once the deploy ramdisk contacts Ironic to initiate iSCSI transfer of the image Ironic does the following:
    1. uses iscsiadm to connect to the ISCSI target on the deploy hardware
    2. spawns several dd processes to copy the local disk image to the iSCSI target
    3. Once the dd processes exit successfully, Ironic contacts port 10000 on the deploy node
  • Ironic changes the PXEboot configuration to point to the user’s actual desired ramdisk and kernel
  • The deploy node reboots into SEABIOS again
  • The node boots the proper ramdisk and kernel, which load the disk image that was written via iSCSI

Fedora Atomic does not ship images that are suitable for use with the Ironic model.  Specifically what is needed is a LiveOS image, a ramdisk, and a kernel.  The LiveOS image that Fedora Cloud does ship is not the Atomic version.  Clearly it is early days for Atomic and I expect these requirements will be met as time passes.

But I wanted to deploy Atomic now on Ironic, so I sorted out making a PXE-bootable Atomic Live OS image.

First a bit about how the Atomic Cloud Image is structured:

[sdake@bigiron Downloads]$ guestfish

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
'man' to read the manual
'quit' to quit the shell

><fs> add-ro Fedora-Cloud-Atomic-20141203-21.x86_64.qcow2
><fs> run
><fs> list-filesystems
/dev/sda1: ext4
/dev/atomicos/root: xfs

The Atomic cloud image has /dev/sda1 containing the contents of the /boot directory.  The /dev/sda2 partition contains a LVM partition.  There is a logical volume called atomicos/root which contains the root filesystem.

Building the Fedora Atomic images for Ironic is as simple as extracting the ramdisk and kernel from /dev/sda1 and extracting /dev/sda2 into an image for Ironic to dd to the iSCSI target.  A bit complicating is that the fstab must have the /boot entry removed.  Determining how to do this was a bit of a challenge, but I wrote a script to automate the Ironic image generation process.

The first step is to test that Ironic actually installs via devstack using the above localrc:

[sdake@bigiron devstack]$ ./
bunch of output from devstack ommitted
Keystone is serving at
Examples on using novaclient command line is in
The default users are: admin and demo
The password: 123456
This is your host ip:

Next, take a look at the default image list which should look something like:

[sdake@bigiron devstack]$ source ./openrc admin admin
[sdake@bigiron devstack]$ glance image-list
| Name                            | Disk Format | Container Format | Size      |
| cirros-0.3.2-x86_64-disk        | qcow2       | bare             | 13167616  |
| cirros-0.3.2-x86_64-uec         | ami         | ami              | 25165824  |
| cirros-0.3.2-x86_64-uec-kernel  | aki         | aki              | 4969360   |
| cirros-0.3.2-x86_64-uec-ramdisk | ari         | ari              | 3723817   |
| Fedora-x86_64-20-20140618-sda   | qcow2       | bare             | 209649664 |
| ir-deploy-pxe_ssh.initramfs     | ari         | ari              | 95220206  |
| ir-deploy-pxe_ssh.kernel        | aki         | aki              | 5808960   |

In this case, we want to boot the UEC image. Ironic expects properties attached to the image ramdisk_id and kernel_id which are the UUIDs of cirros-0.3.2-x86_64-uec-kernel and cirros-0.3.2-x86_64-uec-ramdisk.

Running image-show, we can see these properties:

[sdake@bigiron devstack]$ glance image-show cirros-0.3.2-x86_64-uec 
| Property              | Value                                |
| Property 'kernel_id'  | c11bd198-227f-4156-9195-40b16278b65c |
| Property 'ramdisk_id' | 5e6839ef-daeb-4a1c-be36-3906ed4d7bd7 |
| checksum              | 4eada48c2843d2a262c814ddc92ecf2c     |
| container_format      | ami                                  |
| created_at            | 2014-12-09T14:56:05                  |
| deleted               | False                                |
| disk_format           | ami                                  |
| id                    | 259ca231-66ad-439d-900b-3dc9e9408a0c |
| is_public             | True                                 |
| min_disk              | 0                                    |
| min_ram               | 0                                    |
| name                  | cirros-0.3.2-x86_64-uec              |
| owner                 | 4b798efdcd5142509fe87b12d89d5949     |
| protected             | False                                |
| size                  | 25165824                             |
| status                | active                               |
| updated_at            | 2014-12-09T14:56:06                  |

Now that we have validated the cirros image is available, the next step is to launch one from the demo user:

[sdake@bigiron devstack]$ source ./openrc demo demo
[sdake@bigiron devstack]$ nova keypair-add --pub-key ~/.ssh/ steak
[sdake@bigiron devstack]$ nova boot --flavor baremetal --image cirros-0.3.2-x86_64-uec --key-name steak cirros_on_ironic
[sdake@bigiron devstack]$ nova list
| ID                                   | Name             | Status | Task State | Power State | Networks         |
| 9e64804d-264d-40d2-88f4-e858efe69557 | cirros_on_ironic | ACTIVE | -          | Running     | private= |
[sdake@bigiron devstack]$ ssh cirros@
$ uname -a
Linux cirros-on-ironic 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 GNU/Linux

If this part works, that means you have a working Ironic devstack setup. The next step is to get the Atomic images and convert them for use with Ironic.

[sdake@bigiron fedora-atomic-to-liveos-pxe]$ ./
Mounting boot and root filesystems.
Done mounting boot and root filesystems.
Removing boot from /etc/fstab.
Done removing boot from /etc/fstab.
Extracting kernel to fedora-atomic-kernel
Extracting ramdisk to fedora-atomic-ramdisk
Unmounting boot and root.
Creating a RAW image from QCOW2 image.
Extracting base image to fedora-atomic-base.
cut: invalid byte, character or field list
Try 'cut --help' for more information.
sfdisk: Disk fedora-atomic.raw: cannot get geometry
sfdisk: Disk fedora-atomic.raw: cannot get geometry
12171264+0 records in
12171264+0 records out
6231687168 bytes (6.2 GB) copied, 29.3357 s, 212 MB/s
Removing raw file.

The sfdisk: cannot get geometry warnings can be ignored.

After completion you should have fedora-atomic-kernel, fedora-atomic-ramdisk, and fedora-atomic-base files. Next we register these with glance:

[sdake@bigiron fedora-atomic-to-liveos-pxe]$ ls -l fedora-*
-rw-rw-r-- 1 sdake sdake 6231687168 Dec  9 08:59 fedora-atomic-base
-rwxr-xr-x 1 root  root     5751144 Dec  9 08:59 fedora-atomic-kernel
-rw-r--r-- 1 root  root    27320079 Dec  9 08:59 fedora-atomic-ramdisk
[sdake@bigiron fedora-atomic-to-liveos-pxe]$ glance image-create --name=fedora-atomic-kernel --container-format aki --disk-format aki --is-public True --file fedora-atomic-kernel
| Property         | Value                                |
| checksum         | 220c2e9d97c3f775effd2190199aa457     |
| container_format | aki                                  |
| created_at       | 2014-12-09T16:47:12                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | aki                                  |
| id               | b8e08b02-5eac-467d-80e1-6c8138d0bf57 |
| is_public        | True                                 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | fedora-atomic-kernel                 |
| owner            | a28b73a4f29044f184b854ffb7532ceb     |
| protected        | False                                |
| size             | 5751144                              |
| status           | active                               |
| updated_at       | 2014-12-09T16:47:12                  |
| virtual_size     | None                                 |
[sdake@bigiron fedora-atomic-to-liveos-pxe]$ glance image-create --name=fedora-atomic-ramdisk --container-format ari --is-public True --disk-format ari --file fedora-atomic-ramdisk
| Property         | Value                                |
| checksum         | 9ed72ddc0411e2f30d5bbe6b5c2c4047     |
| container_format | ari                                  |
| created_at       | 2014-12-09T16:48:31                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | ari                                  |
| id               | a62f6f32-ed66-4b18-8625-52d7262523f6 |
| is_public        | True                                 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | fedora-atomic-ramdisk                |
| owner            | a28b73a4f29044f184b854ffb7532ceb     |
| protected        | False                                |
| size             | 27320079                             |
| status           | active                               |
| updated_at       | 2014-12-09T16:48:31                  |
| virtual_size     | None                                 |
[sdake@bigiron fedora-atomic-to-liveos-pxe]$ glance image-create --name=fedora-atomic --container-format ami --disk-format ami --is-public True --property ramdisk_id=b2f60f33-9c8e-4905-a64b-90997d3dcb92 --property kernel_id=0e687b76-31d0-4351-a92a-a2d348482d42 --file fedora-atomic-base
| Property              | Value                                |
| Property 'kernel_id'  | 0e687b76-31d0-4351-a92a-a2d348482d42 |
| Property 'ramdisk_id' | b2f60f33-9c8e-4905-a64b-90997d3dcb92 |
| checksum              | 6a25f8bf17a94a6682d73b7de0a13013     |
| container_format      | ami                                  |
| created_at            | 2014-12-09T16:52:45                  |
| deleted               | False                                |
| deleted_at            | None                                 |
| disk_format           | ami                                  |
| id                    | d4ec78d7-445a-473d-9b7d-a1a6408aeed2 |
| is_public             | True                                 |
| min_disk              | 0                                    |
| min_ram               | 0                                    |
| name                  | fedora-atomic                        |
| owner                 | a28b73a4f29044f184b854ffb7532ceb     |
| protected             | False                                |
| size                  | 6231687168                           |
| status                | active                               |
| updated_at            | 2014-12-09T16:53:16                  |
| virtual_size          | None                                 |

Next we configure Ironic’s PXE boot config options and restart the ironic conductor in devstack. To restart Ironic conductor use screen -r, find the appropriate conductor screen, press CTRL-C, up arrow, ENTER. This will reload the configuration.

/etc/ironic/ironic.conf should be changed to have this config option:

pxe_append_params = nofb nomodeset vga=normal console=ttyS0 no_timer_check root=/dev/mapper/atomicos-root ostree=/ostree/boot.0/fedora-atomic/a002a2c2e44240db614e09e82c7822322253bfcaad0226f3ff9befb9f96d315f/0

Next we launch the fedora-atomic image using Nova’s baremetal flavor:

[sdake@bigiron ~]$ source /home/sdake/repos/devstack/openrc demo demo
[sdake@bigiron Downloads]$ nova boot --flavor baremetal --image fedora-atomic --key-name steak fedora_atomic_on_ironic
[sdake@bigiron Downloads]$ nova list
| ID                                   | Name                    | Status | Task State | Power State | Networks |
| e7f56931-307d-45a7-a232-c2fa70898cae | fedora-atomic_on_ironic | BUILD  | spawning   | NOSTATE     |          |

Finally login to the Atomic Host:

[sdake@bigiron ironic]$ nova list
| ID                                   | Name                    | Status | Task State | Power State | Networks         |
| d061c0ef-f8b7-4fff-845b-8272a7654f70 | fedora-atomic_on_ironic | ACTIVE | -          | Running     | private= |
[sdake@bigiron ironic]$ ssh fedora@
[fedora@fedora-atomic-on-ironic ~]$ uname -a
Linux fedora-atomic-on-ironic.novalocal 3.17.4-301.fc21.x86_64 #1 
SMP Thu Nov 27 19:09:10 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I found determining how to create the images from the Fedora Atomic Cloud images a bit tedious. The diskimage builder tool would likely make this easier, if it supported RPM-ostree and Atomic.

Ironic needs some work to allow the pxe options to override the “root” initrd parameter. Ideally a glance image property would be allowed to be specified to override and extend the boot options. I’ve filed an Ironic blueprint for such an improvement.