Preserving container properties via volume mounts

In the Kolla project, we relied heavily on host bind mounts to share filesystem data between containers. A host bind mount is an operation where a host directory, such as /var/lib/mysql, is mounted directly into the container at a specific location.

The docker syntax for this operation is:

sudo docker run -d -v /var/lib/mysql:/var/lib/mysql -e MARIADB_ROOT_PASSWORD=password kollaglue/centos-rdo-mariadb-app

This pulls and starts the kollaglue/centos-rdo-mariadb-app container and bind mounts /var/lib/mysql from the host into the container at the same location. All containers started with this bind mount share the host’s /var/lib/mysql.

Through months of trial and error, we found bind mounting host directories to be highly suboptimal.

Containers exhibit three magic properties.

  • Containers are declarative in nature. A container either starts or fails to start, and should do so consistently. Even though containers typically run imperative code, the imperative nature is abstracted behind a declarative model. An imperative change in how the container starts can therefore remove this spectacular property. If the service relies on a database, or on data stored on the filesystem, the system becomes non-deterministic, and determinism is a major advantage of declarative programming.
  • Containers are immutable. The contents, once created, cannot be modified except by the container software itself. It is almost like composing an entire distribution, including compilers and library runtimes, as one binary to be run.
  • Containers should be idempotent. A container should be able to be re-run consistently without failing if it started correctly the first time.

Using a host bind mount weakens or destroys the three magic properties of containers. Docker, Inc. evidently recognized this problem and implemented docker data volume containers. A docker data container is a container that is started once and creates a docker volume. A docker volume is persistent storage created by the VOLUME instruction in a Dockerfile or the --volume option of docker run. Once the data container is created, its docker volume is always available to other docker containers through the --volumes-from option.

The following operation starts a data container based upon the centos image, creates a data volume called /var/lib/mysql, and finally runs /bin/true, which exits quickly:

docker run -d --name=mariadb_data --volume=/var/lib/mysql centos true

Next the container ID must be retrieved to start the application container:

sudo docker ps -a
CONTAINER ID   IMAGE           COMMAND  CREATED         STATUS                    PORTS  NAMES
56361937ac79    centos:latest  "true"   10 minutes ago  Exited (0) 10 minutes ago        mariadb_data

Next we run the mariadb-app container using the --volumes-from feature. Note that docker allows a short-hand specification of the container ID, so in this example 56 refers to the centos data container 56361937ac79:

sudo docker run -d --volumes-from=56 -e MARIADB_ROOT_PASSWORD=password kollaglue/centos-rdo-mariadb-app

When using data volume containers, all the correct permissions are sorted out by docker automatically. Data is shared between containers. Most importantly it is more difficult to modify the container’s volume contents from outside the container. All of these benefits help preserve the declarative, immutable, and idempotent properties of containers.
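
The two-step pattern above can be sketched as a pair of small Python helpers that only build the command lines; nothing is executed. The helper names are hypothetical. The sketch refers to the data container by its name (mariadb_data) rather than by the ID prefix, on the assumption that --volumes-from accepts either.

```python
import shlex


def data_container_cmd(name, volume, image="centos"):
    """Build the command that creates a data volume container.

    The container runs `true`, exits immediately, and leaves its volume behind.
    """
    return ["docker", "run", "-d", "--name=" + name, "--volume=" + volume, image, "true"]


def app_container_cmd(data_container, image, env):
    """Build the command that starts an application container which
    reuses the volumes of an existing data container."""
    cmd = ["docker", "run", "-d", "--volumes-from=" + data_container]
    for key, value in sorted(env.items()):
        cmd += ["-e", "{}={}".format(key, value)]
    cmd.append(image)
    return cmd


# Print the two commands in the order they would be run.
print(shlex.join(data_container_cmd("mariadb_data", "/var/lib/mysql")))
print(shlex.join(app_container_cmd(
    "mariadb_data",
    "kollaglue/centos-rdo-mariadb-app",
    {"MARIADB_ROOT_PASSWORD": "password"},
)))
```

Referring to the data container by name removes the need to look up the container ID with docker ps first.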

We also use data containers for nova-compute in Kolla.  We still continue to use bind mounts in some circumstances.  For example, nova-api needs to run modprobe to load kernel modules.  To support that we allow bind mounting of /var/lib/modules:/var/lib/modules with the :ro (read only) flag.

We also continue to have some container-writeable bind mounts. The nova-libvirt container requires /sys/fs/cgroup:/sys/fs/cgroup to be bind mounted. Some types of super privileged containers cannot get away from bind mounts, but most of the Kolla system now runs without them.

An atomic upgrade process for OpenStack compute nodes

I have been working with container technology since September 2014, sorting out how it is useful in the context of OpenStack. This led to my involvement in Kolla, a project to containerize OpenStack, as well as Magnum, a project to provide containers as a service. Containers are super useful as an upgrade tool for OpenStack, which is the main topic of this blog post.

Kolla began life as a project with dependencies on docker and kubernetes. I wasn’t always certain the kubernetes dependency was necessary to provide container deployments in OpenStack, but I went with it. Over time, we found kubernetes has a lot to offer OpenStack deployments. But it lacks a few features, which makes it unsuitable for deploying “super privileged containers”.

A super privileged container is a container where one or more of the following are true:

  • The container’s processes want to use the host network namespace, specifically the --net=host flag.
  • The container’s processes want to use bind mounting, that is, mounting a directory from the host file system inside the container and sharing it.
  • The container’s processes want to use the host pid namespace, specifically the --pid=host flag.
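
As a rough illustration of this definition, the sketch below inspects a docker run argument list and reports which of the three super-privileges it requests. The function names are hypothetical, and the parsing only covers the flag spellings used in this post (--net=host, --pid=host, -v and --volume); it assumes a bind mount is a volume spec of the form /host/path:/container/path.

```python
def _is_bind_mount(spec):
    """A volume spec is treated as a bind mount when it has a host path
    before the colon, e.g. /var/lib/nova:/var/lib/nova or /lib:/lib:ro."""
    parts = spec.split(":")
    return len(parts) >= 2 and parts[0].startswith("/")


def super_privileges(docker_args):
    """Return the set of super-privileges requested by `docker run` arguments."""
    found = set()
    args = iter(docker_args)
    for arg in args:
        if arg == "--net=host":
            found.add("host network namespace")
        elif arg == "--pid=host":
            found.add("host pid namespace")
        elif arg in ("-v", "--volume"):
            # The volume spec is the next argument.
            if _is_bind_mount(next(args, "")):
                found.add("bind mount")
        elif arg.startswith("--volume="):
            if _is_bind_mount(arg[len("--volume="):]):
                found.add("bind mount")
    return found


print(sorted(super_privileges(["-v", "/sys/fs/cgroup:/sys/fs/cgroup", "--pid=host", "--net=host"])))
```

A data volume such as --volume=/var/lib/mysql has no host path, so it is not counted as a bind mount.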

Kubernetes could be modified to allow super-privileged containers, but until that day comes, it won’t be suitable for running them. There is currently no way to express these options in Kubernetes pod files, because they carry runtime and privilege implications: essentially, they assume the operator trusts the application running in super-privileged mode with the possibility of rooting their entire datacenter. I suspect this concern is why the kubernetes maintainers have been unwilling to make these options available.

I have spent several weeks researching upgrade of the compute node in nova-networking mode, which consists of a nova-network, nova-compute, and nova-libvirt process.  I started by borrowing the Kolla containers for nova-network and nova-compute and cloned them into a new compute-upgrade repo:

[root@bigiron docker]# ls -l nova-compute
drwxrwxr-x 2 sdake sdake 4096 Jan 28 13:32 nova-compute
drwxrwxr-x 2 sdake sdake 4096 Jan 28 13:27 nova-libvirt
drwxrwxr-x 2 sdake sdake 4096 Jan 21 17:59 nova-network

Each directory contains a container. For example, nova-compute contains:

[root@bigiron docker]# ls -l nova-compute/nova-compute
total 12
lrwxrwxrwx 1 sdake sdake  33 Jan 21 08:40 build -> ../../../tools/build-docker-image
-rwxrwxr-x 1 sdake sdake 394 Jan 21 08:40
-rw-rw-r-- 1 sdake sdake 365 Jan 28 13:06 Dockerfile
-rwxrwxr-x 1 sdake sdake  83 Jan 28 13:32
[root@bigiron docker]# 

Most of the hard work of this project was building the containers. Half way to victory using the cp command 🙂 Next I sorted out a run command that would run the various containers. I merged the 3 run commands into a script called start-compute.

First, a few directories must be shared for nova-libvirt:

  • /sys: To allow libvirt to communicate with systemd in the host process
  • /sys/fs/cgroup: To allow libvirt to share cgroup changes with the host process
  • /var/lib/libvirt: To allow libvirt and nova to share persistent data
  • /var/lib/nova: To allow libvirt and nova to share persistent data

Second, libvirt must be able to reparent processes to the init (pid=1) systemd process during an upgrade. If it can’t do that operation, the libvirt qemu processes will have no parent during an upgrade. Who would be their parent during an upgrade process, where libvirt had been killed? The answer lies in a brand-new docker feature allowing host namespace PID sharing. In order to gain this super-privilege, the --pid=host flag must be used.

Third, nova-network, nova-libvirt, and nova-compute must share the host network namespace. To obtain access to this super-privilege, the docker --net=host flag must be used.

Finally, some non-privileged environment variables must be passed to the container using the -e flag. A combination of these flags results in the following launch command:

sudo docker run -d --privileged -e "KEYSTONE_ADMIN_TOKEN=$PASSWORD" -e "NOVA_DB_PASSWORD=$PASSWORD" -e "RABBIT_PASSWORD=$PASSWORD" -e "RABBIT_USERID=stackrabbit" -e NETWORK_MANAGER="nova" -e "GLANCE_API_SERVICE_HOST=$SERVICE_HOST" -e "KEYSTONE_PUBLIC_SERVICE_HOST=$SERVICE_HOST" -e "RABBITMQ_SERVICE_HOST=$SERVICE_HOST" -e "NOVA_KEYSTONE_PASSWORD=$PASSWORD" -v /sys/fs/cgroup:/sys/fs/cgroup -v /var/lib/nova:/var/lib/nova --pid=host --net=host sdake/fedora-rdo-nova-libvirt

My testbed is a two node Fedora 21 cluster. One node runs devstack in nova-network mode. The remaining node simulates a compute node by running the containers produced in this repository with minimal other operating system services running. Note ebtables must be modprobed on the compute node in the host OS and libvirt must be disabled.

I can start the compute node by running start-compute:

[root@minime tools]# ./start-compute
[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         5 seconds ago       Up 3 seconds                            insane_leakey          
1365e60a7971        sdake/fedora-rdo-nova-libvirt:latest   "/"         12 seconds ago      Up 10 seconds                           desperate_bell         
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         14 seconds ago      Up 12 seconds                           desperate_mcclintock   

No QEMU processes are running:

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         

0 machines listed.

After running nova boot on the controller node:

[sdake@bigiron devstack]$ nova boot steaktwo --flavor m1.medium --image Fedora-x86_64-20-20140618-sda

One machine is found via machinectl. I’ll spare you the output of ps, but it is also present.

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         
qemu-instance-00000001           vm        libvirt-qemu    

1 machines listed.

Now stopping the libvirt container:

[root@minime tools]# docker stop 1365e60a7971
[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         7 minutes ago       Up 7 minutes                            insane_leakey          
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         7

Now starting the libvirt container:

[root@minime tools]# docker ps
CONTAINER ID        IMAGE                                  COMMAND             CREATED             STATUS              PORTS               NAMES
c8368083989e        sdake/fedora-rdo-nova-libvirt:latest   "/"         7 seconds ago       Up 5 seconds                            compassionate_fermat   
08a20c056078        sdake/fedora-rdo-nova-compute:latest   "/"         9 minutes ago       Up 9 minutes                            insane_leakey          
c80b0c9b38ef        sdake/fedora-rdo-nova-network:latest   "/"         9 minutes ago       Up 9 minutes                            desperate_mcclintock

Now the compute VM can be terminated via nova after an upgrade:

[sdake@bigiron devstack]$ nova stop steaktwo

And the VM process disappears:

[root@minime tools]# machinectl
MACHINE                          CONTAINER SERVICE         

0 machines listed.

OK, so you just showed stopping and starting a container. Where is the atomic part? Any container of OpenStack compute can be atomically upgraded as follows:

  • docker pull (to obtain new image)
  • docker stop
  • docker start
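
The sequence above can be sketched as a Python helper that builds the commands without running them. One assumption is made loudly here: since docker start on a stopped container reuses the image that container was created from, the sketch interprets the final step as removing the old container and running a fresh one from the newly pulled image. The function name, the container name, and the run_args parameter are hypothetical.

```python
def upgrade_commands(image, container, run_args=()):
    """Build the command sequence for replacing a running container
    with one created from a freshly pulled image.

    Assumption: "docker start" in the list above is realised as
    `docker rm` followed by `docker run`, so the new image is used.
    """
    return [
        ["docker", "pull", image],
        ["docker", "stop", container],
        ["docker", "rm", container],
        ["docker", "run", "-d", "--name=" + container, *run_args, image],
    ]


for cmd in upgrade_commands(
    "sdake/fedora-rdo-nova-libvirt",
    "nova-libvirt",
    ["--privileged", "--pid=host", "--net=host"],
):
    print(" ".join(cmd))
```

Pulling before stopping keeps the window in which the service is down as short as possible.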

From the perspective of the compute infrastructure, it looks like an atomic upgrade. No messy upgrades of hundreds of RPM or DEB packages. Just replace a running image with a new image.

It is highly likely I will re-integrate this work into Kolla, since Kolla is the home for R&D related to launching OpenStack within containers. Unfortunately, until kubernetes grows the required features, it is unsuitable as a deployment system for OpenStack compute nodes.

Isn’t it Atomic on OpenStack Ironic, don’t you think?

OpenStack Ironic is a bare metal as a service deployment tool. Fedora Atomic is a µOS consisting of a very minimal installation of Linux, Kubernetes and Docker. Kubernetes is an endpoint manager and container scheduler, while Docker is a container manager. The basic premise of Fedora Atomic using Ironic is to present a lightweight launching mechanism for OpenStack.

The first step in launching Atomic is to make Ironic operational. I used devstack for my deployment. The Ironic developer documentation is actually quite good for a recently Integrated OpenStack project. I followed the instructions for devstack. I used the pxe+ssh driver rather than agent+ssh. The pxe+ssh driver virtualizes bare-metal deployment for testing purposes, so only one machine is needed. The machine should have 16GB+ of RAM. I find 16GB a bit tight, however.

I found it necessary to hack devstack a bit to get Ironic to operate.  The root cause of the issue is that libvirt can’t write the console log to the home directory as specified in the localrc. To solve the problem I just hacked devstack to write the log files to /tmp. I am sure there is a more elegant way to solve this problem.

The diff of my devstack hack is:

[sdake@bigiron devstack]$ git diff
diff --git a/tools/ironic/scripts/create-node b/tools/ironic/scripts/create-node
index 25b53d4..5ba88ce 100755
--- a/tools/ironic/scripts/create-node
+++ b/tools/ironic/scripts/create-node
@@ -54,7 +54,7 @@ if [ -f /etc/debian_version ]; then
if [ -n "$LOGDIR" ] ; then
- VM_LOGGING="--console-log $LOGDIR/${NAME}_console.log"
+ VM_LOGGING="--console-log /tmp/${NAME}_console.log"

My devstack localrc contains:


disable_service horizon
disable_service rabbit
disable_service quantum
enable_service qpid
enable_service magnum

# Enable Ironic API and Ironic Conductor
enable_service ironic
enable_service ir-api
enable_service ir-cond

# Enable Neutron which is required by Ironic and disable nova-network.
disable_service n-net
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service neutron

# Create 3 virtual machines to pose as Ironic's baremetal nodes.

# The parameters below represent the minimum possible values to create
# functional nodes.

# Size of the ephemeral partition in GB. Use 0 for no ephemeral partition.

# By default, DevStack creates a network for instances.
# If this overlaps with the hosts network, you may adjust with the
# following.

# Log all output to files

It took me two days to sort out the project in this blog post, and during the process I learned a whole lot about how Ironic operates by code inspection and debugging. I couldn’t find much documentation about the deployment process, so I thought I’d share what I learned:

  • Nova contacts Ironic to allocate an Ironic node, providing the image to boot
  • Ironic pulls the image from glance and stores it on the local hard disk
  • Ironic boots a virtual machine via SSH with a PXE-enabled SeaBIOS BIOS
  • The SeaBIOS code asks Ironic’s TFTP server for a deploy ramdisk and kernel
  • The deployed node starts the deploy kernel and ramdisk
  • The deploy ramdisk does the following:
    1. Starts tgtd to present the root device as an iSCSI disk on the network
    2. Contacts the Ironic REST API to initiate iSCSI transfer of the image
    3. Waits on port 10000 for a network connection to indicate the iSCSI transfer is complete
    4. Reboots the node once port 10000 has been opened and closed by a process
  • Once the deploy ramdisk contacts Ironic to initiate iSCSI transfer of the image, Ironic does the following:
    1. Uses iscsiadm to connect to the iSCSI target on the deploy hardware
    2. Spawns several dd processes to copy the local disk image to the iSCSI target
    3. Contacts port 10000 on the deploy node once the dd processes exit successfully
  • Ironic changes the PXE boot configuration to point to the user’s actual desired ramdisk and kernel
  • The deploy node reboots into SeaBIOS again
  • The node boots the proper ramdisk and kernel, which load the disk image that was written via iSCSI
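
The "port 10000" completion signal in the steps above is just a TCP connection that is opened and then closed. A minimal sketch of that handshake, using plain Python sockets rather than Ironic’s actual code, could look like the following. The function names are hypothetical, and the demo binds an ephemeral localhost port instead of 10000 so that it can run anywhere.

```python
import socket
import threading


def wait_for_done_signal(server_sock):
    """Deploy-ramdisk side: block until a peer opens and closes a connection."""
    conn, _addr = server_sock.accept()
    with conn:
        # recv() returns b"" once the peer has closed its end.
        while conn.recv(1024):
            pass
    return True


def send_done_signal(host, port):
    """Conductor side: open a connection and close it straight away."""
    with socket.create_connection((host, port)):
        pass


def demo():
    """Run both sides on localhost and report whether the signal arrived."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port; the post describes port 10000
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    waiter = threading.Thread(
        target=lambda: result.update(done=wait_for_done_signal(server))
    )
    waiter.start()
    send_done_signal("127.0.0.1", port)
    waiter.join(timeout=5)
    server.close()
    return result.get("done", False)


print(demo())
```

In the real flow, the waiting side would reboot the node once the function returns.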

Fedora Atomic does not ship images that are suitable for use with the Ironic model.  Specifically what is needed is a LiveOS image, a ramdisk, and a kernel.  The LiveOS image that Fedora Cloud does ship is not the Atomic version.  Clearly it is early days for Atomic and I expect these requirements will be met as time passes.

But I wanted to deploy Atomic now on Ironic, so I sorted out making a PXE-bootable Atomic Live OS image.

First a bit about how the Atomic Cloud Image is structured:

[sdake@bigiron Downloads]$ guestfish

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
'man' to read the manual
'quit' to quit the shell

><fs> add-ro Fedora-Cloud-Atomic-20141203-21.x86_64.qcow2
><fs> run
><fs> list-filesystems
/dev/sda1: ext4
/dev/atomicos/root: xfs

The Atomic cloud image has /dev/sda1 containing the contents of the /boot directory.  The /dev/sda2 partition contains a LVM partition.  There is a logical volume called atomicos/root which contains the root filesystem.

Building the Fedora Atomic images for Ironic is as simple as extracting the ramdisk and kernel from /dev/sda1 and extracting /dev/sda2 into an image for Ironic to dd to the iSCSI target. One complication is that the fstab must have the /boot entry removed. Determining how to do this was a bit of a challenge, but I wrote a script to automate the Ironic image generation process.
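
The fstab edit mentioned above can be sketched in a few lines of Python. This is not the conversion script itself, only an illustration of the idea: drop any non-comment line whose mount point (second field) is /boot. The function name and the sample fstab contents are hypothetical.

```python
def remove_boot_entry(fstab_text):
    """Return fstab contents with the /boot mount entry removed.

    Comment lines and all other entries are kept unchanged.
    """
    kept = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if not is_comment and len(fields) >= 2 and fields[1] == "/boot":
            continue  # skip the /boot entry
        kept.append(line)
    return "".join(line + "\n" for line in kept)


# Hypothetical fstab, for illustration only.
sample = (
    "# /etc/fstab\n"
    "/dev/mapper/atomicos-root / xfs defaults 0 0\n"
    "UUID=0000-0000 /boot ext4 defaults 1 2\n"
)
print(remove_boot_entry(sample), end="")
```

Matching on the mount point field rather than on the device name means the sketch does not depend on how the boot partition is identified.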

The first step is to test that Ironic actually installs via devstack using the above localrc:

[sdake@bigiron devstack]$ ./
bunch of output from devstack omitted
Keystone is serving at
Examples on using novaclient command line is in
The default users are: admin and demo
The password: 123456
This is your host ip:

Next, take a look at the default image list which should look something like:

[sdake@bigiron devstack]$ source ./openrc admin admin
[sdake@bigiron devstack]$ glance image-list
| Name                            | Disk Format | Container Format | Size      |
| cirros-0.3.2-x86_64-disk        | qcow2       | bare             | 13167616  |
| cirros-0.3.2-x86_64-uec         | ami         | ami              | 25165824  |
| cirros-0.3.2-x86_64-uec-kernel  | aki         | aki              | 4969360   |
| cirros-0.3.2-x86_64-uec-ramdisk | ari         | ari              | 3723817   |
| Fedora-x86_64-20-20140618-sda   | qcow2       | bare             | 209649664 |
| ir-deploy-pxe_ssh.initramfs     | ari         | ari              | 95220206  |
| ir-deploy-pxe_ssh.kernel        | aki         | aki              | 5808960   |

In this case, we want to boot the UEC image. Ironic expects the image to carry kernel_id and ramdisk_id properties, which are the UUIDs of cirros-0.3.2-x86_64-uec-kernel and cirros-0.3.2-x86_64-uec-ramdisk respectively.

Running image-show, we can see these properties:

[sdake@bigiron devstack]$ glance image-show cirros-0.3.2-x86_64-uec 
| Property              | Value                                |
| Property 'kernel_id'  | c11bd198-227f-4156-9195-40b16278b65c |
| Property 'ramdisk_id' | 5e6839ef-daeb-4a1c-be36-3906ed4d7bd7 |
| checksum              | 4eada48c2843d2a262c814ddc92ecf2c     |
| container_format      | ami                                  |
| created_at            | 2014-12-09T14:56:05                  |
| deleted               | False                                |
| disk_format           | ami                                  |
| id                    | 259ca231-66ad-439d-900b-3dc9e9408a0c |
| is_public             | True                                 |
| min_disk              | 0                                    |
| min_ram               | 0                                    |
| name                  | cirros-0.3.2-x86_64-uec              |
| owner                 | 4b798efdcd5142509fe87b12d89d5949     |
| protected             | False                                |
| size                  | 25165824                             |
| status                | active                               |
| updated_at            | 2014-12-09T14:56:06                  |

Now that we have validated the cirros image is available, the next step is to launch one from the demo user:

[sdake@bigiron devstack]$ source ./openrc demo demo
[sdake@bigiron devstack]$ nova keypair-add --pub-key ~/.ssh/ steak
[sdake@bigiron devstack]$ nova boot --flavor baremetal --image cirros-0.3.2-x86_64-uec --key-name steak cirros_on_ironic
[sdake@bigiron devstack]$ nova list
| ID                                   | Name             | Status | Task State | Power State | Networks         |
| 9e64804d-264d-40d2-88f4-e858efe69557 | cirros_on_ironic | ACTIVE | -          | Running     | private= |
[sdake@bigiron devstack]$ ssh cirros@
$ uname -a
Linux cirros-on-ironic 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 GNU/Linux

If this part works, that means you have a working Ironic devstack setup. The next step is to get the Atomic images and convert them for use with Ironic.

[sdake@bigiron fedora-atomic-to-liveos-pxe]$ ./
Mounting boot and root filesystems.
Done mounting boot and root filesystems.
Removing boot from /etc/fstab.
Done removing boot from /etc/fstab.
Extracting kernel to fedora-atomic-kernel
Extracting ramdisk to fedora-atomic-ramdisk
Unmounting boot and root.
Creating a RAW image from QCOW2 image.
Extracting base image to fedora-atomic-base.
cut: invalid byte, character or field list
Try 'cut --help' for more information.
sfdisk: Disk fedora-atomic.raw: cannot get geometry
sfdisk: Disk fedora-atomic.raw: cannot get geometry
12171264+0 records in
12171264+0 records out
6231687168 bytes (6.2 GB) copied, 29.3357 s, 212 MB/s
Removing raw file.

The sfdisk: cannot get geometry warnings can be ignored.

After completion you should have fedora-atomic-kernel, fedora-atomic-ramdisk, and fedora-atomic-base files. Next we register these with glance:

[sdake@bigiron fedora-atomic-to-liveos-pxe]$ ls -l fedora-*
-rw-rw-r-- 1 sdake sdake 6231687168 Dec  9 08:59 fedora-atomic-base
-rwxr-xr-x 1 root  root     5751144 Dec  9 08:59 fedora-atomic-kernel
-rw-r--r-- 1 root  root    27320079 Dec  9 08:59 fedora-atomic-ramdisk
[sdake@bigiron fedora-atomic-to-liveos-pxe]$ glance image-create --name=fedora-atomic-kernel --container-format aki --disk-format aki --is-public True --file fedora-atomic-kernel
| Property         | Value                                |
| checksum         | 220c2e9d97c3f775effd2190199aa457     |
| container_format | aki                                  |
| created_at       | 2014-12-09T16:47:12                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | aki                                  |
| id               | b8e08b02-5eac-467d-80e1-6c8138d0bf57 |
| is_public        | True                                 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | fedora-atomic-kernel                 |
| owner            | a28b73a4f29044f184b854ffb7532ceb     |
| protected        | False                                |
| size             | 5751144                              |
| status           | active                               |
| updated_at       | 2014-12-09T16:47:12                  |
| virtual_size     | None                                 |
[sdake@bigiron fedora-atomic-to-liveos-pxe]$ glance image-create --name=fedora-atomic-ramdisk --container-format ari --is-public True --disk-format ari --file fedora-atomic-ramdisk
| Property         | Value                                |
| checksum         | 9ed72ddc0411e2f30d5bbe6b5c2c4047     |
| container_format | ari                                  |
| created_at       | 2014-12-09T16:48:31                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | ari                                  |
| id               | a62f6f32-ed66-4b18-8625-52d7262523f6 |
| is_public        | True                                 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | fedora-atomic-ramdisk                |
| owner            | a28b73a4f29044f184b854ffb7532ceb     |
| protected        | False                                |
| size             | 27320079                             |
| status           | active                               |
| updated_at       | 2014-12-09T16:48:31                  |
| virtual_size     | None                                 |
[sdake@bigiron fedora-atomic-to-liveos-pxe]$ glance image-create --name=fedora-atomic --container-format ami --disk-format ami --is-public True --property ramdisk_id=b2f60f33-9c8e-4905-a64b-90997d3dcb92 --property kernel_id=0e687b76-31d0-4351-a92a-a2d348482d42 --file fedora-atomic-base
| Property              | Value                                |
| Property 'kernel_id'  | 0e687b76-31d0-4351-a92a-a2d348482d42 |
| Property 'ramdisk_id' | b2f60f33-9c8e-4905-a64b-90997d3dcb92 |
| checksum              | 6a25f8bf17a94a6682d73b7de0a13013     |
| container_format      | ami                                  |
| created_at            | 2014-12-09T16:52:45                  |
| deleted               | False                                |
| deleted_at            | None                                 |
| disk_format           | ami                                  |
| id                    | d4ec78d7-445a-473d-9b7d-a1a6408aeed2 |
| is_public             | True                                 |
| min_disk              | 0                                    |
| min_ram               | 0                                    |
| name                  | fedora-atomic                        |
| owner                 | a28b73a4f29044f184b854ffb7532ceb     |
| protected             | False                                |
| size                  | 6231687168                           |
| status                | active                               |
| updated_at            | 2014-12-09T16:53:16                  |
| virtual_size          | None                                 |
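
The third image-create has to reference the ids reported by the first two. A small Python sketch of that wiring follows; it only builds command lines and parses the table output, using the same glance options that appear in the transcript above. The helper names are hypothetical.

```python
def parse_image_id(table_text):
    """Extract the value of the 'id' row from glance's table output."""
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0] == "id":
            return cells[1]
    return None


def glance_create(name, fmt, path, properties=None):
    """Build a `glance image-create` command with the options used above.

    The same value is used for container and disk format (aki, ari or ami).
    """
    cmd = ["glance", "image-create", "--name=" + name,
           "--container-format", fmt, "--disk-format", fmt,
           "--is-public", "True"]
    for key, value in (properties or {}).items():
        cmd += ["--property", "{}={}".format(key, value)]
    cmd += ["--file", path]
    return cmd


# Example: feed the id printed for the kernel image into the base image command.
kernel_output = "| id               | b8e08b02-5eac-467d-80e1-6c8138d0bf57 |"
kernel_id = parse_image_id(kernel_output)
print(" ".join(glance_create(
    "fedora-atomic", "ami", "fedora-atomic-base",
    {"kernel_id": kernel_id},
)))
```

In practice the ramdisk id would be parsed and passed the same way as the kernel id.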

Next we configure Ironic’s PXE boot config options and restart the ironic conductor in devstack. To restart Ironic conductor use screen -r, find the appropriate conductor screen, press CTRL-C, up arrow, ENTER. This will reload the configuration.

/etc/ironic/ironic.conf should be changed to have this config option:

pxe_append_params = nofb nomodeset vga=normal console=ttyS0 no_timer_check root=/dev/mapper/atomicos-root ostree=/ostree/boot.0/fedora-atomic/a002a2c2e44240db614e09e82c7822322253bfcaad0226f3ff9befb9f96d315f/0

Next we launch the fedora-atomic image using Nova’s baremetal flavor:

[sdake@bigiron ~]$ source /home/sdake/repos/devstack/openrc demo demo
[sdake@bigiron Downloads]$ nova boot --flavor baremetal --image fedora-atomic --key-name steak fedora_atomic_on_ironic
[sdake@bigiron Downloads]$ nova list
| ID                                   | Name                    | Status | Task State | Power State | Networks |
| e7f56931-307d-45a7-a232-c2fa70898cae | fedora-atomic_on_ironic | BUILD  | spawning   | NOSTATE     |          |

Finally, log in to the Atomic Host:

[sdake@bigiron ironic]$ nova list
| ID                                   | Name                    | Status | Task State | Power State | Networks         |
| d061c0ef-f8b7-4fff-845b-8272a7654f70 | fedora-atomic_on_ironic | ACTIVE | -          | Running     | private= |
[sdake@bigiron ironic]$ ssh fedora@
[fedora@fedora-atomic-on-ironic ~]$ uname -a
Linux fedora-atomic-on-ironic.novalocal 3.17.4-301.fc21.x86_64 #1 SMP Thu Nov 27 19:09:10 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I found determining how to create the images from the Fedora Atomic Cloud images a bit tedious. The diskimage-builder tool would likely make this easier, if it supported rpm-ostree and Atomic.

Ironic needs some work to allow the pxe options to override the “root” initrd parameter. Ideally a glance image property would be allowed to be specified to override and extend the boot options. I’ve filed an Ironic blueprint for such an improvement.

Turbocharge DevStack with RAID 0 SSDs

Turbocharging DevStack

I wanted to turbocharge my development cycle of OpenStack running on Fedora 18 so I could be waiting on my brain rather than waiting on my workstation. I decided to purchase two modern solid state drives (SSDs) and run them in RAID 0. I chose two Intel S3500 160 GB enterprise-grade SSDs. My second choice was the Samsung 840 Pro, which may have been a bit faster, but perhaps not as reliable.

Since OpenStack and DevStack mostly use /var and /opt for their work, I decided to replace only /var and /opt. RAID 0 has lower availability than a single disk, so keeping my home directory off the array means that if an SSD fails, I am less likely to lose work in progress.

The Baseline HP Z820

For a baseline, my system is a Hewlett Packard Z820 workstation (model #B2C08UT#ABA) that I purchased from Provantage in January 2013. Most of the computer is a beast, sporting an 8 core Intel Xeon E5-2670 @ 2.60GHz running with Hyperthreading for 16 total CPUs, an Intel C602 chipset, and 16 GB of quad-channel DDR3 ECC unbuffered RAM.

The memory is fast as shown with ramspeed:

[sdake@bigiron ramspeed-2.6.0]$ ./ramspeed -b 3 -m 4096
RAMspeed (Linux) v2.6.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

8Gb per pass mode

INTEGER   Copy:      11549.61 MB/s
INTEGER   Scale:     11550.59 MB/s
INTEGER   Add:       11885.79 MB/s
INTEGER   Triad:     11834.27 MB/s
INTEGER   AVERAGE:   11705.06 MB/s

Unfortunately the disk is a pokey 1TB 7200 RPM model. The hdparm tool shows just 118MB/sec on buffered reads.

[sdake@bigiron ~]$ sudo hdparm -tT /dev/sda
Timing cached reads: 20590 MB in 2.00 seconds = 10308.76 MB/sec
Timing buffered disk reads: 358 MB in 3.02 seconds = 118.69 MB/sec

The Gnome 3 Disk Image Benchmarking tool shows a lower average of 82MB per second, although this is also passing through the LVM driver:


Warning: I didn’t run this benchmark with write enabled, as it would have destroyed the data on my disk.

Running takes 6 minutes:

[sdake@bigiron devstack]$ ./
Using mysql database backend
Installing package prerequisites...[|[/]^C[sdake@bigiron devstack]$ 
[sdake@bigiron devstack]$ ./
Using mysql database backend
Installing package prerequisites...done
Installing OpenStack project source...done
Starting qpid...done
Configuring and starting MySQL...done
Starting Keystone...done
Configuring Glance...done
Configuring Nova...done
Configuring Cinder...done
Configuring Nova...done
Using libvirt virtualization driver...done
Starting Glance...done
Starting Nova API...done
Starting Nova...done
Starting Cinder...done
Configuring Heat...done
Starting Heat...done
Uploading images...done
Configuring Tempest...[/]
Heat has replaced the default flavors. View by running: nova flavor-list
Keystone is serving at
Examples on using novaclient command line is in
The default users are: admin and demo
The password: 123456
This is your host ip:
done completed in 368 seconds

I timed a heat stack-create operation at about 34 seconds.  In a typical day I may create 50 or more stacks, so the time really adds up.

Turbo-charged DevStack

After installing two SSD devices, I decided to use LVM raid 0 striping.  Linux Magazine indicates mdadm is faster, but I prefer a single management solution for my disks.

The hdparm tool shows a beastly 1GB/sec throughput on reads:

[sdake@bigiron ~]$ sudo hdparm -tT /dev/raid0_vg/ssd_opt

Timing cached reads: 21512 MB in 2.00 seconds = 10771.51 MB/sec
Timing buffered disk reads: 3050 MB in 3.00 seconds = 1016.47 MB/sec

I also ran the Gnome 3 disk benchmarking tool, this time in write mode.  It showed an average 930MB/sec read and 370MB/sec write throughput:


I ran in a little under 3 minutes:

[sdake@bigiron devstack]$ ./
Using mysql database backend
Installing package prerequisites...done
Installing OpenStack project source...done
Starting qpid...done
Configuring and starting MySQL...done
Starting Keystone...done
Configuring Glance...done
Configuring Nova...done
Configuring Cinder...done
Configuring Nova...done
Using libvirt virtualization driver...done
Starting Glance...done
Starting Nova API...done
Starting Nova...done
Starting Cinder...done
Configuring Heat...done
Starting Heat...done
Uploading images...done
Configuring Tempest...[|]
Heat has replaced the default flavors. View by running: nova flavor-list
Keystone is serving at
Examples on using novaclient command line is in
The default users are: admin and demo
The password: 123456
This is your host ip:
done completed in 166 seconds

I timed a heat stack-create at 6 seconds.  Compared to the non-SSD 34 seconds, RAID 0 SSDs rock!  The overall system seems much faster, and the benchmarks confirm it.
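To put the two sets of numbers side by side, here is a small Python sketch that computes the speedups from the measurements quoted above.  The 50 stacks per day figure is my rough estimate from earlier, and the function and variable names are only for illustration.

```python
# Measurements quoted in this post (seconds, except hdparm in MB/sec).
HDD = {"devstack": 368, "stack_create": 34, "hdparm_mb_s": 118.69}
SSD = {"devstack": 166, "stack_create": 6, "hdparm_mb_s": 1016.47}


def speedup(before, after):
    """Return how many times shorter `after` is than `before`."""
    return before / float(after)


devstack_speedup = speedup(HDD["devstack"], SSD["devstack"])
create_speedup = speedup(HDD["stack_create"], SSD["stack_create"])
# Throughput is "bigger is better", so the ratio is taken the other way round.
read_speedup = SSD["hdparm_mb_s"] / HDD["hdparm_mb_s"]

# Assumption: roughly 50 stack creates in a working day.
stacks_per_day = 50
seconds_saved = stacks_per_day * (HDD["stack_create"] - SSD["stack_create"])
minutes_saved = seconds_saved / 60.0

print("DevStack run:  %.1fx faster" % devstack_speedup)
print("Stack create:  %.1fx faster" % create_speedup)
print("Buffered read: %.1fx faster" % read_speedup)
print("Saved per day: %.0f minutes" % minutes_saved)
```

At 50 stacks a day, the 28 seconds saved per stack create works out to a bit over 23 minutes of waiting avoided.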

How we use CloudInit in OpenStack Heat

Many people over the past year have asked me how exactly to use CloudInit while the Heat developers have implemented OpenStack Heat.  Since CloudInit is the default virtual machine bootstrapping system on Debian, Fedora, Red Hat Enterprise Linux, Ubuntu and likely more distros, we decided to start with CloudInit as our base bootstrapping system.  I’ll present a code walk-through of how we use CloudInit inside OpenStack Heat.

Reading the CloudInit documentation is helpful, but it lacks programming examples of how to develop software to inject data into virtual machines using CloudInit.   The OpenStack Heat project implements injection in Python for CloudInit-enabled virtual machines.  Injection occurs by passing information to the virtual machine that is decoded by CloudInit.

IaaS platforms require a method for users to pass data into the virtual machine.  OpenStack provides a metadata server which is co-located with the rest of the OpenStack infrastructure.  When the virtual machine is booted, it can make an HTTP request to a specific URI and retrieve the user data passed to the instance during instance creation.
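As a rough illustration of that request, the sketch below fetches user data the way a guest might.  It assumes the EC2-compatible address http://169.254.169.254/latest/user-data served by the OpenStack metadata service; fetch_user_data and its injectable opener are hypothetical names for this example and are not part of CloudInit.

```python
from urllib.request import urlopen

# Assumption: the EC2-compatible metadata endpoint visible from the guest.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url=USER_DATA_URL, opener=urlopen, timeout=5):
    """Return the raw user data bytes passed at instance creation.

    `opener` is injectable so the function can be exercised away from a
    cloud, where the metadata address is not reachable.
    """
    response = opener(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch_user_data() with no arguments only works from inside a running instance; anywhere else, pass a stand-in opener.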

CloudInit’s job is to contact the metadata server and bootstrap the virtual machine with desired configurations.  In OpenStack Heat, we do this with three specific files.

The first file is our CloudInit configuration file:

runcmd:
 - setenforce 0 > /dev/null 2>&1 || true

user: ec2-user

cloud_config_modules:
 - locale
 - set_hostname
 - ssh
 - timezone
 - update_etc_hosts
 - update_hostname
 - runcmd

# Capture all subprocess output into a logfile
# Useful for troubleshooting cloud-init issues
output: {all: '| tee -a /var/log/cloud-init-output.log'}

This file directs CloudInit to turn off SELinux, install ssh keys for the user ec2-user, set up the locale, hostname, ssh, and timezone, modify /etc/hosts with correct information, and write the output of all cloud-init subprocesses to /var/log/cloud-init-output.log.

There are many cloud config modules which provide different functionality.  Unfortunately they are not well documented, so the source must be read to understand their behavior.  For a list of cloud config modules, check the upstream repo.

Another file required by OpenStack Heat’s support for CloudInit is a part handler:

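import datetime
import errno
import os

def list_types():
    # Mime types CloudInit should hand to handle_part()
    return ["text/x-cfninitdata"]

def handle_part(data, ctype, filename, payload):
    if ctype == "__begin__":
        # First call: create the directory that holds the decoded files
        try:
            os.makedirs('/var/lib/heat-cfntools', 0o700)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
        return

    if ctype == "__end__":
        return

    with open('/var/log/part-handler.log', 'a') as log:
        timestamp = datetime.datetime.now()
        log.write('%s filename:%s, ctype:%s\n' % (timestamp, filename, ctype))

    if ctype == 'text/x-cfninitdata':
        # Store each Heat-specific part under its attachment filename
        with open('/var/lib/heat-cfntools/%s' % filename, 'w') as f:
            f.write(payload)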

The file is executed by CloudInit to separate the UserData provided by the MetaData server in OpenStack.  CloudInit executes handle_part() for each part of a multi-part mime message which CloudInit doesn’t know how to decode.  This is how OpenStack Heat passes unique information for each virtual machine to assist in the orchestration process.  The first ctype is always set to __begin__, which triggers handle_part() to create the directory /var/lib/heat-cfntools.

The OpenStack Heat instance launch code uses the mime subtype x-cfninitdata.  OpenStack Heat passes several files via this mime subtype, each of which is decoded and stored in /var/lib/heat-cfntools.
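To make the calling sequence concrete, the sketch below imitates what happens to a multipart message: a handler is called once with __begin__, once per part, and once with __end__.  This is a simplified stand-in written with Python's standard email package, not CloudInit's actual dispatch code; dispatch_parts, the in-memory handler, and the sample payloads are all hypothetical.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def dispatch_parts(mime_string, handle_part):
    """Drive a part handler: __begin__, every sub-part, then __end__."""
    handle_part(None, "__begin__", None, None)
    message = message_from_string(mime_string)
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart container itself
        handle_part(None, part.get_content_type(),
                    part.get_filename(), part.get_payload(decode=True))
    handle_part(None, "__end__", None, None)


# Stand-in handler: records parts in a dict instead of writing them to
# /var/lib/heat-cfntools.
stored = {}
seen = []


def handle_part(data, ctype, filename, payload):
    seen.append(ctype)
    if ctype == "text/x-cfninitdata":
        stored[filename] = payload


# Build a tiny two-part message with placeholder contents.
blob = MIMEMultipart()
for name, body in [("cfn-userdata", "#!/bin/sh\necho hello\n"),
                   ("cfn-init-data", '{"example": true}')]:
    sub = MIMEText(body, _subtype="x-cfninitdata")
    sub.add_header("Content-Disposition", "attachment", filename=name)
    blob.attach(sub)

dispatch_parts(blob.as_string(), handle_part)
print(seen)
print(sorted(stored))
```

The handler sees the begin and end markers around the two x-cfninitdata parts, and each part ends up keyed by its attachment filename, which mirrors how the files land in /var/lib/heat-cfntools.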

The final file required is a script which runs at first boot:

#!/usr/bin/env python

import datetime
import os
import subprocess
import sys
from distutils.version import LooseVersion

import pkg_resources

path = '/var/lib/heat-cfntools'

def chk_ci_version():
    v = LooseVersion(pkg_resources.get_distribution('cloud-init').version)
    return v >= LooseVersion('0.6.0')

def create_log(path):
    # Create the log file readable and writable by its owner only
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    return os.fdopen(fd, 'w')

def call(args, log):
    log.write('%s\n' % ' '.join(args))
    log.flush()
    p = subprocess.Popen(args, stdout=log, stderr=log)
    # Wait for the child so that returncode is populated
    p.wait()
    return p.returncode

def main(log):

    if not chk_ci_version():
        # pre 0.6.0 - user data executed via cloudinit, not this helper
        log.write('Unable to log provisioning, need a newer version of'
                  ' cloud-init\n')
        return -1

    userdata_path = os.path.join(path, 'cfn-userdata')
    os.chmod(userdata_path, 0o700)

    log.write('Provision began: %s\n' % datetime.datetime.now())
    log.flush()
    returncode = call([userdata_path], log)
    log.write('Provision done: %s\n' % datetime.datetime.now())
    if returncode:
        return returncode

if __name__ == '__main__':
    with create_log('/var/log/heat-provision.log') as log:
        returncode = main(log)
        if returncode:
            log.write('Provision failed')
            sys.exit(returncode)

    # Leave a timestamped marker file once provisioning has finished
    userdata_path = os.path.join(path, 'provision-finished')
    with create_log(userdata_path) as log:
        log.write('%s\n' % datetime.datetime.now())

This script runs /var/lib/heat-cfntools/cfn-userdata and logs its output to /var/log/heat-provision.log.

These files are co-located with OpenStack Heat’s engine process which loads these files and combines them plus other OpenStack Heat specific configuration blobs into one multipart mime message.

OpenStack Heat’s UserData generator:

    def _build_userdata(self, userdata):
        if not self.mime_string:
            # Build mime multipart data blob for cloudinit userdata

            def make_subpart(content, filename, subtype=None):
                if subtype is None:
                    subtype = os.path.splitext(filename)[0]
                msg = MIMEText(content, _subtype=subtype)
                msg.add_header('Content-Disposition', 'attachment',
                               filename=filename)
                return msg

            def read_cloudinit_file(fn):
                return pkgutil.get_data('heat', 'cloudinit/%s' % fn)

            attachments = [(read_cloudinit_file('config'), 'cloud-config'),
                           (read_cloudinit_file('part-handler.py'),
                            'part-handler.py'),
                           (userdata, 'cfn-userdata', 'x-cfninitdata'),
                           (read_cloudinit_file('loguserdata.py'),
                            'loguserdata.py', 'x-shellscript')]

            if 'Metadata' in self.t:
                attachments.append((json.dumps(self.metadata),
                                    'cfn-init-data', 'x-cfninitdata'))

            attachments.append((cfg.CONF.heat_watch_server_url,
                                'cfn-watch-server', 'x-cfninitdata'))

            attachments.append((cfg.CONF.heat_metadata_server_url,
                                'cfn-metadata-server', 'x-cfninitdata'))

            # Create a boto config which the cfntools on the host use to know
            # where the cfn and cw API's are to be accessed
            cfn_url = urlparse(cfg.CONF.heat_metadata_server_url)
            cw_url = urlparse(cfg.CONF.heat_watch_server_url)
            is_secure = cfg.CONF.instance_connection_is_secure
            vcerts = cfg.CONF.instance_connection_https_validate_certificates
            boto_cfg = "\n".join(["[Boto]",
                                  "debug = 0",
                                  "is_secure = %s" % is_secure,
                                  "https_validate_certificates = %s" % vcerts,
                                  "cfn_region_name = heat",
                                  "cfn_region_endpoint = %s" %
                                  cfn_url.hostname,
                                  "cloudwatch_region_name = heat",
                                  "cloudwatch_region_endpoint = %s" %
                                  cw_url.hostname])
            attachments.append((boto_cfg,
                                'cfn-boto-cfg', 'x-cfninitdata'))

            subparts = [make_subpart(*args) for args in attachments]
            mime_blob = MIMEMultipart(_subparts=subparts)

            self.mime_string = mime_blob.as_string()

        return self.mime_string

This code provides two functions:

  • make_subpart: Takes one attachment (content, filename and optional mime subtype) and creates a mime subpart out of it
  • read_cloudinit_file: Reads one of the three OpenStack Heat CloudInit files described above
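The MIME assembly can also be tried outside of Heat.  The sketch below is a standalone approximation of the attachment handling using only Python's standard email package; build_userdata and the sample attachment contents are hypothetical, and the Heat configuration and boto pieces are left out.

```python
import os
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def make_subpart(content, filename, subtype=None):
    """Wrap one attachment as a text/<subtype> MIME part named `filename`."""
    if subtype is None:
        # Default the subtype to the filename without its extension,
        # e.g. 'cloud-config' becomes text/cloud-config.
        subtype = os.path.splitext(filename)[0]
    msg = MIMEText(content, _subtype=subtype)
    msg.add_header('Content-Disposition', 'attachment', filename=filename)
    return msg


def build_userdata(attachments):
    """Combine (content, filename[, subtype]) tuples into one MIME string."""
    subparts = [make_subpart(*args) for args in attachments]
    return MIMEMultipart(_subparts=subparts).as_string()


# Sample attachments; the contents are placeholders, not Heat's real files.
attachments = [
    ("user: ec2-user\n", "cloud-config"),
    ("#!/bin/sh\necho provisioning\n", "cfn-userdata", "x-cfninitdata"),
]
mime_string = build_userdata(attachments)
print(mime_string)
```

Printing the result shows one multipart/mixed message with a text/cloud-config part and a text/x-cfninitdata part, each carrying its filename in the Content-Disposition header.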

The rest of the function generates the UserData OpenStack Heat needs based upon the attachments list.  These attachments are then turned into a mime message which is passed to the instance creation:

       server_userdata = self._build_userdata(userdata)
        server = None
            server = self.nova().servers.create(
            # Avoid a race condition where the thread could be cancelled
            # before the ID is stored
            if server is not None:

This snippet of code creates the UserData and passes it to the nova server create operation.

The flow is then:

  1. Create user data
  2. Heat creates nova server instance with user data
  3. Nova creates the instance
  4. CloudInit distro initialization occurs
  5. CloudInit reads config from OpenStack metadata server UserData information
  6. CloudInit executes the part handler with __begin__
  7. CloudInit executes the part handler for each x-cfninitdata mime type
  8. The part handler writes the contents of each x-cfninitdata mime subpart to /var/lib/heat-cfntools on the instance
  9. CloudInit executes the part handler with __end__
  10. CloudInit executes the configuration operations defined by the config file
  11. CloudInit runs the x-shellscript blob, which in this case is the first-boot script
  12. The first-boot script logs the output of /var/lib/heat-cfntools/cfn-userdata, which is the initialization script set in the OpenStack Heat templates

This code walk-through should help developers understand how OpenStack Heat integrates with CloudInit, and how to use CloudInit in their own Python applications when rolling their own bootstrapping process.

The Heat API – A template based orchestration framework

Over the last year, Angus Salkeld and I have been developing an IaaS high availability service called Pacemaker Cloud.  We learned that the problem we were really solving was orchestration.  Another dev group inside Red Hat was also looking at this problem from the launching side.  We decided to take two weeks off from our existing work and see if we could join together to create a proof of concept implementation from scratch of AWS CloudFormation for OpenStack.  The result of that work was a proof of concept project which provided launching of a WordPress template, as had been done in our previous project.

The developers decided to take another couple of weeks to determine if we could get a more functional system that would handle composite virtual machines.  Today, we released that version, our second iteration of the Heat API.  Since we have many more developers, and a project that exceeds the functionality of Pacemaker Cloud, the Heat Development Community has decided to cease work on our previous orchestration projects and focus our efforts on Heat.

A bit about Heat:  The Heat API implements the AWS CloudFormation API.  This API provides a REST interface for creating composite VMs called Stacks from template files.  The goal of the software is to be able to accurately launch AWS CloudFormation Stacks on OpenStack.  We will also enable good quality high availability based upon the technologies we created in Pacemaker Cloud, including escalation.

Given that C was a poor choice of implementation language for making REST based cloud services, Heat is implemented in Python which is fantastic for REST services.  The Heat API also follows OpenStack design principles.  Our initial design after our POC shows the basics of our architecture and our quickstart guide can be used with our second iteration release.

A mailing list is available for developer and user discussion.  We track milestones and issues using github’s issue tracker.  Things are moving fast – come join our project on github or chat with the devs on #heat on freenode!

Corosync 2.0.0 released!

A few short weeks after Corosync 1.0.0 was released, the developers huddled for our future planning of Corosync 2.0.0.  The major focus of that meeting was “Corosync as implemented is too complicated”.  We had threads, semaphores, mutexes, an entire protocol, plugins, a bunch of unused services, a backwards compatibility layer, and multiple cryptographic engines.

Going for us, we did have a world class group communication system implementation (if not a little complicated) developed by a large community of developers, battle hardened by thousands of field deployments, tested by tens of thousands of community members.

As a result of that meeting, we decided to keep the good and throw out the bad, as we did between the openais and corosync transitions.  Gone are threads.  Gone are compatibility layers.  Gone are plugins.  Gone are unsupported encryption engines.  Gone is a bunch of other user-invisible junk that was crudding up the code base.

Shortly after Corosync 2.0.0 development was started, Angus Salkeld had the great idea of taking the infrastructure in corosync (IPC, Logging, Timers, Poll loop, shared memory, etc) and putting that into a new project called libqb.  The objective of this work was obvious:  To create a world-class infrastructure library specifically focused on the needs of cluster developers with a great built-in make-check test suite.

This helped us reach even closer to our goals of simplification.  As we pushed the infrastructure out of base Corosync, we could focus more on protocols/APIs.  You would be surprised to find that implementing the infrastructure took about as much effort as the rest of the system (APIs and Totem).

All of this herculean effort wouldn’t be possible without our developer and user community.  I’d especially like to acknowledge Jan Friesse in his leadership role of helping to coordinate the upstream release process and drive the upstream feature set to 2.0.0 resolution.  Angus Salkeld was invaluable in his huge libqb effort which occurred on time and with great quality.  Finally I want to thank Fabio Di Nitto for beating various parts of the Corosync code base into submission and his special role in designing the votequorum API.  There are many other contributors, including developers and testers, whom I won’t mention individually but would also like to thank for their improvements to the code base.

Great job devs!!  Now it’s up to the users of Corosync to tell us if we delivered on the objective we set out with 18 months ago – making Corosync 2.0 faster, simpler, smaller, and most importantly higher quality.

The software can be downloaded from Corosync’s Website.  Corosync 2.0, as well as the rest of the improved community developed cluster stack will show up in distros as they refresh their stacks.

Announcing Pacemaker Cloud 0.6.0 release

I am super pleased to announce the release of Pacemaker Cloud 0.6.0.

Pádraig Brady will be providing a live demonstration of Pacemaker Cloud integrated with OpenStack at FOSDEM.


What is pacemaker cloud?

Pacemaker Cloud is a high scale high availability system for virtual machine and cloud environments.  Pacemaker Cloud uses the techniques of fault detection, fault isolation, recovery and notification to provide a full high availability solution tailored to cloud environments.

Pacemaker Cloud combines multiple virtual machines (called assemblies) into one application group (called a deployable).  The deployable is then managed to maintain an active and running state in the face of failures.  Recovery escalation is used to recover from repetitive failures and drive the deployable to a known good working state.
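To give a feel for the idea of recovery escalation, here is a small Python sketch.  It is not Pacemaker Cloud's implementation: the EscalationPolicy class, the thresholds, and the choice of restarting a single assembly first and the whole deployable after repeated failures are all assumptions made for illustration.

```python
import time


class EscalationPolicy(object):
    """Escalate when an assembly fails too often within a time window.

    max_failures and window_seconds are illustrative values, not
    Pacemaker Cloud defaults.
    """

    def __init__(self, max_failures=3, window_seconds=60.0, clock=time.time):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures = []

    def record_failure(self):
        """Record one failure and return the recovery action to take."""
        now = self.clock()
        # Keep only the failures that are still inside the window.
        self.failures = [t for t in self.failures
                         if now - t < self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            # Repetitive failure: escalate to the whole application group.
            self.failures = []
            return "restart deployable"
        # Isolated failure: recover only the failed virtual machine.
        return "restart assembly"
```

A failure detector would call record_failure() each time an assembly is seen to fail and act on the returned string; isolated failures stay cheap, while a burst of failures triggers the heavier recovery.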

New in this release:

  • OpenStack integration
  • Division of supported infrastructures into separate packaging
  • Ubuntu assembly support
  • WordPress vm + MySQL deployable demonstration
  • Significantly more readable event notification output
  • Add ssh keys to generated assemblies
  • Recovery escalation
  • Bug fixes
  • Performance enhancements

Where to get the software:

The software is available for download on the project’s website.

Adding second monitoring method to Pacemaker Cloud – sshd

Recently Angus Salkeld and I have decided to start working on a second approach to Pacemaker Cloud monitoring. Today we monitor with Matahari. We would also like the ability to monitor with OpenSSH’s sshd.  With this model, sshd becomes a second monitoring agent in addition to Matahari.  Since sshd is everywhere, and everyone is comfortable with the SSH security model, we believe this makes a superb alternative monitoring solution.

To help kick that work off, I’ve started a new branch in our git repository where this code will be located called topic-ssh.

To summarize the work, we are taking the dped binary and making a second libssh2 specific binary based on the work of the dped. We will also integrate directly with libdeltacloud as part of this work. The output of this topic will be the major work in the 0.7.0 release.

We looked at python as the language for dped, but testing showed that not to be particularly feasible without drastically complicating our operating model. With our model of running thousands of dpe processes on one system, one dpe per deployable, we would need python to have a small footprint. Testing showed that python consumes 15 times as much memory per dpe instance vs a comparable C binary.

We think there are many opportunities for people without a strong C skillset, but with a strong python skillset to contribute tremendously to the project in the CPE component. We plan to rework the CPE process into a python implementation.

If you want to get involved in the project today, working on the CPE C++ to python rework would be a great place to start!

Release schedule for Corosync Needle (2.0)

Over the last 18 months, the Corosync development community has been hard at work making Corosync Needle (version 2.0.0) a reality.  This release offers an evolutionary step in Corosync by adding several community requested features, removing the troubling threads and plugins, and tidying up the quorum code base.

I would like to point out the diligent work of Jan Friesse (Honza) for tackling the 15 or so feature backlog items on our feature list.  Angus Salkeld has taken the architectural step of moving the infrastructure (ipc, logging, and other infrastructure components) of Corosync into a separate project (libqb).  Finally I’d like to point out the excellent work of Fabio Di Nitto and his cabal for tackling the quorum code base to make it truly usable for bare metal clusters.

The release schedule is as follows:

Alpha		January 17, 2012	version 1.99.0
Beta		January 31, 2012	version 1.99.1
RC1		February 7, 2012	version 1.99.2
RC2		February 14, 2012	version 1.99.3
RC3		February 20, 2012	version 1.99.4
RC4		February 27, 2012	version 1.99.5
RC5		March 6, 2012		version 1.99.6
Release 2.0.0	March 13, 2012		version 2.0.0