Launchpad is an open source suite of tools that help people and teams to work together on software projects, and it includes a build service with over 11,000 Personal Package Archives (PPAs). We’ve recently made some major changes to the underlying infrastructure of this system by migrating it to an OpenStack instance that we call “ScalingStack”.
ScalingStack can be thought of as spot instances for Canonical. It’s designed to run workloads that can tolerate having hypervisors removed midway through any job without negative impact. This enables us to repurpose the underlying hardware, even if only temporarily, for usage by such workloads and then take it back at any point if it needs to be returned for usage in another part of the company. It also allows us to deal with spikes in load by temporarily assigning hardware to it.
Previously, the Launchpad build farm, designed back in 2007, used Xen with copy-on-write snapshots and ballooning to reset VMs in less than 5 seconds. This setup involved a read-only base image and an overlay image that the builders had write access to. Reset scripts in Launchpad allowed it to trigger a teardown and rebuild of a particular builder. We also had a custom network, which we called “the airlock”, that we would move machines into in order to set them up in the build infrastructure. This gave us a way to safely re-use hardware on a temporary basis for builders.
We chose to deploy ScalingStack with OpenStack Icehouse running on Ubuntu 14.04 LTS using MAAS and Juju, as it’s part of our culture to dogfood technologies that Canonical’s customers use. Getting a setup that worked without issue took a significant amount of work, but along the way we helped improve many of the Juju charms used to deploy OpenStack on Ubuntu, and in the end we had a solution that allowed us to do an end-to-end OpenStack deployment (we tested it multiple times) in a few hours.
Our initial implementation comprised two OpenStack infrastructure nodes and three dedicated compute nodes. Each compute node had 24 cores, 80GB of RAM and 1TB of local storage. On the first infrastructure node we ran MAAS, the Juju bootstrap node in KVM, and neutron (also in KVM). On the second infrastructure node we ran the remaining OpenStack components (Glance, Keystone and Nova’s “Cloud Controller” as well as MySQL and RabbitMQ), deployed via Juju in Linux containers (LXC). This setup was codified in a branch containing juju-deployer config files, making for a repeatable deployment process. As we neared go-live on production we tore everything down and redeployed from scratch a number of times to confirm that the process worked reliably.
In August 2014, ScalingStack went live and replaced the legacy infrastructure, taking on the entire builder workload for production Launchpad. Within a few weeks we added a second region in a different data centre, expanding our capacity and giving us the ability to use hardware in ScalingStack from both of Canonical’s major data centres in the UK.
In most senses, ScalingStack is a vanilla OpenStack installation which relies on per-tenant networks in neutron for isolation of the different workloads we plan to run on it. But there are a few differences from a standard OpenStack install that are worth highlighting.
The first is that we pass “cachemode=unsafe” to the “config-flags” configuration option of the nova-compute charm. This means that none of the changes we make to nova-compute instances need to be written to disk, unless we run out of RAM. This buys us some pretty significant speed increases for things like the initial filesystem resize and package upgrades as we boot a new instance, which are typically very IO intensive. Obviously this isn’t something you’d want to do on most production OpenStack installations, but it’s a specific choice here because the workloads can tolerate transient failures which might cause them to need to retry a job.
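As a rough illustration of how that option might be set from a script, the sketch below builds the command line for Juju's 1.x-era `juju set` command. The helper name `nova_compute_config_command` is hypothetical, and the comma-separated `key=value` form of “config-flags” is an assumption about the charm rather than something confirmed here. The sketch only constructs and prints the command; it does not run Juju.

```python
import shlex


def nova_compute_config_command(flags, service="nova-compute"):
    """Build a `juju set` command that sets the charm's config-flags option.

    Hypothetical helper: assumes config-flags accepts comma-separated
    key=value pairs and that the Juju 1.x `juju set` command is in use.
    """
    joined = ",".join(f"{key}={value}" for key, value in flags.items())
    return ["juju", "set", service, f"config-flags={joined}"]


# The single flag described in the text.
command = nova_compute_config_command({"cachemode": "unsafe"})

# Print a shell-safe rendering for review instead of executing it.
print(shlex.join(command))
```

Returning an argument list rather than a string keeps the command safe to hand to `subprocess.run` later without extra shell quoting.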
The second way in which ScalingStack differs from a vanilla OpenStack install is that we have some Juju charms to customise the images we use for the Launchpad builders, so that they include specific packages and up-to-the-minute security updates. Anything that avoids repeating tasks on each VM boot saves us significant time.
So what kind of changes have we seen as part of this project?
Before (legacy Xen build farm):
- 67 machines taking 154U (4 racks)
- Manual job to repair builders on an ongoing basis
- Mean build duration: 16 minutes
- 90th percentile wait time (how long before a build started): 78m

After (ScalingStack):
- 6 nova-compute nodes taking 12U (less than a third of a rack)
- No need for manual repair job
- Mean build duration: 8 minutes
- 90th percentile wait time: 22m
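
To put the figures in the list above side by side (the 67-machine legacy farm versus the 6 nova-compute nodes), here is a minimal sketch of the arithmetic. The dictionary keys and the `reduction` helper are names chosen for illustration; the numbers are the ones quoted in the list.

```python
def reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Figures quoted in the list: legacy build farm vs. ScalingStack.
legacy = {"rack_units": 154, "mean_build_min": 16, "p90_wait_min": 78}
scalingstack = {"rack_units": 12, "mean_build_min": 8, "p90_wait_min": 22}

for metric, before in legacy.items():
    after = scalingstack[metric]
    print(f"{metric}: {before} -> {after} "
          f"({reduction(before, after):.0%} lower, {before / after:.1f}x)")
```

That works out to roughly 92% less rack space, a 50% shorter mean build, and a 90th percentile wait that is about 72% shorter.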
Even taking into account that the 67 machines were older and slower hardware, that’s still a significant improvement in hardware density. Since the migration to OpenStack we’ve averaged over 12,000 builds per week. What’s more, adding hardware to ScalingStack is as simple as provisioning it in MAAS, and then doing a “juju add-unit nova-compute”.
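For a sense of scale, the weekly build count can be combined with the 8-minute mean build duration from the list above. This is a back-of-the-envelope sketch only: it assumes builds are spread evenly across the week, which real load (with its spikes) will not be, and it treats 12,000 as a lower bound since the average was “over” that. The constant and function names are illustrative.

```python
BUILDS_PER_WEEK = 12_000       # "over 12,000 builds per week", so a lower bound
MEAN_BUILD_MINUTES = 8         # mean build duration after the migration
MINUTES_PER_WEEK = 7 * 24 * 60


def average_concurrency(builds_per_week, mean_minutes):
    """Average number of builds in progress, assuming an even spread."""
    return builds_per_week * mean_minutes / MINUTES_PER_WEEK


builds_per_hour = BUILDS_PER_WEEK / (7 * 24)
busy = average_concurrency(BUILDS_PER_WEEK, MEAN_BUILD_MINUTES)
print(f"{builds_per_hour:.1f} builds/hour, ~{busy:.1f} builds running on average")
```

An average like this says nothing about peak demand, so it should not be read as the number of builders the farm actually needs.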
We expected a few benefits from this project, including being able to jettison a significant portion of legacy code as we migrated to OpenStack, and increased hardware density. However, the scale of the hardware density increase surprised us, as did the performance improvements – the reductions in build time and the build queue were a nice side effect!
Where do we go from here? We have active plans to open up ScalingStack to other workloads, including retracing errors for Ubuntu machines, and slave instances for continuous integration testing of changes to Ubuntu.
Many thanks to the team that worked on the ScalingStack project, particularly Paul Collins who helped drive this project over the finish line, but also William Grant for his work on the Launchpad side of things.