Category Archives: Juju

Neutron, ZeroMQ and Git – Ubuntu OpenStack 15.04 Charm release!

Alongside the Ubuntu 15.04 release on the 23rd April, the Ubuntu OpenStack Engineering team delivered the latest release of the OpenStack charms for deploying and managing OpenStack on Ubuntu using Juju.

Here are some selected highlights from this most recent charm release.

OpenStack Kilo support

As always, we’ve enabled charm support for OpenStack Kilo alongside development. To use this new release use the openstack-origin configuration option of the charms, for example:

juju set cinder openstack-origin=cloud:trusty-kilo

NOTE: Setting this option on an existing deployment will trigger an upgrade to Kilo via the charms – remember to plan and test your upgrade activities prior to production implementation!

Neutron

As part of this release, the team have been working on enabling some of the new Neutron features that were introduced in the Juno release of OpenStack.

Distributed Virtual Router

One of the original limitations of the Neutron reference implementation (ML2 + Open vSwitch) was the requirement to route all north/south and east/west network traffic between instances via network gateway nodes.

For Juno, the Distributed Virtual Router (DVR) function was introduced to allow routing capabilities to be distributed more broadly across an OpenStack cloud.

DVR pushes a lot of the layer 3 network routing function of Neutron directly onto compute nodes – instances which have floating IPs no longer have the restriction of routing via a gateway node for north/south traffic. This traffic is now pushed directly to the external network by the compute nodes via dedicated external network ports, bypassing the requirement for network gateway nodes.

Network gateway nodes are still required for SNAT northbound routing for instances that don't have floating IP addresses.

For the 15.04 charm release, we’ve enabled this feature across the neutron-api, neutron-openvswitch and neutron-gateway charms – you can toggle this capability using configuration in the neutron-api charm:

juju set neutron-api enabled-dvr=true l2-population=true \
    overlay-network-type=vxlan

This feature requires that every compute node have a physical network port onto the external public facing network – this is configured on the neutron-openvswitch charm, which is deployed alongside nova-compute:

juju set neutron-openvswitch ext-port=eth1

NOTE: Existing routers will not be switched into DVR mode by default – this must be done manually by a cloud administrator.  We’ve also only tested this feature with vxlan overlay networks – expect gre and vlan enablement soon!
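
If you do want to convert an existing router, the usual approach (as a cloud administrator) is to take the router down, flip it into distributed mode and bring it back up – a rough sketch using the neutron CLI of the time, with the router name purely illustrative:

neutron router-update router1 --admin-state-up False
neutron router-update router1 --distributed True
neutron router-update router1 --admin-state-up True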

Router High Availability

For Clouds where the preference is still to route north/south traffic via a limited set of gateway nodes, rather than exposing all compute nodes directly to external network zones, Neutron has also introduced a feature to enable virtual routers in highly available configurations.

To use this feature, you need to be running multiple units of the neutron-gateway charm – again it’s enabled via configuration in the neutron-api charm:

juju set neutron-api enable-l3ha=true l2-population=false

Right now Neutron DVR and Router HA features are mutually exclusive due to layer 2 population driver requirements.
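
With l3-ha enabled, newly created tenant routers should be scheduled in HA mode automatically; an administrator can also request one explicitly and check which L3 agents are hosting it – a minimal sketch (router name illustrative):

neutron router-create --ha True router1
neutron l3-agent-list-hosting-router router1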

Our recommendation is that these new Neutron features are only enabled with OpenStack Kilo, as numerous fixes and improvements have landed in the 6 months since they first appeared in OpenStack Juno.

Initial ZeroMQ support

The ZeroMQ lightweight messaging kernel is a library which extends the standard socket interfaces with features traditionally provided by specialised messaging middleware products, without the requirement for a centralized message broker infrastructure.

Interest and activity around the 0mq driver in Oslo Messaging has been gathering pace during the Kilo cycle, with numerous bug fixes and improvements being made to the driver code.

Alongside this activity, we’ve enabled ZeroMQ support in the Nova and Neutron charms in conjunction with a new charm – ‘openstack-zeromq’:

juju deploy redis-server
juju deploy openstack-zeromq
juju add-relation redis-server openstack-zeromq
for svc in nova-cloud-controller nova-compute \
    neutron-api neutron-openvswitch quantum-gateway; do
    juju deploy $svc
    juju add-relation $svc openstack-zeromq
done

The ZeroMQ driver makes use of a Redis server to maintain a catalog of topic endpoints for the OpenStack cloud so that services can figure out where to send RPC requests.

We expect to enable further charm support as this feature matures upstream – so for now please consider this feature for testing purposes only.

Deployment from source

A core set of the OpenStack charms have also grown the capability to deploy from git repositories, rather than from the usual Debian packages in the Ubuntu archive.  This allows all of the power of deploying OpenStack using charms to be re-used with deployments from active development.

For example, you’ll still be able to scale out and cluster OpenStack services deployed this way – seeing a keystone service deployed from git, running with haproxy, corosync and pacemaker as part of a fully HA deployment is pretty awesome!

This feature is currently tested with the stable/icehouse and stable/juno branches – we’re working on completing testing of the kilo support and expect to land that as a stable update soon.
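
To give a flavour of how this is driven, the charms that support it take a YAML description of the git repositories to deploy from. The option name and format below reflect what we used when testing against stable/juno, but treat this as a sketch rather than a reference – check the charm config for your release:

juju set keystone openstack-origin-git="repositories:
  - {name: requirements,
     repository: 'git://git.openstack.org/openstack/requirements',
     branch: stable/juno}
  - {name: keystone,
     repository: 'git://git.openstack.org/openstack/keystone',
     branch: stable/juno}"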

This feature is considered experimental and we expect to complete further improvements and enablement across a wider set of charms – so please don’t use it for production services!

And finally…

Alongside the features delivered in this release, we’ve also been hard at work resolving bugs across the charms – please refer to the milestone bug report for the full details.

We’ve also introduced features to enable easier monitoring with Nagios and support for Keystone PKI tokens as well as some improvements in the failure detection capabilities of the percona-cluster charm when operating in HA mode.
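
As a rough sketch of how the Nagios support is typically consumed – the nrpe subordinate charm is an assumption here, and the exact relation wiring depends on the charm versions in use, so check the charm READMEs:

juju deploy nrpe
juju add-relation nrpe nova-compute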

You can get the full low down on all of the changes in this release from the official release notes.

OpenStack Summit Vancouver: Ubuntu OpenStack team presentations

Amongst the numerous submissions for speaking slots at the OpenStack Summit in Vancouver in May, you’ll find a select number of submissions from my team:

Multi-node OpenStack development on single system (Speakers: James Page, Corey Bryant)

Corey has been having some fun hacking on enabling deployment from source in the OpenStack Juju Charms for Ubuntu – come and hear about what we’ve done so far and how we’re trying to enable a multi-node OpenStack deployment from source on a single system using KVM and LXC containers, with devstack style reloads!

Scaling automated testing of Ubuntu OpenStack (Speakers: James Page, Ryan Beisner, Liam Young)

The Ubuntu OpenStack team have an ever-increasing challenge in supporting testing of numerous OpenStack versions on many different Ubuntu releases; we’ll be covering how we’ve used OpenStack itself to help us scale out our testing infrastructure to support these activities, as well as some of the technologies and tools we use to deploy and test OpenStack itself.

OpenStack HA Nirvana on Ubuntu (Speaker: James Page)

We’ve been able to deploy OpenStack in Highly Available configurations using Juju and Ubuntu since the Portland Summit in 2013 – since then we have evolved and battle-tested our HA reference architecture into a rock-solid solution to ensure availability of cloud services to end users.  This session will cover the Ubuntu OpenStack HA reference architecture in detail – we might even manage a demo as well!

Testing Openstack with Openstack (Speaker: Ryan Beisner)

Ryan Beisner has been leading Ubuntu OpenStack QA for Canonical since 2014; he’ll be deep-diving on the challenges faced in ensuring the quality of Ubuntu OpenStack and how we’ve leveraged the awesome tool set we have in Ubuntu for deploying and testing OpenStack to support testing of OpenStack both virtually and on bare metal, hundreds of times a day.

Also of interest, building on and around the base technology that the Ubuntu OpenStack team delivers:

OpenStack IPv6 Support (Speaker: Edward Hope-Morley)

Ed’s team have made great inroads into enabling Ubuntu OpenStack deployments in IPv6-only environments; he’ll be discussing the challenges encountered and how the team overcame them, as well as setting out some suggested improvements that would make IPv6 support a first-class citizen for OpenStack.

Autopiloting OpenStack (Speaker: Dean Henrichsmeyer)

Dean will be talking about how the Ubuntu OpenStack Autopilot pulls together all of the various technologies in Ubuntu (MAAS, Juju and OpenStack) to fully automate deployment and scale-out of complex OpenStack deployments on Ubuntu.

Containers for Dummies (Speaker: Tycho Andersen)

Tycho promises an enlightening and fun talk about containers introducing all the basic technologies in Linux that support containers – all done through the medium of pictures of cats!

You can find the full list of Canonical submissions here – see you all in Vancouver!

Ubuntu OpenStack Charms: 15.01 release

The Ubuntu Server team is pleased to announce their first interim release, 15.01, of charm features and fixes for the Ubuntu OpenStack charms for Juju – here are some selected highlights:

Clustering

General improvements have been made to the hacluster charm that we use for clustering OpenStack services; specifically the way quorum is handled in pacemaker and corosync has been improved so that clusters should react more appropriately in situations where one or more units fail.

We’ve also introduced a unicast mode for corosync cluster communication – this is useful in environments where multicast UDP might be disabled; in testing this has also proven much more reliable if you are running services under LXC containers spread across physical servers, and is the recommended configuration for these types of deployment.
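
For example, switching a cluster over to unicast communication is a single configuration change – the option name below reflects the hacluster charm at the time of writing, so double check it against your charm version (juju get hacluster will show the available options):

juju set hacluster corosync_transport=unicast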

Tuning

The ceph, ceph-osd, nova-compute and quantum-gateway charms have all gained a tuning configuration option which allows users to set sysctl options – we’ve provided some best-practice defaults in the ceph charms, but this feature will allow expert users to tune away to their hearts’ content!
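
The option takes a YAML mapping of sysctl keys to values – something along these lines, where the keys and values are purely illustrative rather than recommendations:

juju set ceph sysctl="{ kernel.pid_max: 2097152, fs.file-max: 4194303 }"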

High Availability

The ceilometer and ceph-radosgw charms have grown HA support (using the hacluster charm), and the quantum-gateway charm now has a configuration option for Icehouse users to enable a legacy HA mode (again using the hacluster charm) to ensure that routers and networks are recovered onto active gateway nodes in the event that a unit fails.
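
Enabling HA for one of these services follows the same pattern as the other OpenStack charms – a minimal sketch, assuming a spare virtual IP on the management network (the VIP and service name below are illustrative):

juju deploy hacluster ceilometer-hacluster
juju set ceilometer vip=10.20.0.100
juju add-relation ceilometer ceilometer-hacluster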

We’ve also improved the nova-cloud-controller charm so that guest console access can be used in HA deployments by providing a memcached back-end for token storage and sharing between units.

Nova Ceph Storage Support

The nova-compute charm has grown support for different storage back-ends; the first new back-end support is for Ceph, allowing users to use Ceph for default storage of instance root and ephemeral disks.  You’ll want to be running some serious networking to use this feature – remember all those reads and writes will be going over the network!
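
A rough sketch of how this is enabled – the libvirt-image-backend option name is what we used in testing, so check the charm configuration for your release:

juju set nova-compute libvirt-image-backend=rbd
juju add-relation nova-compute ceph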

And finally…

You can check out the list of bugs closed and read the full release notes – which contain more detail on these new features!

Thanks go to all the charm contributors:

  • Edward Hope-Morley
  • Billy Olsen
  • Liang Chen
  • Jorge Niedbalski
  • Xiang Hui
  • Felipe Reyes
  • Yaguang Tang
  • Seyeong Kim
  • Jorge Castro
  • Corey Bryant
  • Tom Haddon
  • Brad Marshall
  • Liam Young
  • Ryan Beisner

awesome job guys!

EOM

Extreme OpenStack: Scale testing OpenStack Messaging

Just prior to the Paris OpenStack Summit in November, the Ubuntu Server team had the opportunity to repeat and expand on the scale testing of OpenStack Icehouse that we did in the first quarter of last year with AMD and SeaMicro. HP were kind enough to grant us access to a few hundred servers in their Discovery Lab; specifically three chassis of HP ProLiant Moonshot m350 cartridges (540 in total).

The m350 is an 8-core Intel Atom based server with 16GB of RAM and 64GB of SSD based direct attached storage. They are designed for scale-out workloads, so not an immediately obvious choice for an OpenStack Cloud, but for the purposes of stretching OpenStack to the limit, having lots of servers is great as it puts load on central components in Neutron and Nova by having a large number of hypervisor edges to manage.

We had a few additional objectives for this round of scale testing, over and above re-validating our previous Icehouse scale test on the new Juno release of OpenStack:

  • Messaging: The default messaging solution for OpenStack on Ubuntu is RabbitMQ; alternative messaging solutions have been supported for some time – we wanted to specifically look at how ZeroMQ, a broker-less messaging option, scales in a large OpenStack deployment.
  • Hypervisor: The testing done previously was based on the libvirt/kvm stack with Nova; the LXC driver was available in an early alpha release, so poking at this looked like it might be fun.

As you would expect, we used the majority of the same tooling that we used in the previous scale test:

  • MAAS (Metal-as-a-Service) for deployment of physical server resources
  • Juju: installation and configuration of OpenStack on Ubuntu

In addition, we also decided to switch over to OpenStack Rally to complete the actual testing and benchmarking activities. During our previous scale test this project was still in its infancy, but it’s grown a lot of features in the last 9 months, including better support for configuring Neutron network resources as part of test context set-up.

Messaging Scale

The first comparison we wanted to test was between RabbitMQ and ZeroMQ; RabbitMQ has been the messaging workhorse for Ubuntu OpenStack deployments since our first release, but larger clouds do make high demands on a single message broker both in terms of connection concurrency and message throughput. ZeroMQ removes the central broker from the messaging topology, switching to a more directly connected edge topology.

The ZeroMQ driver in Oslo Messaging has been a little unloved over the last year or so, however some general stability improvements have been made – so it felt like a good time to take a look and see how it scales. For this part of the test we deployed a cloud of:

  • 8 Nova Controller units, configured as a cluster
  • 4 Neutron Controller units, configured as a cluster
  • Single MySQL, Keystone and Glance units
  • 300 Nova Compute units
  • Ganglia for monitoring

In order to push the physical servers as hard as possible, we also increased the default workers (cores x 4 vs cores x 2) and the CPU and RAM allocation ratios for the Nova scheduler. We then completed an initial 5000 instance boot/delete benchmark with a single RabbitMQ broker with a concurrency level of 150.  Rally takes this as configuration options for the test runner – in this test Rally executed 150 boot-delete tests in parallel, with 5000 iterations:

action               min (sec)   avg (sec)   max (sec)   90 percentile   95 percentile   success   count
total                    28.197      75.399     220.669         105.064         117.203    100.0%    5000
nova.boot_server         17.607      58.252     208.41           86.347          97.423    100.0%    5000
nova.delete_server        4.826      17.146     134.8            27.391          32.916    100.0%    5000
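
For reference, the Rally task we ran was along these lines – a sketch in Rally’s YAML task format, with the image and flavor names here as assumptions to adjust for your own cloud (user/tenant context omitted for brevity):

NovaServers.boot_and_delete_server:
  - args:
      flavor:
        name: m1.small
      image:
        name: cirros
    runner:
      type: constant
      times: 5000
      concurrency: 150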

Having established a baseline for RabbitMQ, we then redeployed and repeated the same test for ZeroMQ; we immediately hit issues with concurrent instance creation.  After some investigation and re-testing, the cause was found to be Neutron’s use of fanout messages for communicating with hypervisor edges; the ZeroMQ driver in Oslo Messaging has an inefficiency in that it creates a new TCP connection for every message it sends – when Neutron attempted to send fanout messages to all hypervisor edges with a concurrency level of anything over 10, the overhead of creating so many TCP connections caused the workers on the Neutron control nodes to back up, and Nova started to time out instance creation on network setup.

So the verdict on ZeroMQ scalability with OpenStack? Lots of promise but not there yet….

We introduced a new feature to the OpenStack Charms for Juju in the last charm release to allow use of different RabbitMQ brokers for Nova and Neutron, so we completed one last messaging test to look at this:

action               min (sec)   avg (sec)   max (sec)   90 percentile   95 percentile   success   count
total                    26.073     114.469     309.616         194.727         227.067     98.2%    5000
nova.boot_server         19.9       107.974     303.074         188.491         220.769     98.2%    5000
nova.delete_server        3.726       6.495      11.798           7.851           8.355     98.2%    5000

Unfortunately we had some networking problems in the lab which caused some slowdown and errors for instance creation, so this specific test proved a little inconclusive. However, by running split brokers, we were able to determine that:

  • Neutron peaked at ~10,000 messages/sec
  • Nova peaked at ~600 messages/sec
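
If you want to try the split-broker configuration yourself, the rough shape of it in Juju is to run a second rabbitmq-server service and point the Neutron charms at it – the service name below is illustrative, and the exact relation wiring depends on the charm versions in use:

juju deploy rabbitmq-server rabbitmq-neutron
juju add-relation neutron-api rabbitmq-neutron
juju add-relation neutron-openvswitch rabbitmq-neutron
juju add-relation quantum-gateway rabbitmq-neutron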

It’s also worth noting that the SSDs that the m350 cartridges use do make a huge difference, as the servers don’t suffer from the normal iowait times associated with spinning disks.

So in summary, RabbitMQ still remains the de facto choice for messaging in an Ubuntu OpenStack Cloud; it scales vertically very well – add more CPU and memory to your server and you can deal with a larger cloud – and benefits from fast storage.

ZeroMQ has a promising architecture but needs more work in the Oslo Messaging driver layer before it can be considered useful across all OpenStack components.

In my next post we’ll look at how hypervisor choice stacks up…

How we scaled OpenStack to launch 168,000 cloud instances

In the run up to the OpenStack summit in Atlanta, the Ubuntu Server team had its first opportunity to test OpenStack at real scale.

AMD made available 10 SeaMicro 15000 chassis in one of their test labs. Each chassis holds 64 servers, each with 4 cores and 2 threads per core (8 logical cores), 32GB of RAM and 500GB of storage attached via a storage fabric controller – creating the potential to scale an OpenStack deployment to a large number of compute nodes in a small rack footprint.

As you would expect, we chose the best tools for deploying OpenStack:

  • MAAS – Metal-as-a-Service, providing commissioning and provisioning of servers.
  • Juju – The service orchestration for Ubuntu, which we use to deploy OpenStack on Ubuntu using the OpenStack charms.
  • OpenStack Icehouse on Ubuntu 14.04 LTS.
  • CirrOS – a small-footprint, Linux-based cloud OS

MAAS has native support for enlisting a full SeaMicro 15k chassis in a single command – all you have to do is provide it with the MAC address of the chassis controller and a username and password.  A few minutes later, all servers in the chassis will be enlisted into MAAS ready for commissioning and deployment:

maas local node-group probe-and-enlist-hardware \
  nodegroup model=seamicro15k mac=00:21:53:13:0e:80 \
  username=admin password=password power_control=restapi2

Juju has been the Ubuntu Server team’s preferred method for deploying OpenStack on Ubuntu for as long as I can remember; Juju uses Charms to encapsulate the knowledge of how to deploy each part of OpenStack (a service) and how each service relates to the others – an example would include how Glance relates to MySQL for database storage, Keystone for authentication and authorization and (optionally) Ceph for actual image storage.

Using the charms and Juju, it’s possible to deploy complex OpenStack topologies using bundles, a yaml format for describing how to deploy a set of charms in a given configuration – take a look at the OpenStack bundle we used for this test to get a feel for how this works.
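
To give a flavour of the format, a heavily cut-down bundle might look something like the sketch below – this is purely illustrative (not the bundle we used), with charm revisions and most options omitted:

openstack-minimal:
  series: trusty
  services:
    mysql:
      charm: cs:trusty/mysql
      num_units: 1
    rabbitmq-server:
      charm: cs:trusty/rabbitmq-server
      num_units: 1
    keystone:
      charm: cs:trusty/keystone
      num_units: 1
    nova-cloud-controller:
      charm: cs:trusty/nova-cloud-controller
      num_units: 1
    nova-compute:
      charm: cs:trusty/nova-compute
      num_units: 4
  relations:
    - [ keystone, mysql ]
    - [ nova-cloud-controller, mysql ]
    - [ nova-cloud-controller, rabbitmq-server ]
    - [ nova-cloud-controller, keystone ]
    - [ nova-compute, nova-cloud-controller ]
    - [ nova-compute, rabbitmq-server ]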

Starting out small(ish)

Not all ten chassis were available from the outset of testing, so we started off with two chassis of servers to test and validate that everything was working as designed.   With 128 physical servers, we were able to put together a Neutron based OpenStack deployment with the following services:

  • 1 Juju bootstrap node (used by Juju to control the environment), Ganglia Master server
  • 1 Cloud Controller server
  • 1 MySQL database server
  • 1 RabbitMQ messaging server
  • 1 Keystone server
  • 1 Glance server
  • 3 Ceph storage servers
  • 1 Neutron Gateway network forwarding server
  • 118 Compute servers

We described this deployment using a Juju bundle, and used the juju-deployer tool to bootstrap and deploy the bundle to the MAAS environment controlling the two chassis.  Total deployment time for the two chassis, to the point of a usable OpenStack cloud, was around 35 minutes.

At this point we created 500 tenants in the cloud, each with its own private network (using Neutron), connected to the outside world via a shared public network.  The immediate impact of doing this is that Neutron creates dnsmasq instances, Open vSwitch ports and associated network namespaces on the Neutron Gateway data forwarding server – seeing this many instances of dnsmasq on a single server is impressive – and the server dealt with the load just fine!
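
The tenant and network creation was driven by a simple loop along these lines – a sketch only, with the tenant names, CIDR and external network name as assumptions (overlapping CIDRs are fine here, as each tenant network is isolated in its own namespace):

# assumes admin credentials are sourced and a shared external network 'ext-net' exists
for i in $(seq 1 500); do
    tenant_id=$(keystone tenant-create --name tenant-$i | awk '/ id / {print $4}')
    neutron net-create --tenant-id $tenant_id private-$i
    neutron subnet-create --tenant-id $tenant_id --name subnet-$i private-$i 192.168.0.0/24
    router_id=$(neutron router-create --tenant-id $tenant_id router-$i | awk '/ id / {print $4}')
    neutron router-interface-add $router_id subnet-$i
    neutron router-gateway-set $router_id ext-net
done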

Next we started creating instances; we looked at using Rally for this test, but it does not currently support using Neutron for instance creation testing, so we went with a simple shell script that created batches of servers (we used a batch size of 100 instances) and then waited for them to reach the ACTIVE state.  We used the CirrOS cloud image (developed and maintained by the Ubuntu Server team’s very own Scott Moser) with a custom Nova flavor with only 64 MB of RAM.
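
The batch creation was essentially a loop along the following lines – a simplified sketch, with the flavor, image and network names as assumptions rather than the exact script we ran:

# one-off: custom 64MB flavor as described above
nova flavor-create m64 auto 64 1 1

# boot a batch of 100 instances and wait for them to leave the BUILD state
NET_ID=$(neutron net-show private-1 | awk '/ id / {print $4}')
for i in $(seq 1 100); do
    nova boot --flavor m64 --image cirros-0.3.1 \
        --nic net-id=$NET_ID scale-$RANDOM-$i &
done
wait
while nova list --all-tenants | grep -q BUILD; do
    sleep 30
done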

We immediately hit our first bottleneck – by default, the Nova daemons on the Cloud Controller server will spawn sub-processes equivalent to the number of cores that the server has.  Neutron does not do this and we started seeing timeouts on the Nova Compute nodes waiting for VIF creation to complete.  Fortunately Neutron in Icehouse has the ability to configure worker threads, so we updated the nova-cloud-controller charm to set this configuration to a sensible default, and provide users of the charm with a configuration option to tweak this setting.  By default, Neutron is configured to match what Nova does, 1 process per core – using the charm configuration this can be scaled up using a simple multiplier – we went for 10 on the Cloud Controller node (80 neutron-server processes, 80 nova-api processes, 80 nova-conductor processes).  This allowed us to resolve the VIF creation timeout issue we hit in Nova.
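
With the updated charm this ends up being a one-line configuration change – the option name below is an assumption, so check the output of 'juju get nova-cloud-controller' for the actual knob in your charm version:

juju set nova-cloud-controller worker-multiplier=10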

At around 170 instances per compute server, we hit our next bottleneck; the Neutron agent status on compute nodes started to flap, with agents being marked down as instances were being created.  After some investigation, it turned out that the time required to parse and then update the iptables firewall rules at this instance density took longer than the default agent timeout – which is why agents kept dropping out from Neutron’s perspective.  This resulted in virtual interface (VIF) creation timing out, and we started to see instance activation failures when trying to create more than a few instances in parallel.  Without an immediate fix for this issue (see bug 1314189), we took the decision to turn Neutron security groups off in the deployment and run without any VIF level iptables security.  This was applied using the nova-compute charm we were using, but is obviously not something that will make it back into the official charm in the Juju charm store.

With the workaround in place on the Compute servers, we were able to create 27,000 instances on the 118 compute nodes. The API call times to create instances from the testing endpoint remained pretty stable during this test; however, as the Nova Compute servers got heavily loaded, the amount of time taken for all instances to reach the ACTIVE state did increase.

Doubling up

At this point AMD had another two chassis racked and ready for use, so we tore down the existing two chassis, updated the bundle to target compute services at the two new chassis and re-deployed the environment.  With a total of 256 servers being provisioned in parallel, the servers were up and running within about 60 minutes; however, we hit our first bottleneck in Juju.

The OpenStack charm bundle we use has a) quite a few services and b) a lot of relations between services – Juju was able to deploy the initial services just fine, however when the relations were added, the load on the Juju bootstrap node went very high and the Juju state service on this node started to throw a large number of errors and became unresponsive – this has been reported back to the Juju core development team (see bug 1318366).

We worked around this bottleneck by bringing up the original two chassis in full, and then adding each new chassis in series to avoid overloading the Juju state server in the same way.  This obviously takes longer (about 35 minutes per chassis) but did allow us to deploy a larger cloud with an extra 128 compute nodes, bringing the total number of compute nodes to 246 (118+128).

And then we hit our next bottleneck…

By default, the RabbitMQ packaging in Ubuntu does not explicitly set a file descriptor ulimit so it picks up the Ubuntu defaults – which are 1024 (soft) and 4096 (hard).  With 256 servers in the deployment, RabbitMQ hits this limit on concurrent connections and stops accepting new ones.  Fortunately it’s possible to raise this limit in /etc/default/rabbitmq-server – and as we were deployed using the rabbitmq-server charm, we were able to update the charm to raise this limit to something sensible (64k) and push that change into the running environment.  RabbitMQ restarted, problem solved.
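
The fix itself is tiny – /etc/default/rabbitmq-server is sourced before the broker starts, so raising the limit is a one-line addition (64k shown here, matching what we pushed into the charm):

# /etc/default/rabbitmq-server
ulimit -n 65536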

With the 4 chassis in place, we were able to scale up to 55,000 instances.

Ganglia was letting us know that load on the Nova Cloud Controller during instance setup was extremely high (15-20), so we decided at this point to add another unit to this service:

juju add-unit nova-cloud-controller

and within 15 minutes we had another Cloud Controller server up and running, automatically configured for load balancing of API requests with the existing server and sharing the load for RPC calls via RabbitMQ.   Load dropped, instance setup time decreased, instance creation throughput increased, problem solved.

Whilst we were working through these issues and performing the instance creation, AMD had another two chassis (6 & 7) racked, so we brought them into the deployment adding another 128 compute nodes to the cloud bringing the total to 374.

And then things exploded…

The number of instances that can be created in parallel is driven by two factors – 1) the number of compute nodes and 2) the number of workers across the Nova Cloud Controller servers.  However, with six chassis in place, we were not able to increase the parallel instance creation rate as much as we wanted to without getting connection resets between Neutron (on the Cloud Controllers) and the RabbitMQ broker.

The learning from this is that Neutron+Nova makes for an extremely noisy OpenStack deployment from a messaging perspective, and a single RabbitMQ server appeared to not be able to deal with this load.  This resulted in a large number of instance creation failures so we stopped testing and had a re-think.

A change in direction

After the failure we saw in the existing deployment design, and with more chassis still being racked by our friends at AMD, we still wanted to see how far we could push things; however with Neutron in the design, we could not realistically get past 5-6 chassis of servers, so we took the decision to remove Neutron from the cloud design and run with just Nova networking.

Fortunately this is a simple change to make when deploying OpenStack using charms, as the nova-cloud-controller charm has a single configuration option to select between Neutron and Nova networking. After tearing down and re-provisioning the 6 chassis:

juju destroy-environment maas
juju-deployer --bootstrap -c seamicro.yaml -d trusty-icehouse

with the revised configuration, we were able to create instances in batches of 100 at a respectable throughput of initially 4.5/sec – although this did degrade as load on compute servers went higher.  This allowed us to hit 75,000 running instances (with no failures) in 6hrs 33 mins, pushing through to 100,000 instances in 10hrs 49mins – again with no failures.

As we saw in the smaller test, the API invocation time was fairly constant throughout the test, with the total provisioning time through to the ACTIVE state increasing due to loading on the compute nodes.

Status check

OK – so we are now running an OpenStack Cloud on Ubuntu 14.04 across 6 SeaMicro chassis (1, 2, 3, 5, 6, 7 – 4 comes later) – a total of 384 servers (give or take one or two which would not provision).  The cumulative load across the cloud at this point was pretty impressive – Ganglia does a pretty good job of charting this.

AMD had two more chassis (8 & 9) in the racks which we had enlisted and commissioned, so we pulled them into the deployment as well; this did take some time – Juju was grinding pretty badly at this point and just running ‘juju add-unit -n 63 nova-compute-b6’ was taking 30 minutes to complete (reported upstream – see bug 1317909).

After a couple of hours we had another ~128 servers in the deployment, so we pushed on and created some more instances through to the 150,000 mark – as the instances were landing on the servers in the 2 new chassis, the load on the individual servers did increase more rapidly, so instance creation throughput slowed down faster, but the cloud managed the load.

Tipping point?

Prior to starting testing at any scale, we had some issues with one of the chassis (4) which AMD had resolved during testing, so we shoved that back into the cloud as well; after ensuring that the 64 extra servers were reporting correctly to Nova, we started creating instances again.

However, the instances kept scheduling onto the servers in the previous two chassis we added (8 & 9), with the new nodes not getting any instances.  It turned out that the servers in chassis 8 & 9 were AMD-based servers with twice the memory capacity; by default, Nova does not look at vCPU usage when making scheduling decisions, so as these 128 servers had more remaining memory capacity than the 64 new servers in chassis 4, they were still being targeted for instances.

Unfortunately I’d hopped onto the plane from Austin to Atlanta for a few hours so I did not notice this – and we hit our first 9 instance failures.  The 128 servers in Chassis 8 and 9 ended up with nearly 400 instances each – severely over-committing on CPU resources.

A few tweaks to the scheduler configuration, specifically turning on the CoreFilter and setting the CPU overcommit ratio to 32, applied to the Cloud Controller nodes using the Juju charm, and instances started to land on the servers in chassis 4.  This seems like a sane thing to do by default, so we will add this to the nova-cloud-controller charm with a configuration knob to allow the overcommit ratio to be altered.
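
Under the hood this boils down to two pieces of Nova scheduler configuration on the cloud controllers – roughly the following in nova.conf, with the filter list shortened here for clarity:

[DEFAULT]
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,CoreFilter,ComputeFilter
cpu_allocation_ratio = 32.0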

At the end of the day we had 168,000 instances running on the cloud – this may have got some coverage during the OpenStack summit….

The last word

Having access to this many real servers allowed us to exercise OpenStack, Juju, MAAS and our reference Charm configurations in a way that we have not been able to undertake before.  Exercising infrastructure management tools and configurations at this scale really helps shake out the scale pinch points – in this test we specifically addressed:

  • Worker thread configuration in the nova-cloud-controller charm
  • Bumping open file descriptor ulimits in the rabbitmq-server charm to allow more concurrent connections
  • Tweaking the maximum number of mysql connections via charm configuration
  • Ensuring that the CoreFilter is enabled to avoid potential extreme overcommit on nova-compute nodes.

There were a few things we could not address during the testing for which we had to find workarounds:

  • Scaling a Neutron-based cloud past 256 physical servers
  • High instance density on nova-compute nodes with Neutron security groups enabled.
  • High relation creation concurrency in the Juju state server causing failures and poor performance from the juju command line tool.

We have some changes in the pipeline to the nova-cloud-controller and nova-compute charms to make it easier to split Neutron services onto different underlying messaging and database services.  This will allow the messaging load to be spread across different message brokers, which should allow us to scale a Neutron based OpenStack cloud to a much higher level than we achieved during this testing.  We did find a number of other smaller niggles related to scalability – check out the full list of reported bugs.

And finally some thanks:

  • Blake Rouse for doing the enablement work for the SeaMicro chassis and getting us up and running at the start of the test.
  • Ryan Harper for kicking off the initial bundle configuration development and testing approach (whilst I was taking a break- thanks!) and shaking out the initial kinks.
  • Scott Moser for his enviable scripting skills which made managing so many servers a whole lot easier – MAAS has a great CLI – and for writing CirrOS.
  • Michael Partridge and his team at AMD for getting so many servers racked and stacked in such a short period of time.
  • All of the developers who contribute to OpenStack, MAAS and Juju!

.. you are all completely awesome!

OpenStack 2014.1 for Ubuntu 12.04 and 14.04 LTS

I’m pleased to announce the general availability of OpenStack 2014.1 (Icehouse) in Ubuntu 14.04 LTS and in the Ubuntu Cloud Archive (UCA) for Ubuntu 12.04 LTS.

Users of Ubuntu 14.04 need take no further action other than follow their favourite install guide – but do take some time to check out the release notes for Ubuntu 14.04.

Ubuntu 12.04 users can enable the Icehouse pocket of the UCA by running:

sudo add-apt-repository cloud-archive:icehouse

The Icehouse pocket of the UCA also includes updates for associated packages including Ceph 0.79 (which will be updated to the Ceph 0.80 Firefly stable release), Open vSwitch 2.0.1, qemu 2.0.0 and libvirt 1.2.2 – you can check out the full list here.
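
Once the archive is enabled, a quick way to confirm which versions you’ll pull in is to check the apt policy for a few of the key packages (the package names here are just examples):

sudo apt-get update
apt-cache policy nova-common ceph openvswitch-switch qemu-kvm libvirt-bin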

Thanks goes to all of the people who have contributed to making OpenStack rock this release cycle – both upstream and in Ubuntu!

Remember that you can report bugs on packages from the UCA for Ubuntu 12.04 and from Ubuntu 14.04 using the ubuntu-bug tool – for example:

ubuntu-bug nova

will report the bug in the right place on launchpad and add some basic information about your installation.

The Juju charms for OpenStack have also been updated to support deployment of OpenStack Icehouse on Ubuntu 14.04 and Ubuntu 12.04.  Read the charm release notes for more details on the new features that have been enabled during this development cycle.

Canonical have a more concise install guide in the pipeline for deploying OpenStack using Juju and MAAS  – watch this space for more information…

EOM

 

Call for testing: Juju and gccgo

Today I uploaded juju-core 1.17.0-0ubuntu2 to the Ubuntu Trusty archive.

This version of the juju-core package provides Juju binaries built using both the golang gc compiler and the gccgo 4.8 compiler that we have for 14.04.

The objective for 14.04 is to have a single toolchain for Go that can support x86, ARM and Power architectures. Currently the only way we can do this is to use gccgo instead of golang-go.

This initial build still only provides packages for x86 and armhf; other architectures will follow once we have sorted out exactly how to provide the ‘go’ tool on platforms other than these.

By default, you’ll still be using the golang gc built binaries; to switch to using the gccgo built versions:

sudo update-alternatives --set juju /usr/lib/juju-1.17.0-gcc/bin/juju

and to switch back:

sudo update-alternatives --set juju /usr/lib/juju-1.17.0/bin/juju

Having both versions available should make diagnosing any gccgo specific issues a bit easier.
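
If you lose track of which build is currently active, update-alternatives can tell you:

update-alternatives --display juju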

To push the local copy of the jujud binary into your environment use:

juju bootstrap --upload-tools

This is not recommended for production use but will ensure that you are testing the gccgo built binaries on both client and server.

Thanks to Dave Cheney and the rest of the Juju development team for all of the work over the last few months to update the codebases for Juju and its dependencies to support gccgo!

Targeted machine deployment with Juju

As I blogged previously, it’s possible to deploy multiple charms to a single physical server using KVM, Juju and MAAS with the virtme charm.

With earlier versions of Juju it was also possible to use the ‘jitsu deploy-to’ hack to deploy multiple charms onto a single server without any separation; however, this had some limitations, specifically around use of ‘juju add-unit’, which just did crazy things and made this hack not particularly useful in real-world deployments.  It also does not work with the latest versions of Juju, which no longer use ZooKeeper for co-ordination.

As of the latest release of Juju (available in this PPA and in Ubuntu Saucy), Juju now has native support for specifying which machine a charm should be deployed to:

juju bootstrap --constraints="mem=4G"
juju deploy --to 0 mysql
juju deploy --to 0 rabbitmq-server

This will result in an environment with a bootstrap machine (0) which is also running both mysql and rabbitmq:

$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.11.4
    dns-name: 10.5.0.41
    instance-id: 37f3e394-007c-42b9-8bde-c14ae41f50da
    series: precise
    hardware: arch=amd64 cpu-cores=2 mem=4096M
services:
  mysql:
    charm: cs:precise/mysql-26
    exposed: false
    relations:
      cluster:
      - mysql
    units:
      mysql/0:
        agent-state: started
        agent-version: 1.11.4
        machine: "0"
        public-address: 10.5.0.41
  rabbitmq-server:
    charm: cs:precise/rabbitmq-server-12
    exposed: false
    relations:
      cluster:
      - rabbitmq-server
    units:
      rabbitmq-server/0:
        agent-state: started
        agent-version: 1.11.4
        machine: "0"
        public-address: 10.5.0.41

Note that you need to know the identifier of the machine that you are going to “deploy --to” – in all deployments, machine 0 is always the bootstrap node, so the above example works nicely.

As of the latest release of Juju, the ‘add-unit’ command also supports the --to option, so it’s now possible to specifically target machines when expanding service capacity:

juju deploy --constraints="mem=4G" openstack-dashboard
juju add-unit --to 1 rabbitmq-server

I should now have a second machine running both the openstack-dashboard service and a second unit of the rabbitmq-server service:

$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.11.4
    dns-name: 10.5.0.44
    instance-id: 99a06a9b-a9f9-4c4a-bce3-3b87fbc869ee
    series: precise
    hardware: arch=amd64 cpu-cores=2 mem=4096M
  "1":
    agent-state: started
    agent-version: 1.11.4
    dns-name: 10.5.0.45
    instance-id: d1c6788a-d120-44c3-8c55-03aece997fd7
    series: precise
    hardware: arch=amd64 cpu-cores=2 mem=4096M
services:
  mysql:
    charm: cs:precise/mysql-26
    exposed: false
    relations:
      cluster:
      - mysql
    units:
      mysql/0:
        agent-state: started
        agent-version: 1.11.4
        machine: "0"
        public-address: 10.5.0.44
  openstack-dashboard:
    charm: cs:precise/openstack-dashboard-9
    exposed: false
    relations:
      cluster:
      - openstack-dashboard
    units:
      openstack-dashboard/0:
        agent-state: started
        agent-version: 1.11.4
        machine: "1"
        public-address: 10.5.0.45
  rabbitmq-server:
    charm: cs:precise/rabbitmq-server-12
    exposed: false
    relations:
      cluster:
      - rabbitmq-server
    units:
      rabbitmq-server/0:
        agent-state: started
        agent-version: 1.11.4
        machine: "0"
        public-address: 10.5.0.44
      rabbitmq-server/1:
        agent-state: started
        agent-version: 1.11.4
        machine: "1"
        public-address: 10.5.0.45

These two features make it much easier to deploy complex services such as OpenStack which use a large number of charms on a limited number of physical servers.
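
Putting the two together, a compact OpenStack control plane might be stood up along these lines – the constraints and choice of services here are purely illustrative:

juju bootstrap --constraints="mem=16G"
juju deploy --to 0 mysql
juju deploy --to 0 rabbitmq-server
juju deploy --to 0 keystone
juju deploy --to 0 glance
juju deploy --constraints="mem=96G" nova-compute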

There are still a few gotchas:

  • Charms are running without any separation, so it’s entirely possible for charms to stamp all over each other’s configuration files and try to bind to the same network ports.
  • Not all of the OpenStack Charms are compatible with the latest version of Juju – this is being worked on – check out the OpenStack Charmers branches on Launchpad.

Juju is due to deliver a feature that will provide full separation of services using containers which will resolve the separation challenge.

For the OpenStack Charms, the OpenStack Charmers team will be aiming to limit file-system conflicts as much as possible – specifically in charms that won’t work well in containers, such as nova-compute, ceph and quantum/neutron-gateway, because they make direct use of kernel features and network/storage devices.

Mixing physical and virtual servers with Juju and MAAS

This is one of the most common questions I get asked about deploying OpenStack on Ubuntu using Juju and MAAS:

How can we reduce the number of servers required to deploy a small OpenStack Cloud?

OpenStack has a number of lighter weight services which don’t really make best use of anything other than the cheapest of cheap servers in this type of deployment; this includes the cinder, glance, keystone, nova-cloud-controller, swift-proxy, rabbitmq-server and mysql charms.

Ultimately Juju will solve the problem of service density in physical server deployments by natively supporting deployment of multiple charms onto the same physical servers; but in the interim I’ve hacked together a Juju charm, “virtme”, which can be deployed using Juju and MAAS to virtualize a physical server into a number of KVM instances which are also managed by MAAS.

Using this charm in conjunction with juju-jitsu allows you to make the most of a limited number of physical servers; I’ve been using this charm in a raring based Juju + MAAS environment:

juju bootstrap
(mkdir -p raring; cd raring; bzr branch lp:~virtual-maasers/charms/precise/virtme/trunk virtme)
jitsu deploy-to 0 --config config.yaml --repository . local:virtme

Some time later you should have an additional 7 servers registered into the MAAS controlling the environment ready for use. The virtme charm is deployed directly to the bootstrap node in the environment – so at this point the environment is using just one physical server.

The config.yaml file contains some general configuration for virtme:

virtme:
  maas-url: "http://<maas_hostname>/MAAS"
  maas-credentials: "<maas_token>"
  ports: "em2"
  vm-ports-per-net: 2
  vm-memory: 4096
  vm-cpus: 2
  num-vms: 7
  vm-disks: "10G 60G"

virtme uses Open vSwitch to provide bridging between KVM instances and the physical network; right now this requires a dedicated port on the server to be cabled correctly – this is configured using ‘ports’. Each KVM instance will be configured with ‘vm-ports-per-net’ network ports on the Open vSwitch bridge.

virtme also requires a URL and credentials for the MAAS cluster controller managing the environment; it uses this to register the details of the KVM instances it creates back into MAAS. Power control is supported using libvirt; virtme configures the libvirt daemon on the physical server to listen on the network and MAAS uses this to power control the KVM instances.

Right now the specification of the KVM instances is a little clunky – in the example above, virtme will create 7 instances with 2 vCPUs, 4096MB of memory and two disks: a root partition that is 10G and a secondary disk of 60G. I’d like to refactor this into something a little richer to describe instances; maybe something like:

vms:
  small:
    - count: 7
    - cpu: 2
    - mem: 4096
    - networks: [ eth1, eth2 ]
    - disks: [ 10G, 20G ]

Now that the environment has a number of smaller, virtualized instances, I can deploy some OpenStack services onto these units:

juju deploy keystone
juju deploy mysql
juju deploy glance
juju deploy rabbitmq-server
....

leaving your bigger servers free to use for nova-compute:

juju deploy -n 6 --constraints="mem=96G" nova-compute

WARNING: right now libvirt is configured with no authentication or security on its network connection; this has obvious security implications! Future iterations of this charm will probably support SASL or SSH based security.

BOOTNOTE: virtme is still work-in-progress and is likely to change; if you find it useful let me know about what you like/hate!

Wrestling the Cephalopod

Ceph is a distributed storage and network file system designed to provide excellent performance, reliability, and scalability.

Sounds pretty cool right?

Ceph is forging the way in delivering petabyte/exabyte scale storage to thousands of clients using commodity hardware.

This post outlines some of the key activities that the Ubuntu Server Team have undertaken during the Ubuntu 12.10 development cycle to improve the Ceph experience on Ubuntu.

Chasing the Argonaut

Ubuntu 12.10 features Ceph 0.48.2 ‘Argonaut’, the first release of Ceph with long-term support.

While development continues at a blistering pace and new releases will contain new features, the 0.48.x series will only receive critical bug-fixes and stability improvements.

This is a really important step for Ceph deployments; having a stable, supported release to baseline on is critical to the operation and stability of production environments.

For more information on the 0.48.x releases, see the release notes for Ceph.

The ‘Missing Bits’

For Ubuntu 12.04, Ceph was included in Ubuntu ‘main’ which means that it receives an increased level of focus from both the Ubuntu Server and Security teams (underwritten by Canonical) for the lifecycle of the Ubuntu release.  However, to make this happen for the 12.04 release, some features of the packaging had to be disabled.

The good news is that those missing features have now been re-enabled in Ubuntu 12.10:

  • The RADOS Gateway (radosgw) provides a RESTful, S3 and Swift compatible gateway for storage and retrieval of objects in a Ceph cluster.
  • Ceph now uses Google Perftools (gperftools) on x86 architectures, providing higher performance memory allocation.

This re-aligns the Ubuntu packaging with the packages available directly from Ceph and in Debian.

Juju Deployment

Ceph can now be deployed effectively using Juju, the service orchestration tool for Ubuntu Server.

The Ceph charms for Juju build upon the automation work done by Tommi Virtanen from Inktank (who I think should win an award for his innovative use of Upstart for bootstrapping Ceph Object Storage Daemons).

The charms are still pending review for entry into the Juju Charm Store as the official charms but if you want to try them out:

cat > config.yaml << EOF
ceph:
  fsid: ecbb8960-0e21-11e2-b495-83a88f44db01 
  monitor-secret: AQD1P2xQiKglDhAA4NGUF5j38Mhq56qwz+45wg==
  osd-devices: /dev/vdb /dev/vdc /dev/vdd /dev/vde
  ephemeral-unmount: /mnt
EOF
juju deploy -n 3 --config config.yaml --constraints="cpu=2" cs:~james-page/quantal/ceph

Some time later you should have a small three node Ceph cluster up and running.  You can then expand it with further storage nodes:

cat >> config.yaml << EOF
ceph-osd:
  osd-devices: /dev/vdb /dev/vdc /dev/vdd /dev/vde
  ephemeral-unmount: /mnt
EOF
juju deploy -n 3 --config config.yaml --constraints="cpu=2" cs:~james-page/quantal/ceph-osd
juju add-relation ceph ceph-osd

And then add a RADOS Gateway for RESTful access:

juju deploy cs:~james-page/quantal/ceph-radosgw
juju add-relation ceph ceph-radosgw
juju expose ceph-radosgw

The ceph-radosgw charm can also be scaled-out and fronted with haproxy:

juju add-unit -n 2 ceph-radosgw
juju deploy cs:precise/haproxy
juju add-relation haproxy ceph-radosgw

At this point you should have a three node Ceph cluster, a set of additional OSD nodes and a RADOS Gateway scaled out behind haproxy.

Note that the above examples assume that you have a Juju environment already configured and bootstrapped – if you have not, read this first.

The ceph and ceph-osd charms require additional block storage devices to work correctly so will not work with the Juju local provider; they have been tested in OpenStack, ec2 and MAAS environments and generally work OK (aside from one issue when ec2 instances get domU-XX hostnames).

All of the charms have READMEs – take a look to find out more.

Credit to Paul Collins from the Canonical IS Projects team for initial work on the ceph charm.

OpenStack Integration

OpenStack provides direct integration with Ceph in two ways:

  • Glance: storage of images that will be used for virtual machine instances in the cloud
  • Volumes: persistent block storage devices which can be attached to virtual machine instances

Due to the scalable, resilient nature of Ceph, integration with OpenStack presents a compelling proposition.

Sebastien Han has already done a great job of explaining how to configure and use these features in OpenStack so I’m not going to go into the finer details here.
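
For reference, the integration boils down to a handful of configuration options in Glance and the volume service – roughly along these lines for the Folsom-era releases (see Sebastien’s posts for the authoritative details; pool and user names are illustrative):

# glance-api.conf
default_store = rbd
rbd_store_user = glance
rbd_store_pool = images

# nova.conf (nova-volume)
volume_driver = nova.volume.driver.RBDDriver
rbd_pool = volumes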

The OpenStack Juju charms for Ubuntu 12.10 will be updated to optionally use Ceph as a block and object storage back-end; here’s a preview:

juju add-relation glance ceph
juju add-relation nova-volume ceph
juju add-relation nova-compute ceph

Job done…

What’s next?

Ceph plans for the next Ubuntu release might include:

  • Daily automated testing of Ceph on Ubuntu; the test is written, it just needs automating.
  • Making Ceph part of the per-commit testing of OpenStack that we do on Ubuntu.
  • Updating to the next Ceph LTS release.
  • Improving the out-of-the box configuration of the RADOS Gateway.
  • Using upstart configurations by default in the packaging.
  • Figuring out how to deliver Ceph releases to Ubuntu 12.04 so users who want to stick on the Ubuntu LTS can use the Ceph LTS.

Follow the Ceph Blueprint and UDS-R session to see how this pans out.