On the Monday of the project teams gathering in Dublin a now somewhat familiar gathering of developers and operators got together to discuss upgrades – specifically fast forward upgrades but discussion over the day drifted into rolling upgrades and how to minimize downtime in supporting components as well. This discussion has been a regular feature over the last 18 months at PTG’s, Forums and Ops Meetups.
Fast Forward Upgrades?
So what is a fast forward upgrade? A fast forward upgrade takes an OpenStack deployment through multiple OpenStack releases without the requirement to run agents/daemons at each upgrade step; it does not allow you to skip an OpenStack release – the process allows you to just not run a release as you pass through it. This enables operators using older OpenStack releases to catch up with the latest OpenStack release in as short an amount of time as possible, accepting the compromise that the cloud control plane is down during the upgrade process.
This is somewhat adjunct to a rolling upgrade, where access to the control plane of the cloud is maintained during the upgrade process by upgrading units of a specific service individually, and leveraging database migration approaches such as expand/migrate/contract (EMC) to provide as seamless an upgrade process as possible for an OpenStack cloud. In common with fast forward upgrades, releases cannot be skipped.
Both upgrade approaches specifically aim to not disrupt the data plane of the cloud – instances, networking and storage – however this may be unavoidable if components such as Open vSwitch and the Linux kernel need to be upgraded as part of the upgrade process.
Deployment Project Updates
The TripleO team have been working towards fast forward upgrades during the Queens cycle and have a ‘pretty well defined model’ for what they’re aiming for with their upgrade process. They still have some challenges around ordering to minimize downtime specifically around Linux and OVS upgrades.
The OpenStack Ansible team gave an update – they have a concept of ‘leap upgrades’ which is similar to fast-forward upgrades – this work appears to lag behind the main upgrade path for OSA, which is a rolling upgrade approach which aims to be 100% online.
The OpenStack Charms team still continue to have a primary upgrade focus on rolling upgrades, minimizing downtime as much as possible for both the control and data plane of the Cloud. The primary focus for this team right now is supporting upgrades of the underlying Ubuntu OS between LTS releases with the imminent release of 18.04 on the horizon in April 2018, so no immediate work is planned on adopting fast-forward upgrades.
The Kolla team also have a primary focus on rolling upgrades, for which support starts at OpenStack Queens or later. There was some general discussion around automated configuration generation using Oslo to ease migration between OpenStack releases.
No one was present to represent the OpenStack Helm team.
Keeping Networking Alive
Challenges around keeping the Neutron data-plane alive during an upgrade where discussed – this included:
- Minimising Open vSwitch downtime by saving and restoring flows.
- Use of the ‘neutron-ha-tool’ from AT&T to manage routers across network nodes during an OpenStack cloud upgrade – there was also a bit of bike shedding on approaches to Neutron router HA in larger clouds. Plan are afoot to endeavor to make this part of the neutron code base.
We had a specific slot to discuss upgrade Ceph as part of an OpenStack Cloud upgrade; some deployment projects upgrade Ceph first (Charms), some last (TripleO) but there was general agreement that Ceph upgrades are pretty much always a rolling upgrade – i.e. no disruption to the storage services being provided. Generally there seems to be less pain in this area so it was not a long session.
A number of operators shared experiences of walking their OpenStack deployments through fast forward upgrades including some of the gotchas and trip hazards encountered.
Oath provided a lot of feedback on their experience of fast-forward upgrading their cloud from Juno to Ocata which included some increased complexity due to the move to using cells internally for Ocata. Ensuring compatibility between OpenStack and supporting projects was one challenge encountered – for example, snapshots worked fine with Juno and Libvirt 1.5.3, however on upgrade live snapshots where broken until Libvirt was upgraded to 2.9.0. Not all test combinations are covered in the gate!
Some of these have been shared on the OpenStack Wiki.
Upgrade discussion has become a regular fixture at PTG’s, Forums, Summits and Meetups over the last few years; getting it right is tricky and the general feeling in the session was that this is something that we should talk about more between events.
The formation of an Upgrade SIG was proposed and supported by key participants in the session. The objective of the SIG is to improve the overall upgrade process for OpenStack Clouds, covering both offline ‘fast-forward’ and online ‘rolling’ upgrades by providing a forum for cross-project collaboration between operators and developers to document and codify best practice for upgrading OpenStack.
The SIG will initially be led by Lujin Luo (Fujitsu), Lee Yarwood (Redhat) and myself (Canonical) – we’ll be sorting out the schedule for bi-weekly IRC meetings in the next week or so – OpenStack operators and developers from across all projects are invited to participate in the SIG and help move OpenStack life cycle management forward!