On Wed, Aug 29, 2012 at 10:52 AM, Ryan Lane <rla...@wikimedia.org> wrote:
>> There was talk of trying to set up test infrastructure that would
>> roll out Essex and then upgrade it to Folsom in some automated
>> fashion so we could start learning where it breaks. Was there any
>> forward momentum on that?
>
> This would be awesome. Wrapping automated tests around upgrades would
> greatly improve the situation. Most of the issues that ops runs into
> during upgrades are unexpected changes, which are the same things that
> will likely be hit when testing upgrades in an automated way.
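To make it concrete, a harness like that could be little more than
"deploy, seed state, upgrade, assert". A rough sketch of what I have
in mind is below - the deploy/upgrade scripts are hypothetical
stand-ins for whatever tooling (devstack, puppet, ...) actually drives
the cloud, not real infrastructure:

    #!/usr/bin/env python
    # Sketch of an automated Essex -> Folsom upgrade test.
    # deploy-cloud.sh and upgrade-cloud.sh are hypothetical helpers
    # standing in for real deployment tooling.

    import subprocess

    def run(cmd):
        """Run a shell command, raising on failure."""
        subprocess.check_call(cmd, shell=True)

    def deploy(release):
        # Stand up a throwaway cloud at the given release.
        run("./deploy-cloud.sh %s" % release)

    def seed_workload():
        # Create state that must survive the upgrade: an instance
        # (and in a fuller test, images, volumes, keypairs, ...).
        run("nova boot --image cirros --flavor m1.tiny canary")

    def upgrade_to(release):
        # Apply packages and run db migrations for the target
        # release, service by service.
        run("./upgrade-cloud.sh %s" % release)

    def smoke_test():
        # The interesting part: assert pre-upgrade state is intact
        # and the APIs still behave.
        run("nova list | grep canary")

    if __name__ == "__main__":
        deploy("essex")
        seed_workload()
        upgrade_to("folsom")
        smoke_test()

The value is less in the happy path than in growing smoke_test() every
time an upgrade surprises ops, so each "unexpected change" only bites
once.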
It would be fascinating (for me at least :)) to know the upgrade
process you use: how many stages do you use? Do you have multiple
regions, and do you use one or some of them as canaries? Does the
downtime required to do an upgrade affect you? Do you run skewed
versions (e.g. folsom nova, essex glance) or do you do lock-step
upgrades of all the components?
For Launchpad we've been moving more and more to a model of permitting
temporary skew so that we can do rolling upgrades of the component
services. That seems doable in principle here - and could make it
easier to smoothly transition between versions, at the cost of a
(small) amount of attention to detail while writing changes to the