Why we need a special flavor of DevOps for mature IT

If there is one thing that has struck me from my journey optimizing the software delivery process of a traditional enterprise and from the feedback I get from the people involved in the DevOps community – the majority of whom work in agile contexts – it is that traditional enterprises work very differently from modern high-performing companies.

I have already discussed this topic at length in a previous post, where I took this difference as a given and simply wondered whether the patterns that make these modern companies so successful could also be applied to traditional enterprises.

Here is a slide from my DevOpsDays London talk that summarizes their differences:

[Image: traditional vs. modern company]

In this post I will take it one step further and ask why they are so different in the first place. And, if this difference is only “cultural” rather than “genetic”, whether it is possible to transform traditional enterprises into modern ones, instead of awkwardly trying to squeeze these successful patterns (I’m talking about agile, DevOps and CD here) into them.

Value add vs utility

A comment by Steve Smith on my blog led me to an interesting article by Martin Fowler, in which he refers to another interesting article by Ross Pettit. Both posts hint at a possible root cause for this difference. The argument goes that IT projects are best split into two distinct categories: projects that add value and projects that merely provide a utility. The main difference between the two is that value adding projects aim to strategically differentiate the company from its competitors and are therefore critical for its overall success. Utility projects aim to deliver functions that are needed for the company to run its business but that are not directly related to its success.

According to the articles, both types of projects must be treated in a very different way: value adding projects must deliver the best possible result (these are where you should put your money) and utility projects must deliver the cheapest possible result that gets the job done. The underlying reasoning is that if you do a value adding project slightly better you will get a considerably higher gain and if you do a utility project a lot better you will get a very small gain.

Because of their different nature, the IT side of both types of projects must also be managed very differently: value adding projects are all about agility and time-to-market: if you can be the first one to deliver a new function (or “get it right”) you will have an edge over your competitors. Utility projects are all about reducing risk and increasing stability: you don’t want the function to break, because that would cost you a lot of money, but you are OK with it working just well enough.

In the end it’s the CEO who decides which projects are value adding (or strategic) and which ones are utility, keeping in mind that projects can move from one category to the other over time, e.g. a function that was initially considered strategic may be switched to utility if it turns out that the competition has closed the gap. The articles estimate that between 70 and 95% of all projects in a company are utility.

Traditional enterprises have more utility

OK, back now to our initial question: what does all this have to do with our search for the root causes of the differences between traditional and modern companies?

I think there are several interesting points here:

  1. Traditional companies are obviously a lot older than modern ones so they have had plenty of time to grow several layers of “value adding turned into utility” software, each layer representing a different period in the past. This software will most probably not have decent support for automation of testing, deployment etc. (although this may be different for the current state-of-the-art software that will be utility in a couple of years).
  2. Modern companies probably have more modern utility packages that may provide better support for automated testing and deployment, have a lightweight decoupled architecture and offer decent integration points. Traditional companies will probably depend mainly on old packages that lack all of this fancy stuff.

My main take-away point from the articles is that traditional companies have considerably more utility software than modern ones, more to the 95% than to the 70%, and in addition this utility software is older and has therefore less support for automation.

If we want to modernize the traditional enterprise, the big question now becomes: what should we do with all this utility software? Should we still try to automate its software delivery process? As we just saw, one of the properties of utility software is that we should keep it low cost because it doesn’t bring us any noteworthy strategic advantage. Then maybe we should just leave utility software out of the equation – accepting a low and painful change rate for it – and concentrate instead on value adding software?

I can’t say.

One last remark on this topic: it looks to me that we will see the same process repeat itself, with the modern companies of today becoming the traditional ones of tomorrow, built on technologies that will not support out of the box whatever requirements are state-of-the-art in the future, so they too will end up with a large stack of “old” utility software. (Note to the Ruby, Node.js and Go developers: enjoy your status as cool kids while you can, because in a couple of years we will all look down on you as old-school developers :-))

Shortcomings of the model

This simple metaphor of utility and value has not been able to give us a definitive answer so far. Maybe it just doesn’t account for the complex reality we are faced with when looking for our answers? Maybe we should extend the model or replace it with a better one, and see where that gets us?

I have found a number of important shortcomings that I would like the new model to take into account.

To start with, the original model assumes that a competitive advantage can only be reached through value adding projects. But many companies use low cost as their competitive advantage. And in mature markets there is very little opportunity to explore new territory, so consistently finding competitive advantages there may prove to be very difficult. All this means that the strategic focus will be more on finding ways to run utility projects efficiently, as well as on finding the best sales and marketing people to differentiate your product. At the other extreme, value adding projects don’t always have the intention of leading to a direct competitive advantage: some companies have a big R&D department which is all about adding value in the long run but is not necessarily a strategic differentiator at the current moment.

Secondly, an important fact is not taken into account: value adding projects may have critical dependencies on the “value add turned utility” projects. The latter were there first and therefore represent the core of the application landscape. So they too should be kept in a modern, releasable state. But exactly because many of the younger projects depend on them, they should focus on avoiding risk. Because of their maturity they have less need for flexibility and time-to-market and could instead take a steadier path of evolution, which resonates well with risk-aversion (especially taking into account their low-automation context).

Layers of software by age, similar to the geologic layers of the earth, and their dependencies, indicated by the dashed lines:

[Image: layers of software by age]

Finally, the articles make the implicit assumption that value adding projects are all about trying to find ways to innovate, in an attempt to keep an edge over the competition. I believe that there are many markets where this is not the case. Markets with high entrance barriers or that favor monopolistic behavior are examples where added value remains stable for longer periods. I argue that these “mature” value adding projects could just as well be treated as “value add turned utility” projects. Because they are so similar it would be hard to draw a line between them anyway.

Introducing a new category

In the new model I’m going to introduce a category “core” that represents the “value add turned utility” and “mature value added” projects, while the value add category is now limited to the evolutive value adding projects:

[Image: new model with utility, core and value add]

As with the other two categories, this core category also has a very specific nature and must therefore be treated in its own way. Compared to value adding projects the core projects have more dependencies and should therefore focus more on risk-aversion. As they are more mature they have less need for creative ad hoc thinking and quick time-to-market. They are also less likely to deliver high returns, so cost effectiveness should get a bigger focus. For these reasons they may be a better fit for process-centric approaches like ITIL and CMMi (one of the two M’s stands for … maturity) than the value adding projects.

On the other hand they are too important to be merely considered utility projects and be left to the destructive forces of cost efficiency and outsourcing. Killer features get the attention of your customers, but if you don’t have a high-quality application around them they will not get you very far.

Once we have accepted that each category is indeed best off getting a dedicated treatment, we should think about how to make the value adding and the core projects co-exist with each other, considering how interdependent they are. Similar to how value adding projects can put a high pressure on the ops side to deliver their features to production, the same can happen to the core projects whenever the value adding projects need them to implement features they depend on. And this issue will remain, even if both project types are independent from each other in terms of infrastructure, project methodology, technology choices, compliance, release process, etc.

Another point of attention is how we should handle the transition from value add to core, and when exactly this should happen during the coming of age of a project. In rigid corporate environments these are non-trivial problems to solve, especially if there is a big gap between how these two types of projects are managed.

DevOps for mature IT

At the moment DevOps is focused on the evolutive value add category which is only a small subset of all projects. There is another very important category, the core one, that has its own specific treatment. As it represents the bulk of the applications in traditional enterprises and a considerable percentage in even the modern ones, its software delivery process also deserves our full attention. We should therefore try to improve it by means of the three levels of DevOps: communication, process and automation. But this time we should use the context of these core projects as the starting point. Maybe we should even create a special flavor of DevOps for this category? Let us call it “DevOps for mature IT”!

While writing this post I learned that traditional enterprises don’t only run core/utility projects and modern companies don’t only run value adding ones. No, they run both project types, just in varying degrees. Whatever solution we come up with to improve the software delivery process of a company should therefore support both value adding and core projects, including the integrations between the two.

Wrap up

We have started with the question why traditional enterprises are so different from modern ones. We have looked at a simple though powerful model that splits up projects into utility and value add ones and have concluded that traditional enterprises have more utility than modern ones.

But the model was not conclusive as to how to treat these utility projects. We found some shortcomings which we then solved by extending the model with a core category. We concluded that this category is characterized by mature and interdependent applications that are critical for the success of the company. Therefore its software delivery process also needs our full attention. But because its context is so different from the value adding category we should create a special flavor of DevOps for it which we called “DevOps for mature IT”.


Implementing a release management solution in a traditional enterprise

Welcome to my fourth and final post in this blog post series on introducing DevOps in a traditional enterprise.

In my previous post I explained how the implementation of a configuration management system allowed us to get a better visibility on the different versions of the components that were built and on the new features they contained. In this post I will take it one step further and explain how we improved our release management efforts and more specifically why we needed orchestrated releases, how they compare to continuous deployments and how we selected and implemented a tool to facilitate us with these releases.

(For the sake of completeness, here are the links to the introduction and an analysis of the problems)

A need for orchestration

As a small reminder from my earlier posts, the applications in our company were highly integrated with each other and therefore most new features required modifications in multiple applications, each of which was managed by its own dedicated development team. A simple example: an application needs to display additional information about a financial instrument. This would require a modification not only to the application itself but also to the central data hub and to the application that receives this information from the outside and acts as the owner of this information.

This seemingly innocent fact had a rather big consequence: it resulted in the need for a release process that is capable of deploying the modified components of each of these impacted applications in the same go, in other words an orchestrated release process. If the components were deployed on different days they would break, or at least disrupt, the applications.

[Image: orchestra]

In fact, even if a feature can be implemented by only modifying components within one application, at least some degree of orchestration is necessary (a good example is a modification in the source code and one in the database schema). But as this would stay within the boundaries of one development team, it would be relatively easy to arrange that all components are deployed simultaneously.

Things become way more complicated when more than one application and development team are involved in implementing a feature, and even more so when multiple of these application-overlapping features are being worked on in parallel, modifying the same components. Such a situation really calls for a company-wide release calendar that defines the exact dates when components can be deployed in each environment. If we deploy all components at the same moment we are sure that all the dependencies between them are taken into account.

Unfortunately we all know that setting strict deadlines in a context as complex and as difficult to estimate as software development can cause a lot of stress. And a lot of missed deadlines as well, resulting in half-finished features that must somehow be taken out of the ongoing release and postponed to the next one. And a lot of dependent features that must be taken out as well. Or features that can stay in but were rushed so much to get them finished before the deadline that they lack the quality that is expected from them, causing system outages, urgent bug fixes that must be released shortly after the initial release, loss of reputation, and so on downstream in the process. And even if all goes according to plan, there is generally a lot of unfinished work waiting somewhere in the pipeline for one or another deployment (development-done but not release-done, also called WIP or work in progress).

What about continuous deployment?

Continuous deployment takes another approach to avoid all the problems caused by these dependencies: it requires the developers to support backward compatibility, in other words their component should work with the original versions as well as with the modified versions (those that include the new feature) of the other components, simply because they don’t know in advance in which order the components will be deployed. In such a context the developers can work at their own pace and deliver their components whenever they are ready. As soon as all the new versions of the components are released, a feature flag can be switched on to activate the feature.

There is unavoidably a huge drawback to this scenario: in order to support this backward compatibility the developers basically have to keep two versions of the logic in their code, and this applies to each individual feature that is being worked on. This requirement can be a big deal, especially if the code is not well organized. Once the feature is activated they should also not forget to remove the old version, to avoid the code base becoming a total mess after a while. If there are changes to the database schema (or other stateful resources) and/or data migrations involved, things become even more complicated. Continuous deployment is also tightly coupled to hot deployment, which introduces quite some challenges of its own, and if the company doesn’t have a business need to be up all the time that’s a bit of a wasted effort. I found a nice explanation of all the intricacies of such practices in this webinar by Paul Biggar at MountainWest RubyConf 2013.
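To make this dual code path idea a bit more concrete, here is a minimal sketch in Python. The flag store and the field names are hypothetical and not taken from any particular tool; the point is simply that the old and the new logic live side by side until every dependent component has been deployed and the flag can be flipped.

```python
# Minimal sketch of a backward-compatible dual code path guarded by a feature
# flag. FEATURE_FLAGS and the field names are invented for illustration.

FEATURE_FLAGS = {"show_instrument_rating": False}  # switched on once all components are released


def build_instrument_view(instrument: dict) -> dict:
    """Prepare the financial instrument data for display."""
    view = {"isin": instrument["isin"], "name": instrument["name"]}
    if FEATURE_FLAGS["show_instrument_rating"]:
        # New behaviour: depends on the data hub already providing the field.
        # Using .get() keeps the code working against older hub versions too.
        view["rating"] = instrument.get("rating", "n/a")
    # Old behaviour: the field is simply omitted. Once the flag has been on for
    # a while, this dual path (and the flag itself) should be cleaned up.
    return view
```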

But don’t get me wrong on this, continuous deployment is great and I really see it as the way to go but that doesn’t take away that it will take a long time for the “conventional” development community to switch to a mindset that is so extremely different from what everyone has been used to for so long. For this community (the 99%-ers ;-)) continuous deployment merely appears as a small dot far away on the horizon, practiced by whizkids and Einsteins living on a different planet. But hopefully, if we can gradually bring our conventional software delivery process under control, automating where possible and gradually increasing the release frequency, maybe one day the leap towards continuous deployment may be less daunting than it is today and instead just be the next incremental step in the process.

But until then I’m afraid we are stuck with our orchestrated release process, so we had better bring the processes and data flows under control and keep all the aforementioned problems it brings with it to a minimum.

Some words about the process of release coordination

Let us have a closer look at the orchestrated release process in our company, starting from the moment the components are delivered to the software repository (typically done by a continuous integration tool) to the moment they are released to production.

The first step was for the dev team to create a deployment request for their application. This request should contain all the components – including the correct version numbers – that implement the features that were planned for the ongoing release.

Each team would then send their deployment request to the release coordinator (a role on enterprise level) for the deployment of their application in the first “orchestrated” environment, in our case the UAT environment. Note that we also had an integration environment between development and UAT where cross-application testing – amongst other types of testing – happened but this environment was still in the hands of the development teams in terms of deciding when to install their components.

The release coordinator would review the contents of the deployment requests and verify that the associated features were signed-off in the previous testing stage, e.g. system testing. Finally he would assemble all deployment requests into a global release plan and include some global pre-steps (like taking a backup of all databases and bringing all services down) and post-steps (like restarting all services and adding an end-to-end smoke-test task). On the planned day of the UAT deployment, he would coordinate the deployments of the requests with the assigned ops teams and inform the dev teams as soon as the UAT environment was back up and running.

When a bug was found during the UAT testing, the developer would fix it and send in an updated deployment request, one where the fixed components got an updated version number and were highlighted for redeployment.

The deployment request for production would then simply be a merge of the initial deployment request and all “bugfix” deployment requests, each time keeping the last deployed version of a component. The exception is stateful components like databases, whose deployments are typically incremental, so all correction deployments that happened in UAT must be replayed in production. Stateless components, like the ones built from source code, are typically deployed by completely overwriting the previous version.
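To illustrate the merge rule described above, here is a small hypothetical sketch in Python: stateless components keep only their last deployed version, while stateful components (such as databases) keep every intermediate version so the incremental scripts can be replayed in production in the same order as in UAT.

```python
# Hypothetical sketch of deriving the production deployment request from the
# initial UAT request plus the bugfix requests. Data shapes are invented.

def merge_deployment_requests(requests, stateful_components):
    """requests: deployment requests in the order they were deployed to UAT,
    each a list of steps like {"component": "webapp", "version": "1.2.1"}."""
    replayed_steps = []   # stateful components: replay every deployment
    last_version = {}     # stateless components: keep only the last version
    for request in requests:
        for step in request:
            if step["component"] in stateful_components:
                replayed_steps.append(step)
            else:
                last_version[step["component"]] = step
    return replayed_steps + list(last_version.values())


initial = [{"component": "webapp", "version": "1.2.0"},
           {"component": "customer-db", "version": "1.2.0"}]
bugfix = [{"component": "webapp", "version": "1.2.1"},
          {"component": "customer-db", "version": "1.2.1"}]

# The database is deployed twice (1.2.0 then 1.2.1), the webapp only once (1.2.1).
print(merge_deployment_requests([initial, bugfix], {"customer-db"}))
```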

Again, the release coordinator would review and coordinate the deployment requests for production, similar to how it was done for UAT, and that would finally deliver the features into production, after a long and stressful period.

Note that I only touched the positive path here. The complete process was a lot more complex and included scenarios like what to do in case a deployment failed, how to treat rollbacks, how to support hotfix releases that happen while a regular release is ongoing, etc.

For more information on this topic you should definitely check out Eric Minick’s webinar on introducing uRelease, in which he does a great job explaining some of the common release orchestration patterns that exist in traditional enterprises.

As long as there were relatively few dev and ops teams and all were mostly co-located this process could still be managed by a combination of Excel, Word, e-mail, and a lot of plain simple human communication. However, as the IT department grew over time and became more spread out over different locations, this “artisanal” approach hit its limits and a more “industrial” solution was needed.

A big hole in the market

Initially we thought about developing a small release management tool in-house to solve these scaling problems but as soon as we sat down to gather the high-level requirements we realized that it would take our small team too much time and effort to develop it and to maintain it later on.

So we decided to go to the market and look for a commercial tool, only to be surprised by the lack of choice. Yes, you have the ITIL-based ITSM tools like BMC Remedy and ServiceNow, but these tools only touch the surface of release management. At the other side of the spectrum you have the CI or push-button triggered automated deployment tools that focus on the technical aspects but have little or no notion of orchestrated releases, let alone all of the requirements I mentioned above. Nonetheless, plenty of choice here, depending on your specific needs: XebiaLabs Deployit, ZeroTurnaround LiveRebel, Nolio, Ansible, glu, Octopus Deploy (.NET only), Rundeck, Thoughtworks Go, Inedo BuildMaster (the last two include their own CI server).

Instead, what we needed was a tool that would map closely to the existing manual processes and that could simply take over the administrative, boring, error prone and repetitive work of the release coordinator and provide an integrated view of the situation – linking together features, components, version numbers, releases, deployments, bug reports, environments, servers, etc – rather than a tool that would completely replace the human release coordination work in one step.

We were therefore very happy to find SmartRelease, a tool that nicely filled this gap. It was created by StreamStep, a small US-based start-up co-founded by Clyde Logue, a former release manager and a great person I had the opportunity to work with throughout the implementation of the tool. During this period the company was acquired by BMC and the tool was renamed Release Process Manager (RPM). But other than this tool nothing else came close to our requirements, something I found quite surprising considering the struggles that companies with many integrated applications typically have with their releases.

This was about three years ago. Looking at the market today, I see that some interesting tools have been introduced (or reached maturity) since then. Not only has SmartRelease/RPM been steadily extended with new features, but UrbanCode has also come up with a new tool, uRelease, Serena has pushed its Release Manager, and UC4 has its Release Orchestrator. Although none of them yet comes close to the ideal tool I have in mind (and I’m especially thinking about managing the dependencies between the features and components here), I don’t think it will take too long before one of them closes the gap.

On the other hand, where I do see a big hole in the market today is on the low-budget, do-it-yourself side. Looking at how these four companies present their tools, I noticed that none of them makes a trial version available that can be downloaded and installed without too much ceremony. This leads me to believe that they are mainly focusing on the “high-profile” companies, where the general sales strategy is to send out a battery of sales people, account managers and technical consultants to convince higher management of the need for their tools, so they can then send another battery of consultants to come in and do the implementation. If this manager is well aware of the situation on the work floor this may actually be a successful strategy.

For the devs or ops people on the work floor, the ability to play around with a tool, to set it up for a first project somewhere in a dark corner outside of the spotlights, and to see if it really brings any value to the company is invaluable. And once they decide to make their move and ask for the budget, they already have some real facts to prove their business case. I believe that there is a big opening here for tools that are similar to what Jenkins and TeamCity are to continuous integration: easily accessible, creating a community around them and as such making the concept – in this case release coordination – mainstream.

Putting the tool to work

Back to our chosen tool Release Process Manager, let us have a look at how exactly it helped us industrialize our release management process.

First of all, it allows the release manager (typically a more senior role than the release coordinator) to configure her releases, which includes specifying the applicable deployment dates for each environment. It also allows the developers to create a deployment request for a particular application and release, which contains a set of deployment steps, one for each component that must be deployed. These steps can be configured to run sequentially or in parallel, and in manual fashion (as in our case – thanks for not allowing access to the production servers, Security team ;-)) or automated fashion.

The deployment requests for a particular release can then be grouped into a release plan – typically after all deployment requests have been received – which allows for release-specific handling like adding pre and post deployment steps.

And finally, on the day of the deployment, the release plan is executed, going over each deployment step of each deployment request either sequentially or in parallel. For each manual step, the ops team responsible for the deployment of the associated component receives a notification and can indicate success or failure. For each automated step an associated script is run that takes care of the deployment.
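The structure that emerges from the last few paragraphs can be summarized in a small sketch. This is not RPM's actual data model or API, just an illustration of the concepts: global pre-steps, the deployment requests with their steps, and global post-steps, executed in order on the day of the deployment.

```python
# Illustrative release plan; names and steps are invented for the example.

release_plan = {
    "release": "2013-R4",
    "pre_steps": ["back up all databases", "stop all services"],
    "deployment_requests": [
        {"application": "CloudStore",
         "steps": [{"component": "webapp", "version": "2.1.0", "mode": "manual"},
                   {"component": "database", "version": "2.1.0", "mode": "manual"}]},
        {"application": "DataHub",
         "steps": [{"component": "etl-flows", "version": "1.7.2", "mode": "automated"}]},
    ],
    "post_steps": ["restart all services", "run end-to-end smoke test"],
}


def execute(plan):
    for step in plan["pre_steps"]:
        print(f"[pre] {step}")
    for request in plan["deployment_requests"]:
        for step in request["steps"]:
            # Manual steps notify the responsible ops team; automated steps run a script.
            action = "notify ops team" if step["mode"] == "manual" else "run deployment script"
            print(f"[{request['application']}] {step['component']} {step['version']}: {action}")
    for step in plan["post_steps"]:
        print(f"[post] {step}")


execute(release_plan)
```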

Here is a mockup of what a deployment request looks like:

[Image: mockup of a deployment request]

As part of the implementation project in our company the tool was integrated with two other already existing tools: the change management tool and the software repository:

It was notified by the change management tool whenever a change request (or feature) was updated. This allowed the tool to show the change requests that applied to that particular application and release directly on the deployment request, which made it possible for the release coordinator to easily track the statuses of the change requests and, for example, reject deployment requests that contained change requests that were not yet signed off.

It was notified by the software repository whenever a new version of a component was built which allowed the tool to restrict the choices of the components and their version number to only those that actually exist in the software repository.
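For what it's worth, here is a rough sketch of what these two inbound integrations boil down to, written as two notification handlers. The payload fields are invented for the example; the real tools push their own formats.

```python
# Hypothetical handlers for the two integrations described above.

change_requests = {}     # change request id -> details used on the deployment request
known_versions = {}      # component name -> set of versions present in the software repository


def on_change_request_updated(payload):
    """Called by the change management tool whenever a change request changes."""
    change_requests[payload["id"]] = {
        "application": payload["application"],
        "release": payload["release"],
        "status": payload["status"],      # e.g. "signed-off" or "in testing"
    }


def on_component_built(payload):
    """Called by the software repository whenever a new component version is built."""
    known_versions.setdefault(payload["component"], set()).add(payload["version"])


def selectable_versions(component):
    """Only versions that really exist can be put on a deployment request."""
    return sorted(known_versions.get(component, set()))
```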

Here is an overview of the integration of the release management tool with the other tools:

[Image: release management integration]

More generally, by implementing a tool for release management rather than relying on manual efforts it became possible to increase the quality of the information – either by enforcing that correct input data is introduced or by validating it a posteriori through reporting – and to provide better visibility on the progress of the releases.

As an added bonus all actions are logged which allows for more informed post-mortem meetings etc. It also opens the way to create any kind of reporting on statuses, metrics, trends, KPI’s etc. that may become necessary in the future.

In a later stage a third integration was created, this time from the configuration management tool to the release management tool, which allowed the configuration management tool to retrieve the versions of the components deployed in each environment and include them in the view of the component (see my previous post for more info on this topic).

Another feature that I found very interesting but which was not yet supported by the release management tool was the possibility to automatically create a first draft of the deployment request from the configuration management tool. I can easily imagine the developer opening up the configuration management view of his application, selecting the versions of the components he wants to include in his deployment request and pressing a “Generate deployment request” button. And if the latest versions that include change requests for the next release are selected by default, most of the time the effort could be reduced to just pressing the button.
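As a thought experiment, such a “Generate deployment request” button could look something like the sketch below. The `cms` object and its methods are entirely hypothetical; the idea is just to pick, for every component of the application, the latest version that implements a change request planned for the given release.

```python
# Hypothetical sketch of generating a draft deployment request from the CMS.

def generate_draft_deployment_request(cms, application, release):
    """cms.components(application) and cms.versions(component) are assumed helpers;
    versions are assumed to be returned in ascending order, each with the change
    requests it implements."""
    steps = []
    for component in cms.components(application):
        candidates = [version for version in cms.versions(component)
                      if any(cr["release"] == release for cr in version["change_requests"])]
        if candidates:
            # Take the most recent version that contributes to this release.
            steps.append({"component": component, "version": candidates[-1]["number"]})
    return {"application": application, "release": release, "steps": steps}
```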

Conclusion

To conclude this blog post series, let us now take a step back and see which of the problems that were identified initially (more details in this earlier post) got solved by implementing a configuration management tool and a release management tool:

  • The exponentially increasing need for release coordination, which was characterized by its manual nature, and the pressure it exerted on the deployment window => SOLVED
  • The inconsistent, vague, incomplete and/or erroneous deployment instructions => SOLVED
  • The problems surrounding configuration management: not being enough under control and allowing for a too permissive structure => SOLVED
  • The manual nature of testing happening at the end of a long process meaning that it must absorb all upstream planning issues – this eventually causes the removal of any not signed-off change requests => NOT SOLVED YET
  • The desire of the developers to sneak late features into the release, thereby bypassing the validations => SOLVED

Looks good doesn’t it? Of course this doesn’t mean that what was implemented is perfect and doesn’t need further improvement. But for a first step it solved quite a number of urgent and important problems. It is time now for these tools to settle down and to put them under the scrutiny of continuous improvement before heading to the next level.

Wow, it has taken me a lot more time than I initially estimated to write down my experiences in a blog post series, but I’m happy that I have finally done it. Ever since publishing my first post in the series I have received lots of feedback from people with all kinds of backgrounds, heard a lot of similar stories, and gained many new insights along the way.

I hope you have enjoyed reading the blog post series, hopefully as much as I have enjoyed writing it 🙂


First steps in DevOps-ifying a traditional enterprise

In this post I will talk about the first steps we took in “DevOps-ifying” the large traditional company I worked for, but before kicking off let me first briefly summarize the previous two posts (intro and problems) of the series:

I started off with a theoretical approach towards DevOps, explaining the three (*) business benefits:

  1. The business wants the software to change as fast as possible
  2. The business wants the existing IT services to remain stable
  3. The business wants to get these two benefits as cheaply and efficiently as possible

(*) I initially had only the first two but I added a third one to be aligned with chapter 7 of Gene Kim‘s “The Top 11 Things You Need To Know About DevOps” whitepaper (available on itrevolution.com after signing up)

The solution was to introduce DevOps on three levels: process (make it as rational and consistent as possible), tools (automate the process, especially where things must happen fast or in high volume) and culture (changing how people think, which is the most difficult and time consuming part).

I also added two remarks that I would like to keep in mind in the remainder of this post:

First of all, this territory of applying DevOps to a traditional enterprise is relatively new, so we should be very cautious with each step we take and check whether what we are trying to do really makes sense. What has worked for others (the modern companies) may not necessarily work for us because we have an existing structure to take into account.

And secondly, as fiddling with the software delivery process in a large enterprise is a huge undertaking that impacts many people in many departments, we should iterate in small steps towards a better situation and start by tackling the biggest and simplest problems first, optimizing and automating where appropriate. This way we give people the time to get used to the new process.

In my second post I talked about these “biggest and simplest” problems that existed in a traditional company:

  • The exponentially increasing need for release coordination, which was characterized by its manual nature, and the pressure it exerted on the deployment window
  • The inconsistent, vague, incomplete and/or erroneous deployment instructions
  • The problems surrounding configuration management: being too vague and not reliable due to the manual nature of keeping track of them
  • The manual nature of testing happening at the end of a long process meaning that it must absorb all upstream planning issues – this eventually causes the removal of any not signed-off change requests
  • The desire of the developers to sneak late features into the release, thereby bypassing the validations

These problems were caused (or at least permitted) by the way the IT department was organized, more precisely:

  • The heterogeneity of the application landscape, the development teams and the ops teams
  • The strong focus on the integration of the business applications and the high degree of coupling between them
  • The manual nature of configuration management, release management, acceptance testing, server provisioning and environment creation
  • The low frequency of the releases

Going back to the theoretical benefits of DevOps, in this case there was definitely a more urgent need for improving the quality and efficiency (the second and third benefit) than there was one for improving the speed of delivery (the first one).

Let us have a look now at the first steps we took in getting these problems solved: bringing configuration management under control.

It was quite obvious to me that we should start by bringing configuration management under control, as it is the core of the whole ecosystem. A lot of building blocks were already present; it was only a matter of gluing them together and filling in the remaining gaps.

This first step consisted of three sub-steps:

  1. Getting a mutual agreement on the structure of configuration management
  2. The implementation of a configuration management system (CMS) and a software repository – or definitive media library (DML) in ITIL terms
  3. The integration with the existing change management tool, build/continuous integration tools and deployment scripts

We decided to create three levels of configuration items (CI’s): on top, the level of the business application. Business applications can be seen as the “units of service” that the IT department provides to the business. Below it, the level of the software component. A software component refers to a deployable software package like a third party application, a web application, a logical database or a set of related ETL flows, and consists of all files that are needed to deploy the component. And at the bottom, the level of the source code, which consists of the files that are needed to build the component.

Here is a high-level overview of the three levels of CI’s:

[Image: the three levels of configuration management]

And here is an overview of the three levels of configuration management for a sample application, CloudStore:

[Image: configuration management example for CloudStore]

I believe that the decision on how to define the applications and components is a very important one that has the potential to make the rest of the software delivery process a lot harder if not done correctly. This is a whole subject on its own, but let me summarize it briefly by saying that an application should contain closely related source code (a good way of identifying relatedness is whether two files must regularly be changed together for the implementation of a feature), should not be “too big” and not be “too small”, considering that what is small and big depends completely on the context. A component can be seen as a decomposition of the application by technology, although it can be further split up, e.g. to reflect the structure of the development team.

CI’s are not static. They must change over time as part of the implementation of new features and these changes must be tracked by versioning the CI’s. So you could see a version number as a way of uniquely identifying a particular set of features that are contained within the CI.

Here is an example of how the application and its components are versioned following the implementation of a change request:

[Image: configuration management versions]

Once this structure was defined, the complete list of business applications was created and each application was statically (or at least “slowly changing”-ly (*)) linked to the components it was made up of, as well as to the development team that maintained them. Change requests that were created for a particular application could no longer be used to change components that belonged to a different application. And deployment requests that were created for a particular application could no longer be used to deploy components of a different application. In those cases a new change request or deployment request had to be created for the second application.

(*) It is fine to split up components that have become too big over time, or to link them to a different application or development team, but this should have an architectural or organizational foundation and not be driven by a desire to limit the number of deployment requests for the coming release.

So the resulting conceptual model looked as follows:

[Image: conceptual model]

Following the arrows: an application contains one or more software components. There can be multiple change requests per application and release, but only one deployment request, and it is used to deploy one, some or all of the software components.
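A rough translation of this conceptual model into code might look as follows. The class and attribute names are mine, not the tool's, but they capture the cardinalities described above.

```python
# Sketch of the conceptual model: applications, components, change requests
# and deployment requests. Names are illustrative only.

from dataclasses import dataclass, field
from typing import List


@dataclass
class SoftwareComponent:
    name: str            # e.g. "cloudstore-webapp"
    technology: str      # e.g. "java-web", "oracle-db"


@dataclass
class BusinessApplication:
    name: str            # e.g. "CloudStore"
    team: str            # the development team that maintains it
    components: List[SoftwareComponent] = field(default_factory=list)


@dataclass
class ChangeRequest:
    identifier: str      # e.g. "CR-1234"
    application: str     # a change request applies to exactly one application
    release: str


@dataclass
class DeploymentRequest:
    application: str     # only one deployment request per application and release
    release: str
    components: List[str] = field(default_factory=list)  # one, some or all components
```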

Time for a little side note on dependencies and impact analysis now.

Most of the time the features that were requested by the business required modifications to multiple applications (this had to do with the high degree of integration and coupling that existed). For example if the application had to show an additional field not only the application itself but also the central data hub and the provider of the data possibly had to be modified to send the underlying data to the application. One of the big consequences of limiting the change requests to apply to only one application is that such feature requests had to be split over multiple change requests, one for each impacted application. These change requests then had to be associated to one another and such an association was called a functional dependency.

Technical dependencies also exist: a change request that requires a particular component to be modified always depends on all preceding change requests for which this same component had to be modified. So even though two change requests could be functionally completely independent, as soon as they touch the same component a technical dependency is created.

Note that these dependencies are most of the time uni-directional: one change request depends on the other but not necessarily the other way around: if this other change request is implemented in a backward compatible way then it doesn’t depend on the first change request.

I already mentioned in my previous post that it happened regularly that change requests had to be removed from a release because the testers were not able to sign them off. In these cases all functionally and technically dependent change requests also had to be removed from the release (unless the dev team still had time to come up with a new version of the component with the code changes for the not-signed-off change request reverted, of course), so it was important that this information was tracked well throughout the development process.
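The impact analysis described here is essentially a graph traversal: starting from the removed change request, follow the functional and technical dependencies and collect everything that transitively depends on it. A minimal sketch, with invented change request identifiers:

```python
# Sketch of impact analysis over change request dependencies.

from collections import defaultdict

# "CR-B depends on CR-A" is stored as an edge A -> B: removing A impacts B.
dependents = defaultdict(set)


def add_dependency(dependent_cr, required_cr):
    dependents[required_cr].add(dependent_cr)


def impact_of_removing(change_request):
    """All change requests that must also be removed from the release."""
    impacted, to_visit = set(), [change_request]
    while to_visit:
        current = to_visit.pop()
        for child in dependents[current]:
            if child not in impacted:
                impacted.add(child)
                to_visit.append(child)
    return impacted


# Example: the data hub change (CR-2) depends on the producer change (CR-1),
# and the consuming application's change (CR-3) depends on the data hub change.
add_dependency("CR-2", "CR-1")
add_dependency("CR-3", "CR-2")
print(impact_of_removing("CR-1"))  # {'CR-2', 'CR-3'}
```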

OK let’s leave the world of dependencies and impact analysis behind us for now and switch our focus to the implementation of the configuration management process.

In terms of storage and management, each of these three levels of CI’s has its own solution.

Source code is stored and managed by version control systems. This level is well documented and there is good support by tooling (git, svn, TFS, …) so I will not further discuss it here.

On the level of the components, the files that represent the component (executables, libraries, config files, scripts, setup packages, …) are typically created from their associated source code by build tools and physically stored in the DML. All relevant information about the component (the application it belongs to, the context in which it was developed and built, …) is stored in the CMS.

The business application has no physical presence; it only exists as a piece of information in the CMS and is linked to the components it is made up of.

Here is an overview of the implementation view:

[Image: implementation view]

Once these rules were agreed, a lightweight tool was built to serve as both the CMS and the DML. It was integrated with the build tools in such a way that it automatically received and stored the built files after each successful build of a particular component. By restricting the upload of files exclusively to the build tools (which in their turn assure that the component has successfully passed the unit tests and a test deployment to a development server) at least a minimum level of quality was assured. Additionally, once a particular version number of a component was uploaded, it was frozen: attempts to upload a newer “version” of the component with a version number that was already present would fail.

The build tools not only sent the physical files, but also all the information that was relevant to the downstream processes: the person who built it, when it was built, the commit messages (a.k.a. check-in comments) since the previous version, the file diffs since the previous version, …
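A sketch of what the upload side of such a home-grown CMS/DML could look like is shown below. The storage layout and field names are invented; the essential rules are that only the build tools call it and that a version number, once stored, is frozen.

```python
# Hypothetical upload endpoint of the lightweight CMS/DML.

import os

DML_ROOT = "/var/dml"    # where the built artifacts are archived (example path)
cms_metadata = {}        # (component, version) -> build metadata from the build tool


class VersionAlreadyFrozen(Exception):
    pass


def upload_component(component, version, artifact_path, metadata):
    """metadata: who built it, when, commit messages and file diffs since the previous version."""
    target_dir = os.path.join(DML_ROOT, component, version)
    if (component, version) in cms_metadata or os.path.exists(target_dir):
        # Versions are immutable: re-uploading an existing version number is refused.
        raise VersionAlreadyFrozen(f"{component} {version} is already in the DML")
    os.makedirs(target_dir)
    os.replace(artifact_path, os.path.join(target_dir, os.path.basename(artifact_path)))
    cms_metadata[(component, version)] = metadata
```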

Note that a version typically contained multiple commits (continuous “build” was only optional for the development teams).

With this information the CMS was able to calculate some interesting pieces of information that used to be hard to manually keep track of before, namely the team that is responsible for doing the deployment and the logical server group (e.g. “Java web DEV” or “.NET Citrix UAT”) to deploy to. Both were functions of the technology, the environment and sometimes other parameters that were part of the received information. As these calculation rules were quite volatile, they were implemented in some simple scripts that could be modified on-the-fly by the administrator of the CMS whenever the rules changed.
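Such a rule script could be as simple as the following sketch. The team names, technologies and the rules themselves are invented; the point is that they are plain functions of the technology and the environment, and trivially editable when the rules change.

```python
# Hypothetical routing rules derived from the build metadata.

def responsible_ops_team(technology, environment):
    if technology in ("java-web", "dotnet-web"):
        return "Web hosting team"
    if technology == "oracle-db":
        return "DBA team"
    return "Generic ops team"


def logical_server_group(technology, environment):
    prefix = {"java-web": "Java web", "dotnet-citrix": ".NET Citrix"}.get(technology, technology)
    return f"{prefix} {environment.upper()}"  # e.g. "Java web DEV", ".NET Citrix UAT"
```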

The CMS also parsed the change request identifiers from the commit messages and retrieved the relevant details about them from the change management tool (another integration we implemented): the summary, type, status and the release for which they were planned.
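Extracting the identifiers is typically a matter of a small regular expression. A sketch, assuming identifiers of the form “CR-1234” (the real format in the change management tool may of course differ):

```python
# Sketch of pulling change request identifiers out of commit messages.

import re

CR_PATTERN = re.compile(r"\bCR-\d+\b")


def change_requests_in(commit_messages):
    found = set()
    for message in commit_messages:
        found.update(CR_PATTERN.findall(message))
    return sorted(found)


print(change_requests_in(["CR-1234: show rating field", "refactoring, no functional change"]))
# ['CR-1234']
```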

The presence of the core data that came from the build tools, combined with the calculated data and especially the data retrieved from the change management tool, transformed the initially quite boring CMS into a small intelligence center. It now became possible to see the components (and their versions) that implemented a particular change request, or even those that implemented any change requests for a particular release. (All of this assumes of course that the commit messages correctly contain the identifiers of the change requests.)

In the following mockup you can see how the core data of a component is extended with information from change management and operations:

[Image: mockup of the component view]

It’s also interesting to look at configuration management from a higher level, for example from the level of the application:

[Image: mockup of the application view]

By default all the components of the selected application were shown, and for each component all versions since the version currently deployed in production were listed in descending order. In addition, the implemented change requests were also mentioned. (Note that the mockup also shows which versions of the component were deployed in which environment; more on this in my next post about the implementation of a release management tool.)

In addition to this default view it was also possible to filter by release or just by an individual change request. In both cases only the relevant components and version numbers were shown.

The CMS also contained logic to detect a couple of basic inconsistencies (a sketch of these checks follows the list):

  • A particular version of a component didn’t implement any change requests (in that case it was still possible to manually associate a change request with the version)
  • A component version implemented change requests that were planned for different releases (would be quite difficult to decide when to deploy it right?)
  • A change request that was planned for a particular release was not implemented by any components
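A sketch of these three checks, assuming the CMS can hand us the change requests implemented by a component version and the change requests planned for a release:

```python
# Hypothetical consistency checks over the data collected in the CMS.

def check_component_version(component, version, implemented_crs):
    """implemented_crs: the change requests implemented by this version, each with its release."""
    issues = []
    if not implemented_crs:
        issues.append(f"{component} {version} implements no change request")
    releases = {cr["release"] for cr in implemented_crs}
    if len(releases) > 1:
        issues.append(f"{component} {version} mixes releases {sorted(releases)}")
    return issues


def check_release(release, planned_crs, implemented_crs):
    """planned_crs: ids planned for the release; implemented_crs: ids seen in any component version."""
    return [f"{cr} is planned for {release} but not implemented by any component"
            for cr in planned_crs if cr not in implemented_crs]
```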

All of this information facilitated the work of the developer in determining the components he had to include in his deployment request for the next release. But it was also interesting during impact analysis (e.g. after a change request was removed from a release) to find out more information about an individual component or about the dependencies between components.

This ability to do impact analysis when a change request had to be removed from a release was a big deal for the company. One of the dreams of the people involved in this difficult task was the ability to get a complete list of all change requests that (functionally or technically) depend on this change request. Although it was not actually developed initially, it would not be very difficult to do now that all the necessary information was available in the CMS. The same could be said about including more intelligent consistency checks: it’s quite a small development effort for sometimes important insights that could save a lot of time.

Finally, the deployment scripts of all technologies were adapted in such a way that they always retrieved the deployable files from the DML, thereby finalizing the last step of a fully controlled software delivery pipeline from version control tool to production.
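The common part that was added to each technology's deployment scripts is conceptually very small; a sketch, using the same invented DML layout as above:

```python
# Sketch of the shared "fetch from DML" step used by the deployment scripts.

import os
import shutil

DML_ROOT = "/var/dml"  # same (hypothetical) layout as used by the upload side


def fetch_from_dml(component, version, workdir):
    source_dir = os.path.join(DML_ROOT, component, version)
    if not os.path.isdir(source_dir):
        raise FileNotFoundError(f"{component} {version} is not present in the DML")
    target_dir = os.path.join(workdir, component)
    shutil.copytree(source_dir, target_dir)
    return target_dir  # the technology-specific deployment logic takes over from here
```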

Here is an overview of how the CMS and DML were integrated with the existing tools:

[Image: configuration management integration]

Conclusion

All in all, a small development effort resulted in a big step forward in bringing control to the software delivery process. With this core building block – which is what configuration management is – now in place, we can continue to build the rest of our castle on top of it.

You may have noticed that I have emphasized throughout this post the importance of keeping track of the dependencies between the change requests and the components so we are able to do proper impact analysis when the need arises. These dependencies are the cause of many problems within the software delivery process and therefore a lot of effort has to be put into finding ways to remove or at least limit their negative impact. The solution I mentioned here was all about visualizing the dependencies which is only the first step of the solution.

A much better strategy would be to avoid these dependencies in the first place. And the best way to do this is by simply decreasing the size of the releases, which comes down to increasing the frequency of the releases. When releases happen infrequently, the change requests typically stack up into a large pile and in the end all change requests are dependent on one another due to the technical dependencies that exist through their components. You remove one and the whole house of cards collapses. But increasing the release frequency requires decent automation, and this is exactly what we’re working on! Until the whole flow is automated and the release frequency can be increased, we have to live with this problem.

If we take this strategy of increasing the release frequency to its extremes we will end up with continuous delivery, where each commit to each component triggers a new mini-release, one that contains a one-step deployment, that of the component that was committed. No more dependencies, no more impact analysis, no more problems! Nice and easy! Nice? Yes! Easy? Maybe not so. Because this approach doesn’t come for free.

Let me explain.

With the batch-style releases at least you could assume that whenever the new version of your component is deployed into an environment it will find the new versions of all components it depends on (remember that most of our features required changes to multiple components). It doesn’t have to take into account that old versions of these components may still hang around. With continuous delivery, this assumption is not guaranteed anymore in my opinion. It’s now up to the developer to make sure that his component supports both the old and the new functionality and that he includes a feature flag to activate the feature only when all components are released. In some organizations (and I’m thinking about the large traditional ones with lots of integrated applications) this may be a high price to pay.

In my next post I will explain how we implemented a solution for release management.

I recently had the opportunity to talk about my experience in introducing DevOps in a traditional enterprise at the DevOpsDays in London, so let me conclude by giving you a link to the video and the slideshow of my talk.


A closer look at introducing DevOps in a traditional enterprise

In my previous post I already explained what I understand by the term DevOps: it is a way of improving the software delivery process by bringing together devs, ops, and a lot of other teams with the purpose of delivering changes fast while keeping a reliable environment.

I also gave a summary of how I used this DevOps mindset as a guide to solve one of the most urgent and visible problems that existed within the realm of software delivery in a large traditional enterprise I worked for in the past (and which is very typical of similar enterprises): the manual nature of the configuration and release management activities and the increasing challenges this brought to the efficient delivery of software. The solution was to rationalize the software delivery process and to implement configuration and release management tools to support the release management team.

But the manual nature of configuration and release management was not the only problem that stood in the way of an efficient software delivery process and not all of these problems were magically solved by this one solution either.

Before taking a closer look at the exact problems I’m talking about, let me first draw out the playing field by explaining how the different teams in the IT department were organized, what their roles and responsibilities were and how they interacted with one another.

This introduction should help in getting a better appreciation of the problems and will be a good starting point when we want to apply some root cause analysis to the problems we find. This is important because it may well be that the visible problem is only a consequence of a deeper-lying invisible (or less visible) problem. Most of the time it will not help to solve this visible problem, as the root problem will just find another way to manifest itself. And if we’re lucky, by solving the root problem some other visible problems that exist down the line may get solved automatically in the same go. So it’s always interesting to try to find the relationships between the various problems before jumping into solution mode.

Side note: If we continue to dig into the root cause analysis we should eventually end up at the level where the fundamental, strategic decisions are made on running the whole IT department or even the whole business. At this level, making changes towards a better software delivery process may have huge consequences for other parts of the business, so we should always be cautious about possible negative side effects.

OK, here I go with an overview of the teams involved in the software delivery process:

Teams

The Strategy and Architecture teams

As with so many other enterprises for which information technology is not the core business, the general rule for automating new business processes was “buy before build”: first look on the market to see if any business applications already exist that can do the job, and only if nothing is found evaluate the possibility of building one in-house.

As we will see, this short and simple rule has a big impact on the whole structure and organization of the IT department.

The first impact is on the heterogeneity of the environment: although there were some constraints these third party applications had to take into account before being considered – I’m thinking about the support for specific database vendors, application servers and integration technologies – it is the vendor who basically decides the (remaining) technology stack of the application, the internal design, the used libraries, the type of installation (manual or automated), the (lack of) testability, etc. With so many third party applications that together form the corporate application landscape the result is a very heterogeneous environment.

Secondly, a strong focus on the integration of these applications is needed in order to deliver a consistent service to the business side. The applications have their own specialized view of the world but at the same time must act as producers and consumers of each other’s data. This means that, in addition to simply transferring the data, at some point a transformation of this data must also be done from one world to the other to make them understand each other.

The implementation of these integrations was done with specialized development platforms. The data was initially transferred during nightly ETL batches, but due to the need for more up-to-date information EAI and messaging were gradually replacing these batches.

With the growth of the application landscape over time and as it became more and more difficult to keep track of the different data flows and of the ownership of the data (who is responsible for which data) a strategic decision had been made long ago to create a single system of record, a central data hub in other words, to provide a one-and-only “Source of Truth”. Applications that create data (or get them from the outside world) have to feed it into the data hub and applications that need data have to get it from the data hub. Architecturally this is definitely a sound decision that can simplify life a lot but software delivery wise it creates an interesting situation: each change that involves modifying the data flows from producer to consumer (which most of the changes do) automatically creates a dependency on the data hub’s development team, thereby coupling the otherwise unrelated changes, if not through their associated software components then at least on the project management level of the data hub’s development team.

Another strategic decision was the mutualisation of infrastructure: all in-house developed software that shared the same development technology had to be hosted on one shared server. Sharing infrastructure gives the best overall capacity (valleys in one app make up for peaks in another). It also keeps the total number of servers low, which rationalizes the maintenance effort of the ops teams. And new projects are quickly up to speed because there is no need to buy and set up infrastructure. But here as well a coupling is created between otherwise unrelated software components: if one misbehaves, the others are impacted.

Higher management generally preferred stability and compliance over agility and short feedback cycles. They were inspired by corporate process frameworks like CMMi, ITIL and TOGAF. Unfortunately, for some bizarre reason, these frameworks always seem to get implemented in a more bureaucratic, top-down, feedback-loop-allergic way than what I would consider ideal. On the other hand, there was less awareness of the latest evolutions happening on the work floor, like agile development, the DevOps movement, and open source tooling for automated testing, provisioning and monitoring – all concepts that tend to have a more bottom-up direction.

The Development Tools team

The Development Tools team was responsible for defining the standards and guidelines for the supported development technologies and for the implementation of all cross-functional needs of the development teams: the development tools, the version control systems, the build and deployment automations, framework components, etc.

Unfortunately, for the levels of configuration management that go beyond what version control systems have to offer, there was no tooling or automation in place. It was up to the development teams to keep track of which software component was part of which application and which version of which component was deployed in which environment. Most of the time this was done in Excel by the person who had last joined the team ;-). There was also no uniform place to store the different versions of the components: each build tool had its own dedicated software repository, and non-buildable artifacts like database scripts were even stored manually in whatever location was accessible to the ops team responsible for their deployment.
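
To give an idea of what such an Excel sheet had to capture, here is a minimal sketch of the two mappings the teams maintained by hand: which components make up which application, and which version of each component is deployed in which environment. The component names, versions and environments are hypothetical.

```python
# Hypothetical example of the bookkeeping that was kept by hand in Excel:
# application -> components, and (component, environment) -> deployed version.

applications = {
    "customer-portal": ["portal-webapp", "portal-db", "customer-etl"],
}

deployed_versions = {
    ("portal-webapp", "DEV"):  "1.4.0",
    ("portal-webapp", "UAT"):  "1.3.2",
    ("portal-webapp", "PROD"): "1.3.1",
    ("portal-db",     "UAT"):  "0.9.0",
}

def versions_in(environment):
    """List what is deployed in a given environment."""
    return {c: v for (c, env), v in deployed_versions.items() if env == environment}

print(versions_in("UAT"))  # {'portal-webapp': '1.3.2', 'portal-db': '0.9.0'}
# Keeping this consistent by hand, across 15 development teams, is exactly
# where things went wrong.
```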

Code reviews, although part of this team's objectives, were done very infrequently due to a lack of time (in practice they only happened a posteriori, after something bad had already occurred). As a consequence the standards and guidelines were not always followed. The team also had limited involvement in the hiring process of new developers. Both shortcomings acted as catalysts for the increasing heterogeneity of the development teams as well as the code base.

The Development teams

There were about 15 development teams, each one covering the IT needs of one business domain. As I mentioned before, most of these teams focused primarily on the integration of the third-party applications; only a few of them worked on completely in-house built applications.

The developer profiles were quite diverse: there were the gurus as well as the relative novices, the passionate "eat-sleep-dream-coders" as well as the "9-to-5" types, the academics as well as the cowboy pragmatists, the "cutting-edge" types as well as the "traditional" types. On top of that there was job-hopping and a mix of cultures and languages, making it quite a diverse community. But despite all these apparent differences, each team was relatively successful in finding its way to cooperate and deliver value.

The unfortunate side effect of this was that the internal processes and habits varied considerably from team to team, a fact that also surfaced in their varying needs towards configuration management and release management: some wanted it as simple as possible, others wanted to tweak it to the extreme in order to facilitate their exotic requirements.

But next to these differences, there were also quite some – unfortunate – similarities in problem areas between the teams. A lack of logging and unit testing was widespread, as was the struggle to keep track of all the files and instructions that had to be included in the next release. Development and delivery in small increments was only done by a small number of teams; dark launches and feature flags were not used by any team.

The development teams responsible for the core applications had a rotating 24/7 on-call system to make sure that someone was available to help out the ops guys whenever an emergency occurred. During the weekend releases a senior member of each development team would be on-site to help solve potential deployment issues and keep the release as efficient as possible.

The Change Management team

The change management process was inspired by ITIL and combined with the CMMi-based software development life cycle into an automated workflow. A whole range of statuses and transitions existed between the creation and the final delivery of the change request.

The change manager's role was to review and either approve or reject each change request when it was assigned to a release (releases were company-wide and happened monthly, see more in the next section) and before it was deployed to the UAT and production environments. The approvals depended on formality checks (is everything filled in correctly, are the changes signed off by the testers, did they register their test runs, ...) and on ops availability, and to a much lesser degree on content and business risk assessment.

All of this together made the creation and follow-up of a change request quite a formal and time-consuming experience. As a consequence, for most of the life cycle of a change there was a disconnect between reality and what the tool showed, because people forgot to update it.

The Release Management team – release planning

The corporate release strategy, as decided by the release management team, was to have company-wide (a.k.a. orchestrated) monthly releases with cold deployments to production during the weekend. A release was scheduled as a set of phases with milestones in between: development and integration, UAT testing, and deployment to production. Changes that for some reason didn't fit in this release schedule could request an on-demand, application-specific release. The same went for bug fixes.

There was a static set of environments that were shared between all applications. The "early" environments allowed the development teams to deploy their software whenever they needed to, while for the "later" environments the release management team acted as the gatekeeper, with the gates only opening during the predefined deployment windows foreseen in the release schedules.

It was very difficult (and expensive) to set up new environments due to the cost of setting them up manually, the cost of keeping them operational, licensing costs, and the high storage needs of the databases (see further: testing on a copy of production data). It was also very hard to exactly duplicate existing servers because there was no reliable way of tracking all previous manipulations on them. As a consequence, the test teams had to share the existing environments, which was regularly a source of interference, especially when major changes to the application landscape – which typically span more than one release cycle – had to be tested in parallel with the regular changes.

The Release Management team – release coordination

The release coordinator received the deployment requests from the developers and assembled them into a release plan. Although the release coordinator was supposed to validate the deployment instructions before accepting them in the release plan, in practice only a quick formality check was done due to a lack of time.

The deployment requests were documented in Word and contained the detailed instructions on how to deploy the relevant software components, config files, etc. The deployment request was limited to one development team, one application and one or more change requests.

During the day of the deployment the release coordinator then coordinated the deployments of the requests with the different ops teams.

This whole flow of documents from the developers through the release coordinator to the ops teams was handled through plain emails, as was the feedback from the ops teams: a simple confirmation if all went OK, but whenever something unexpected occurred the mail thread could quickly grow in size (and recipients) while bouncing from one team to the other.

The release coordinator organized pre-deployment and post-deployment meetings with the representatives of the development and ops teams. The former to review the release plan he had assembled, the latter to do no-blame root cause analysis of encountered issues and to improve future releases.

A technically oriented CAB was organized on the Monday before the production deployment weekend and included representatives of the development teams and ops teams, the infrastructure architects, the change manager and the release coordinator. The purpose of this meeting was to go through the change requests that were assigned to the coming release and to assess their technical risk before giving them a final GO/NO-GO. The business side was typically not represented in this meeting, so the business risk assessment done here was limited.

The Operations teams

The operations department consisted of several teams that each were responsible for a specific type of resource: business applications, databases, all types of middleware, servers, networks, firewalls, storage, etc.

Some of these teams were internal, others were part of an external provider. The internal teams were in general easily accessible (not least because of their physical presence close to the development teams) and had a get-the-job-done mentality: if a particular deployment instruction didn't work as expected they would first try to solve it themselves using their own experience and creativity, and if that didn't do it they would contact the developer and ask him to come over and help solve the problem.

The external teams, on the other hand, worked according to contractually agreed SLAs, their responsibilities were well defined and the communication channels were more formal. As a result there was less personal interaction with the developers. Issues during the deployments – even minor ones – would typically be escalated, and the deployment would be put on hold until a corrective action was requested.

It was not always simple for developers to find out which ops team was responsible for executing a particular deployment instruction. Regular commands, for example, had to be executed by team X, while commands that required root access had to be executed by team Y. Or database scripts were executed by DBA team X in UAT but by DBA team Y in production. On top of that, these decision trees were quite prone to change over time. And just like in the development teams, not all members of the ops teams had the same profile or experience.

The Test teams

The testers were actually part of the development teams, which meant that there was no dedicated test team.

Although some of the development teams used unit testing extensively to test the behaviour of their applications, integration and acceptance testing was still a very manual activity. There were a couple of reasons for this. To start with, some of the development tools and many of the business applications just didn't lend themselves easily (at least not out-of-the-box) to setting up automated testing. Secondly, there was limited experience within the company with setting up automated testing. Occasionally someone would attempt to automate the testing of a small piece of software in a corner somewhere, but without the support of the rest of the community (developers, ops people, management, ...) such an initiative would soon die out.

Another phenomenon I noticed was that the testers preferred to test on a copy of the production data (not even a subset of the most recent data, but a copy with the full history). I'm sure it takes less effort to spot inconsistencies in a context that closely resembles real life, but there are some major disadvantages to this approach as well. First of all, the disk space (especially in a RAID5 setup) needed by the databases doesn't come for free. Secondly, there is the long waiting time when copying the data from production to test or when taking a backup before a complex database script is executed. And finally, and maybe most importantly, there is the "needle in a haystack" feeling that comes up whenever a developer has to search for the root cause of a discovered issue while digging through thousands or even millions of records.
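
For completeness, a minimal sketch of the alternative hinted at above – loading only a recent subset of the production data into the test database – using in-memory SQLite databases as stand-ins. The table, columns and the 90-day cutoff are made up.

```python
# Hypothetical sketch of building a test database from a recent subset of
# production data instead of a full copy (table and column names are made up).
import sqlite3
from datetime import date, timedelta

cutoff = (date.today() - timedelta(days=90)).isoformat()

prod = sqlite3.connect(":memory:")   # stand-in for the production database
test = sqlite3.connect(":memory:")   # stand-in for the test database

prod.execute("CREATE TABLE orders (id INTEGER, created_on TEXT, amount REAL)")
prod.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "2010-01-15", 99.0), (2, date.today().isoformat(), 42.0)])

test.execute("CREATE TABLE orders (id INTEGER, created_on TEXT, amount REAL)")
recent = prod.execute("SELECT * FROM orders WHERE created_on >= ?", (cutoff,)).fetchall()
test.executemany("INSERT INTO orders VALUES (?, ?, ?)", recent)

print(test.execute("SELECT COUNT(*) FROM orders").fetchone())  # only the recent rows
```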

The OTHER teams

I will not go into detail on the Security, Procurement, PMO, Methodology and Audit teams because they are typically less involved with the delivery of changes to existing applications (the focus of this post) than with the setup of new projects and applications.


Side note: So why did we have company-wide releases and why did they happen so infrequently?

The reasons can be found in the decisions that were taken in the other teams:

    • Due to the design of the functional architecture (in particular the many third-party applications and the central data hub) most changes had an impact on multiple applications simultaneously, so there was a need for at least some orchestrated release effort anyway
    • Due to the manual nature of the deployments there was a financial motivation to rationalize the number of weekends that the ops teams had to be on-site
    • Due to the manual nature of testing there was a financial (and in this case also a planning) motivation to combine the changes as much as possible into one regression testing effort
    • Avoiding the need for application deployments every weekend freed up the other weekends for doing pure infrastructure-related releases

OK that was it for the explanation of the teams, time for a little recap now:

In the previous sections we have seen that the IT department was characterized by the following aspects:

  • The heterogeneity of the application landscape, the development teams and the ops teams

  • The strong focus on the integration of the business applications and the high degree of coupling due to the sharing of the data hub and the ETL/EAI platform

  • The manual nature of configuration management, release management, acceptance testing, server provisioning and environment creation

  • The fact that the IT processes are governed by corporate frameworks (CMMi, ITIL, TOGAF)

Although I have no real-life experience yet with working in a "modern/web 2.0/startup-type" company, from what I have read and heard they are more homogeneous in technology as well as in culture, they usually work on one or a few big, componentized (but decoupled) products, they were built with automation in their DNA and they are guided by lightweight, organically grown IT processes. All in all very different from the type of company I just described. It seems to me that although both types of companies (the traditional and the modern) have in common that they want to make money by strategically using IT, they differ in the fundamentals on which their IT departments are built.

Now that you have a better understanding of the roles and responsibilities that were involved in the world of software delivery let us have a look at the biggest problems that existed.

Problems

Release coordination

As already mentioned, a release consisted of a set of phases. One of them was the UAT phase, which took between one and two weeks. During this phase the software was deployed and acceptance tested. The time foreseen for the deployment of the software is referred to as the 'deployment window'.

It turned out that the deployments consistently started to run outside of this deployment window, thereby putting more and more stress on the downstream planning.

The reason for this is the following. The development and ops teams had been growing more or less linearly over the last couple of years, but because the complexity of the release coordination (and in turn the effort needed) depends on the number of connections between these two sides, the release management team would have had to grow much faster than that to keep up with the pace. And due to the manual nature of the release coordination work this added complexity and effort translated into disproportionately more time needed. I know this explanation is far from scientific, but you get the point.
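
A back-of-the-envelope illustration of that reasoning (the team counts are invented): if the coordination effort is roughly proportional to the number of dev-team/ops-team combinations, then linear growth on both sides makes the number of combinations – and the manual coordination work – grow much faster than linearly.

```python
# Back-of-the-envelope: coordination effort assumed proportional to the number
# of dev-team / ops-team combinations (team counts are hypothetical).
for year, dev_teams, ops_teams in [(2009, 8, 4), (2010, 11, 5), (2011, 15, 7)]:
    print(year, dev_teams * ops_teams, "coordination paths")
# 2009 -> 32, 2010 -> 55, 2011 -> 105: both sides grow roughly linearly,
# but the number of paths (and the manual coordination work) grows much faster.
```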

Deployment instructions

One of the long-standing problems – and a source of friction and frustration for whoever came in touch with it – was the way in which the deployment instructions were written.

First of all, the instructions lacked standardization. Even for the technologies that had build and deployment scripts, some people wrote the instructions as the complete command to be run (e.g. "deploy.sh <component name> <version>"), others as pure descriptive text (e.g. "please deploy version 1.2.3 of component XYZ"), and in some rare cases the deployment script was even replaced by a custom deployment process. Note that this problem was addressed at some point by requiring the use of "template" instructions for the core development technologies, consisting of a predefined structure with a field for each parameter of the script.
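
For illustration, a minimal sketch of what such a template instruction could boil down to once it is reduced to the parameters of the existing deployment script. The component name, the exact fields and the environment are hypothetical.

```python
# Hypothetical "template" deployment instruction: instead of free-form text in a
# Word document, only the parameters of the existing deploy script are filled in.
instruction = {
    "technology": "java-webapp",
    "script": "deploy.sh",          # the standard script mentioned above
    "component": "portal-webapp",   # hypothetical component name
    "version": "1.2.3",
    "target_environment": "UAT",
}

# The ops side (or, later, an automation) can render the exact command from it:
command = "{script} {component} {version}".format(**instruction)
print(command)  # deploy.sh portal-webapp 1.2.3
```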

The instructions for technologies that had no build or deployment scripts also lacked standardization regarding the location of the source files: sometimes they were made available on a shared location (writable by everyone within the company), other times they were shared from the developer's PC or pre-uploaded to the server. A closely related problem was that the links to these files sometimes referred to non-existent or outdated files.

Many instructions were also too vague or required too much technical knowledge of the technology or the infrastructure for the ops teams to understand them correctly.

Regularly, steps were missing from the instructions, which caused the application to fail during the restart. It could be a whole component that was not updated to the intended version, or simply a directory that was not created.

And finally there were the minor errors: copy/paste mistakes, typos, the Word document that cuts a long command in two and adds a space, etc. You name it, I have probably witnessed it 😉

We can summarize this problem by saying that the dev and ops teams lacked the mutual understanding that is necessary to communicate efficiently with each other; in other words, the ops teams were not able to make the assumptions that were implicitly expected by the dev teams.

No matter what the source of the bad or wrong deployment instructions was, one thing was certain: it would take a lot of effort and time from everyone involved to get the instructions corrected. Not to mention the frustration, the friction and the bad image it gave to the whole software delivery experience.

What was even more destructive was that, after too many troubleshooting efforts and the accompanying corrective actions, the exact steps to deploy the application got lost in the dust and a new snowflake server was born.

Configuration management

Here we arrive at the core of the whole ecosystem and also my favorite subject: configuration management. It represents the foundation on which the castle is built so the smallest problem here may have major consequences further on.

To start with, there was no clear list of all business applications that existed within the IT department – what ITIL calls a service catalog – and neither was it clear who was the business owner or the technical owner of each application. There was also no clear view on which components an application was exactly made up of.

Note that by (business) application I mean the higher-level concept that delivers business value, while by (software) component I mean a deployable software package like a web application, a logical database, a set of related ETL flows or BO reports, etc. An application is made up of a set of components.

We already saw that many processes were manual. One of the consequences was that this allowed for too much freedom in the structure of configuration management:

  • Different teams were able to work on the same component (an issue of code ownership)
  • The same deployment request sometimes contained the instructions for components of more than one application
  • Components were sometimes moved from one application to another to avoid creating too many deployment requests
  • Component version numbers were sometimes recycled for later versions, which made it impossible to unambiguously link a version number to a specific feature set

There were many other issues that were very specific to this particular context but it would lead us too far to explain them here.

All these issues made it very hard or even impossible to keep track of which code base (or version number) of an application was linked to which feature set, which one was deployed in the test environments, which one was signed off by the testers and which one was allowed to go into production.

Testing/QA

The UAT phase took between one and two weeks, during which the software had to be deployed and the changes tested and hopefully signed off by the testers.

Due to the aforementioned problems the deployments took considerably more time than initially planned for, which reduced the period during which the tests could be run.

Moreover, testing was a manual process, and even with a full testing period the test workforce was too small to cover all edge cases and regressions.

For these two reasons it happened that testers were not able to sign off the change requests before the deadline. These change requests (and the components that were modified to implement them) then had to be removed from the release, which in turn resulted in the need for a last-minute impact analysis to identify and remove all related change requests. The effort required for this analysis was worsened by the many integrations (i.e. there was a high degree of coupling between the changes and components), the huge size of the releases (in the end every component becomes directly or indirectly dependent on every other component), the lack of configuration management and, arguably the most dangerous of all, the lack of time to retest the changes that were kept in the release.

Hacking

Due to the pressure from the business to deliver new features and due to the low release frequency, developers tended to look for ways to bypass the restrictions imposed by the release management process.

One of these "hacks" was to sneak new features into the release just before the deployment to production by overwriting the source files with an updated version while keeping the same version number as the tested and signed-off one.

Due to a lack of branching support in some technologies it also happened (especially as part of urgent bug fixes) that individual files were sent over to the ops teams instead of the whole deployable unit.

All of these non-standard activities added to the difficulty of doing proper configuration management.

Conclusion

OK, that's about all of the high-priority problems that existed in the company. Now we are armed with the proper knowledge to start looking for solutions.

Congratulations if you made it this far! It means that you are – like me – very passionate about the subject. It was initially not my intention to create such a long and heavy post, but I felt it was the only way to give you a real sense of the complexity of the subject in a real environment.

In my next post I will explain how we were able to bring configuration management under control and how we implemented a release management tool on top of it that used all the information already available in the other systems to facilitate the administrative and recurring work of the developers, release coordinator and ops teams, so that they could concentrate on the more interesting and challenging problems.

I will also take a closer look at the possible solutions that exist to tackle the remaining problems in order to gradually evolve to our beloved “fast changes & reliable environment” goal.


My experience with introducing DevOps in a traditional enterprise

Last week I finished reading the book The Phoenix Project by Gene Kim and I really enjoyed it. It describes the adventures of a guy named Bill who almost single-handedly (although he gets the help of the demigod Erik) saves his company from bankruptcy by "industrializing" the IT department and in particular the delivery of new software from development to operations (in short: DevOps).

Having worked in a similar context for the past years I could easily identify with Bill, and the book has therefore given me great insight into the various solution patterns that exist around this topic.

I have to agree, however, with the IT Skeptic on some of the criticisms he wrote in his review of the book:

Fictionalising allows you to paint an idealised picture, and yet make it seem real, plausible. Skeptics know that humans give disproportionate weight to anecdote as evidence.

And

It is pure Hollywood, all American feel-good happy endings.

The Phoenix Project wants you to believe that within weeks the staff are complying with the new change process, that business managers are taking no for an answer, that there are simple fixes to complex problems.

So basically he doesn't believe that the book does a good job of representing reality.

That's when I figured it might be interesting for others if I published my own experiences in this area, in an attempt to fill the gap between Gene's fairy tale and the IT Skeptic's reality.

The theory

Let me start by giving you my view on this whole subject of DevOps (highly influenced by the excellent presentation of John Allspaw and Paul Hammond of Flickr):

  1. The business – whether that be the sponsor or the end user – wants the software to change, to adapt to the changing world it represents, and it wants this to happen fast (most of the time as fast as possible)
  2. At the same time, the business wants the existing IT services to remain stable, or at least not be disrupted by the introduction of changes

The problem with the traditional software delivery process (or the lack thereof) is that it is not well adapted to supporting these two requirements simultaneously. So companies have to choose between either delivering changes fast and ending up with a messy production environment, or keeping a stable but outdated environment.

This doesn't work very well. Most of the time they will still want both, and therefore put pressure on the developers to deliver fast and on the ops guys to keep their infrastructure stable. It is no wonder that dev and ops start fighting each other to protect their objectives, and as a result they gradually drift away from each other, leaving the Head of IT stuck somewhere inside the gap that arises between the two departments.

This picture pretty much summarizes the position of the Head of IT:

You can already guess what will happen to the poor man when the business sends both horses in different directions.

The solution is to somehow redefine the software delivery process in order to enable it to support these two requirements – fast and stable – simultaneously.

But how exactly should we redefine it? Let us look at this question from the point of view of the three layers that make up the software delivery process: the process itself, the tooling to support it, and the culture (i.e. the people who use it).

Process

First of all, the process should be logical and efficient. It's funny how sometimes big improvements can be made just by drawing the high-level process on a piece of paper and removing the obvious inconsistencies.

The process should initially take into account the complexities of the particular context (like the company's application landscape, its technologies, its team structure, etc.), but in a later stage it should be possible to adapt the context in order to improve the process wherever the effort is worth it (e.g. switching from a difficult-to-automate development technology to an easy-to-automate one).

Tools

Secondly, especially if the emphasis is on delivering changes fast, the process should be automated where possible. This has the added benefit that the produced data has a higher level of confidence (and will therefore more easily be used by whoever has an interest in it) and that the executed workflows are consistent with one another and not dependent on the mood or fallibility of a human being.

Automation should also be non-intrusive with regards to human intervention. By this I mean that whenever a situation occurs that is not supported by the automation (e.g. because it is too new, too complex to automate and/or happens only rarely), it should be possible for humans to take over from there and, when the job is done, give control back to the automation. This ability doesn't come for free but must explicitly be designed for.
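
A minimal sketch of what such non-intrusive automation could look like (the step names and the confirmation mechanism are invented): every step in a release plan is either executed by an automation or explicitly handed over to a human, and the flow resumes once the human confirms the step is done.

```python
# Sketch of automation that can hand a step over to a human and take back
# control afterwards (step names and the confirmation mechanism are made up).

def automated_db_deploy():
    print("running database deployment script ...")

release_plan = [
    ("deploy database scripts", automated_db_deploy),  # automated step
    ("update firewall rules", None),                    # no automation yet: manual
]

def run(plan):
    for name, automation in plan:
        if automation is not None:
            automation()
        else:
            # Hand over to a human; in a real tool this would be a task in a UI,
            # here a simple prompt stands in for it.
            input(f"Manual step '{name}': press Enter when done... ")
        print(f"step '{name}' completed")

if __name__ == "__main__":
    run(release_plan)
```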

Culture

And then there is the cultural part: everyone involved in the process should have at least a high-level understanding of the end-to-end flow and in particular sufficient knowledge of their own box and all interfacing boxes.

It is also well known that people resist change. It should therefore not come as a surprise that they will resist changes to the software delivery process, a process that has a huge impact on the way they work day-to-day. There can also be a trade-off: some changes are good for the whole (the IT department or the company) but bad for a unit (a developer, a team, ...). An example would be a developer who has to switch to Oracle because that's what the other teams have experience with, while he would rather stick with MySQL because that's what he knows best. Such changes are particularly difficult to introduce, so they should be treated with care. But most people are reasonable by nature and will adapt as soon as they understand the bigger picture.

OK so now we know how we should upgrade the process.

But are we sure that it will bring us the expected results? Is there any evidence? At least the major tech companies like Flickr (see the link above), Facebook (here), Etsy (here), Amazon (here), etc. show us that it works in their particular case. But what about the traditional enterprises, with their legacy technologies that lack support for automation and testability, a workforce that doesn't necessarily know or embrace these modern concepts, and their heavy usage of frameworks that are known to get bureaucratic, top-down implementations (CMMi, ITIL, TOGAF, ...)? Can we apply the same patterns in such an extremely different context and hope for the same result? Or are these traditional companies doomed, waiting to be replaced by modern companies that were built with agility and automation in mind? In other words, is there a trail from Mount Traditional to Mount Modern?

I don't know the answer, but I don't see why it would not work, and there is not really the option of not at least trying it, is there? See also Jez Humble's talk and Charles Betz's post on this subject.

My experience

OK, let us leave the theory behind us and switch our focus to my own experience with introducing DevOps in a traditional enterprise in the financial sector.

First a bit about myself: in this traditional enterprise I was responsible for the development community. My role was to take care of all cross-functional stuff like the administration of the version control tools, build and deployment automation, defining the configuration management process, etc. I was also the representative of the developers towards the infrastructure, operations and release management teams.

Configuration and release management had always remained mostly manual and labor-intensive activities that were known to be among the most challenging steps in the whole process of building and delivering the software. Due to the steady growth of the development community and the increasing complexity of the technologies, the problems surrounding these two activities grew exponentially over time, causing a major bottleneck in the delivery of software. At some point it became clear to me that we had to solve this problem by doing what we are good at: automation. Not for the business this time, but for ourselves: the IT department. So I set out on a great adventure to get this issue fixed ...

It became clear very soon that without bringing configuration management under control, it would not be possible to bring control to release management, because the latter relies so heavily on the former. And thanks to this effort we would also finally be able to solve a number of other long-standing issues that all required proper configuration management but didn't seem important enough on their own to initiate it.

So in the end the project consisted of the following actions:

  • getting a mutual agreement between all teams on the global software delivery process to be used within the company (including the definition of the configuration management structure), using ITIL as a guide
  • the implementation of a release management tool
  • the implementation of a configuration management system (CMS) and a software repository (which is called a definitive media library or DML in ITIL terms)
  • the integration with the existing build and deployment automations and the change management tool

Note that I use the term configuration management in the broader sense used by Wikipedia (here) and ITIL, and not in the sense of server provisioning or version control of source code: the definition of the configuration items, their versions and their dependencies. This information is typically contained in a CMS.
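
To make that definition concrete, here is a minimal sketch of the kind of information a CMS holds in this sense (all items, versions and the artifact URI are hypothetical): configuration items, their versions and their dependencies, with the binaries themselves living in the software repository (the DML).

```python
# Hypothetical sketch of the information held in a CMS: configuration items,
# their versions and their dependencies (the DML would hold the actual binaries).
configuration_items = {
    ("portal-webapp", "1.3.2"): {
        "type": "web application",
        "depends_on": [("portal-db", "0.9.0"), ("customer-etl", "2.1.0")],
        "artifact": "dml://portal-webapp/1.3.2/portal-webapp.war",  # made-up URI
    },
    ("portal-db", "0.9.0"): {"type": "logical database", "depends_on": [], "artifact": None},
    ("customer-etl", "2.1.0"): {"type": "ETL flows", "depends_on": [], "artifact": None},
}

def impact_of(item):
    """Which configuration items depend (directly) on the given one?"""
    return [ci for ci, data in configuration_items.items() if item in data["depends_on"]]

print(impact_of(("portal-db", "0.9.0")))  # [('portal-webapp', '1.3.2')]
```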

By release management I mean the review, planning and coordination of the deployment requests as they are delivered from the developers to the operations teams.

This is my mental model of these concepts:

Fiddling with the software delivery process in the big and heterogeneous environments that can be found in traditional enterprises is a huge undertaking that has impacts in many areas, within many departments and on many levels. It was therefore out of the question to transition from a mostly manual process to a fully automated one in one step. Instead we needed a tool that could support the release management team in their work but that would also allow it to gradually take on more responsibilities as time goes by and help the transition towards a more end-to-end automated approach.

During my research I was quite surprised that so few tools were available that aim to support release management teams, and I believe the market was just not ready for it yet. Also, in traditional enterprises development has generally become more complex, more diverse, more service-oriented, etc. in the last years, and the supporting tools now need some time to adapt to this new situation.

(For those interested: after a proof of concept with the few vendors that claimed to have a solution we chose Streamstep’s SmartRelease, which was acquired by BMC and renamed to RPM during the project.)

The culture also had to be adapted to the new process and tooling. This part involves changing people's habits and has proven to be the most challenging and time-consuming part of the job. But these temporary frictions were dwarfed by the tremendous gains we got from the automation: less manual work, simplified processes, better visibility, trustworthy metrics, etc.

By implementing a clear process that was unavoidably more restrictive – and above all enforceable by the tooling – I noticed that in general the developers resisted it (because they lose the freedom to do things their own way, and believe me, there were some exotic habits in the wild) while the ops guys embraced it (because their job becomes easier).

Conclusion

As I already mentioned, I don't know if it is possible to DevOps-ify traditional enterprises the way web 2.0 companies do it (namely by almost completely automating the software delivery process and installing a culture of continuous delivery from the very beginning). But it may make sense not to look too far ahead into the future and instead start by tackling the biggest and simplest problems first, optimizing and automating where appropriate, and as such iterating towards a better situation. And hope that in the meantime a web 2.0 company has not run away with their business ;-)

Can traditional enterprises adapt to the modern world or will they become the equivalent of the dinosaurs and go extinct?

In my next post I will zoom in one level deeper into the reasoning behind these decisions. I will first give an overview of the situation that existed before the project was initiated, looking at the specifics of each team. Then I will focus on the problems that derived from it and the solutions that followed.
