Improving Configuration Management: Getting Control Over Drift

Configuration drift poses a number of challenges to your IT organization and your business, for example the risk of non-compliance with security policies, performance and availability issues, and failed deployments of new application releases.

To address drift, most IT organizations now employ some combination of scripts (or automation tools), a configuration management database (CMDB), and a defined software configuration management approval process.  Despite these efforts, we find that configuration drift still occurs frequently in large enterprises.

Why is this the case?

First, if you are like most IT organizations, you probably follow the 80/20 rule: your administrators focus 80% of their time on the configuration elements they consider most important to their roles, and that leaves quite a gap where drift can still occur.  What’s more, if you are using scripts and automation tools to enforce configurations, keep in mind that these approaches rely on an explicit formula, meaning you have to specify exactly which configuration settings to enforce and when.  Any setting you haven’t gotten around to specifying can still be changed, and additional software can still be installed, and either one might cause problems.

For example, let’s say that your security policy states that a certain range of TCP/IP ports should not be open on a certain class of servers.  You might reinforce this policy with an automation script that routinely verifies the port status and closes any ports in the range that may have been opened through some other means.  Sounds like you’ve got things covered, right?  Well, what if one of those ports was opened as part of a change process to deploy a new application to one of those servers, and what if those working on the project knew nothing about the TCP/IP port enforcement script?  They deploy the new application, test it to make sure all is working well, and then send out the email to the user community letting them know the new application has been launched.  A great day for IT enabling the business!  Then, overnight (or the next time the script is scheduled to run), the port is closed.  Users come into work the next day and are unable to access the new application, calls start coming into your service desk, and an all-hands-on-deck meeting is hastily assembled.  After some period of time, the port closure is identified as the issue and the port is reopened, only to be closed again the next time the script runs.  Finally, someone realizes the script is the underlying cause.  It took that long because the person who wrote it is probably no longer there, and the only documentation is a notation in an audit report that a script was the enforcement mechanism selected.

Consider another example, where we have an application that has very low utilization most days except for spikes of activity at the end of each month (such as an application that accepts orders from a dealer network).  Let’s say an engineer is looking for some available equipment to install a new application on and identifies the server running the dealer order system as a good candidate because of its strong specs and low average utilization.  He installs the new app and everything is working great until the end of the month, when the dealer network comes alive with hundreds of orders every hour.  Now, because we have two applications vying for the same physical components, we start to see performance issues and scramble to move the new application to other hardware.  That takes it offline in the process, and it ends up on an available server with lesser specs, so it runs slower than before and irritates the user community even further.  In this scenario, your automation scripts would have done nothing to prevent this drift from the expected configuration (that is, the dealer order system is the only application running on this box), because they would have no awareness that the new application even existed.  What’s more, automation could have actually made things worse if you had employed a strategy to periodically wipe and rebuild your machines (these are referred to as “phoenix servers” and it’s another strategy some have tried to reduce drift).  Had you followed such an approach in this case, your new app would have been erased from your data center entirely at the next rebuild.

So how can you get control over drift and avoid these sorts of issues?

First, the scripts and automations you have running need to be documented including what they do, when they run, and who is responsible for them.  With this information, you can make people proactively aware of any script and configuration conflicts as part of your change and release management process.  This will help you avoid the first example where the TCP/IP port was unexpectedly closed, because your team is more aware of and can account for the fact that there needs to be an exception to your TCP/IP port range – not only updating the script to reflect this but also documenting the exception proactively for your auditors.
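To make this concrete, here is a minimal Python sketch of what a documented enforcement script could look like.  It is an illustration only, not how any particular product works: the `PortEnforcementScript` record, its fields, the `conflicting_scripts` helper, and the example names and port numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PortEnforcementScript:
    """Hypothetical record documenting one enforcement script."""
    name: str             # what the script is called
    owner: str            # who is responsible for it
    schedule: str         # when it runs
    server_class: str     # which class of servers it applies to
    blocked_ports: range  # the port range the policy keeps closed
    # Documented exceptions: server name -> ports allowed to stay open.
    exceptions: Dict[str, Set[int]] = field(default_factory=dict)

    def ports_to_close(self, server: str, open_ports: Set[int]) -> Set[int]:
        """Return the open ports that violate the policy on this server."""
        allowed = self.exceptions.get(server, set())
        return {p for p in open_ports
                if p in self.blocked_ports and p not in allowed}


def conflicting_scripts(registry: List[PortEnforcementScript],
                        server_class: str, port: int) -> List[PortEnforcementScript]:
    """List documented scripts that would close a port a planned change needs."""
    return [s for s in registry
            if s.server_class == server_class and port in s.blocked_ports]


# Example values are made up for illustration.
script = PortEnforcementScript(
    name="close-restricted-ports", owner="security-team",
    schedule="nightly", server_class="app-server",
    blocked_ports=range(8000, 9000))

# During change review: which scripts would undo opening port 8443?
for s in conflicting_scripts([script], "app-server", 8443):
    print(f"Conflict: {s.name} (owner: {s.owner}, runs {s.schedule})")

# Record the approved exception so the script and the auditors both see it.
script.exceptions["app01"] = {8443}
```

Because the exception lives alongside the script's documentation, the same record can drive the enforcement run and answer an auditor's question about why the port is open.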

Second, with accurate documentation about how your environment and key applications are configured, you can better understand why that dealer order system was running on equipment all by itself (because the tribal knowledge about the end of month peak loads was documented), and you can then also compare the current state against the expected state to identify drift issues and take action to address them as appropriate.  For example, you might trigger an incident and assign ownership to the relevant administrators who own the automations for that equipment and/or applications.
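The comparison of current state against expected state can be sketched in a few lines of Python.  This is a simplified illustration under stated assumptions: both states are modeled as flat dictionaries, and the `find_drift` function and the key names such as `app/dealer-orders` are hypothetical, not taken from any tool.

```python
from typing import Any, Dict


def find_drift(expected: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Compare a documented expected state with the observed current state."""
    # Settings present in both states but with different values.
    changed = {k: (expected[k], current[k])
               for k in expected
               if k in current and expected[k] != current[k]}
    # Settings that were documented but are no longer present.
    missing = sorted(k for k in expected if k not in current)
    # Items nobody documented, such as an extra application on the box.
    unexpected = sorted(k for k in current if k not in expected)
    return {"changed": changed, "missing": missing, "unexpected": unexpected}


# Example values are made up for illustration.
expected_state = {"app/dealer-orders": "installed", "tcp/8443": "closed"}
current_state = {"app/dealer-orders": "installed",
                 "app/new-app": "installed",
                 "tcp/8443": "open"}

drift = find_drift(expected_state, current_state)
if any(drift.values()):
    # In practice this is where an incident would be raised and assigned.
    print("Drift detected:", drift)
```

Note that the `unexpected` list is what catches the second scenario above: an enforcement script that only checks the settings it was told about would never report the newly installed application.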

ITinvolve’s Drift Manager can help you implement both capabilities and more.  Drift Manager helps you document scripts and automations as well as “gold standard” configuration settings, leveraging information you already have (via importing or federation) while also capturing undocumented tribal knowledge and validating it through social collaboration methods and peer review.  Drift Manager also helps you compare the current vs. expected state in real time and then facilitates raising drift incidents when required.  What’s more, ITinvolve helps you “broadcast” upcoming configuration changes so all relevant experts are included in your configuration management process and can fully assess the risk and implications to avoid the kinds of issues discussed above.  Finally, it ensures your teams are aware of the policies that govern your resources so that, as configuration changes are being considered, the potential policy impacts are considered at the same time.

No matter your approach, configuration drift will happen.  The question is, do you know about it when it happens and can you get the right experts quickly engaged to address it without causing other issues?

Matt Selheimer
SVP, Marketing
