Archive for August, 2013

Oops, I Brought Down the Bank (and how it could have been avoided)

Wednesday, August 28th, 2013

Prior to working for ITinvolve, I spent many years managing IT Infrastructure & Operations teams. One such team was a group of engineers responsible for mid-range servers at a large bank in the New York area. We were a big NetWare and Notes shop at the time, and many of our business applications were Notes apps that read from flat files or Btrieve databases hosted on drives mapped to NetWare file servers. We didn’t really understand the business criticality of many of these apps until a bright Wednesday morning in November.

What started with a simple need to make a few changes to the master login script for the bank turned into a full-blown catastrophe (and one that we could have avoided if we had been using a solution like ITinvolve). Here’s what happened.

Since changes to our login script didn’t occur often, we did a little extra due diligence to make sure we had a backup of the script, and decided to make the change very early in the morning so that there would be time to fix it if we (I) screwed up. We thought we had done a good job of planning for the change, and my director and VP agreed.

That Wednesday morning I took the 6:10AM train into work, powered up my laptop, grabbed a cup of coffee and fired up the proper utilities to make the changes, which I had written down in an email (sound familiar?). My plan was to just copy and paste between the email and the NetWare admin, a two-minute job, max. As I was copying and pasting the information back and forth, though, a few folks walked in and we started chatting about Thanksgiving plans, while I continued to work on the script changes.

After they left, I hit the save button, which of course brought up the “Are you sure?” prompt. As I moved the mouse to the “yes” button and began to raise my finger to click it, I just happened to notice that behind the window was a completely blank login script. All of a sudden everything went into slow motion as the index finger on my right hand clicked the mouse button. Even though I had noticed in that split second, it was too late. I had just erased the login script for all of the bank employees.

At that point, I thought, “Well, thankfully, we did a backup.” However, the backup person wasn’t due to arrive until 8:00AM. So I checked with another ops guy, asking him about restoring from the backup, and he said, “Sure, what file do you need?” I said, “It’s not a file; it’s an NDS object.” “A what? I don’t know what that is or how to restore it,” he said.

At that point, I started to panic and thought, “Okay, maybe I can piece it back from memory.” However, the script was around 15 pages long and contained lots of detailed conditions like “If Member of Group A then MAP N:=”, and so on. I quickly realized there was no way I could rebuild it from memory. Grasping for any solution, I thought, “Someone must have a printout of it somewhere. Who wrote this thing in the first place? He or she might know. Was there change history I could leverage?”

It turns out that no fewer than thirty people had a hand in writing and contributing to the script over the years. Of the thirty, maybe ten still worked for the bank and most had been transferred to different departments by now and wouldn’t likely remember what they had written years ago. I managed to find a couple of old hard copies of the login script, but the newest one was three years old and I was advised not to use it because it had more than likely changed a lot since then.

So I got on the phones and started a few escalations. First, I escalated the backup to restore the NDS object. The ops team was already working on this because of my earlier conversation with them, but the progress bar said it wouldn’t complete for four hours. It was just a little 50KB piece of data, but because of some other dependencies, it wouldn’t be restored until 11:00AM. Next, I had to call my boss and let him know what was happening. He asked me if I knew what the potential impacts might be and I said, “Well, people won’t be able to get email, people won’t be able to access files on mapped drives and they won’t be able to print anything.” He asked me to call his boss, our VP of Infrastructure, to let him know what was going on.

This was the hardest call I ever had to make in my career. When he answered, I said who I was and walked him through my epic mistake. I told him what we were doing to recover and what the ETA of the completed recovery would be. He was silent. He asked me what I thought the business impact of this was. I went on about mapped drives and printers and he stopped me mid-sentence. He said that he knew the technology impacts; what he wanted me to tell him was the business impact. He then began to run through a long list of services that the bank provided that would not be able to function without the proper infrastructure mappings in place: the ATM network, branches not being able to open, checks not being printed, and the fact that many of our thousands of employees would not be able to do their jobs.

Based on his quick assessments, it was clear that I had pretty much just shut down a $12,000,000,000 bank. I really wished I could go back in time and undo what I had done.

Unfortunately, these types of simple mistakes happen in IT all the time. Maybe you haven’t brought down a major bank before, but I bet you can relate to at least a few stories where simple mistakes and unintended consequences from changes resulted in a BIG negative impact on your business.

If I had had access to better documentation and a running history of changes made to the script, I most likely could have recreated the script from scratch in an hour or less. If I had actually known the business value of this login script and the dependency on it for business-critical functions, I wouldn’t have touched it without a much more rigorous impact analysis. I would have collaborated with my peers (and even business stakeholders) to identify ways to reduce the risk, and then tested the change first. And I most certainly wouldn’t have chatted with my colleagues during the change.
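To make the idea of a "running history of changes" concrete, here is a minimal Python sketch. It is purely illustrative: the `ScriptHistory` class, its method names, and the placeholder script text are all hypothetical and have nothing to do with NetWare, NDS, or ITinvolve. It simply shows two safeguards that would have helped that morning: refusing to save a blank script, and keeping every prior version so the last good one can be restored in seconds.

```python
import difflib
from datetime import datetime, timezone


class ScriptHistory:
    """Hypothetical in-memory change log for a single script."""

    def __init__(self, initial_text: str):
        # Each entry is (timestamp, author, full script text).
        self.versions = [(datetime.now(timezone.utc), "initial", initial_text)]

    @property
    def current(self) -> str:
        return self.versions[-1][2]

    def save(self, new_text: str, author: str) -> None:
        # Guard: never let a blank script replace a working one.
        if not new_text.strip():
            raise ValueError("refusing to save an empty script")
        self.versions.append((datetime.now(timezone.utc), author, new_text))

    def diff_last(self) -> list[str]:
        # Show what the most recent save changed, for review.
        if len(self.versions) < 2:
            return []
        old = self.versions[-2][2].splitlines()
        new = self.versions[-1][2].splitlines()
        return list(
            difflib.unified_diff(
                old, new, fromfile="previous", tofile="current", lineterm=""
            )
        )

    def rollback(self) -> str:
        # Drop the latest version (if any) and return the restored text.
        if len(self.versions) > 1:
            self.versions.pop()
        return self.current
```

With something like this in place, an accidental blank save is rejected outright, and a bad but non-empty change can be reviewed with `diff_last()` and undone with `rollback()` instead of waiting on a full restore.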

With a collaborative IT management solution like ITinvolve, all of this would have been possible, so I would have had a much better chance of avoiding this mistake. And if it had still occurred, we could have recovered much more quickly and avoided a catastrophic four-hour business outage.

If you’ve had a recent business impact caused by an IT change in your organization, give us a call so we can discuss how ITinvolve can help you avoid such issues in the future.

Joe Rogers
Director, Technical Services

More Infrastructure Changes with Less Risk

Thursday, August 22nd, 2013

Ask any Infrastructure & Operations leader if they’d like to handle more infrastructure changes with less risk to their business and you will get a resounding, “Yes!” However, this has been an elusive and often frustrating goal for many. In fact, most IT organizations have so locked down their change process in order to avoid risks that the pace of change is little more than a crawl. Yet, 80% of business outages are still caused by IT changes. (A CIO of a major airline actually told me recently that, for his company, it’s more like 98% of business outages that are caused by IT changes. Ouch.)

Just last week, the New York Times experienced a high-profile website and mobile application outage for three hours. At first there was speculation of a cyber attack (they had reported a denial of service attack some months earlier). But how frustrating it must have been for their spokesperson and management to say the cause was actually IT maintenance:

“The outage occurred within seconds of a scheduled maintenance update, which we believe was the cause,” Times spokeswoman Eileen Murphy said.

As every I&O leader knows all too well, even when changes are well-intentioned, things break. Our IT environments are becoming more and more complex and the lines and relationships between this component and that one aren’t as simple as “the knee bone connects to the leg bone.” Often, there are multiple-degree relationships between components that are hidden from understanding, and critical knowledge that isn’t documented anywhere but resides only in the heads of experts who may be on vacation, or who were promoted, left, or were let go long ago.

Without a new approach, Infrastructure & Operations organizations will continue to struggle with the pace of infrastructure changes and will generate frequent, unacceptable service interruptions leaving everyone on the business side with a bitter taste in their mouths.

That’s where ITinvolve comes in, because we have taken a fundamentally different approach that combines knowledge, analysis, visualization, and collaboration in one solution designed for IT to accelerate changes while reducing risks. Check out this quick video to see it in action.

With ITinvolve, you will:

  • Quickly understand and visualize the impact of IT infrastructure changes
  • Engage all relevant stakeholders to assess the risk of those changes
  • Ensure exactly the right information is delivered to those who need it when they need it

The net result?

  • Faster change execution
  • Minimization of business risk
  • Increased change throughput
  • Reduction in unplanned work from IT changes
  • Improved IT performance, reliability, and security (by adopting patches and upgrades more quickly)
  • Improved change success rate

Just experiencing one of these benefits should be worthy of a conversation with one of our IT collaboration specialists. Contact us to get the discussion going. Certainly, it’s better than the status quo.

Matt Selheimer
VP, Marketing

Get the right info to the right people to make more accurate and faster decisions

Tuesday, August 6th, 2013

Every day, IT teams are under pressure to make quick yet accurate decisions. However, because IT organizations don’t typically have their collective knowledge easily accessible and usable in one place, these decisions are often based on incomplete and out-of-date information.

If not having the right information available at the right time for the right people to make good decisions is a challenge you struggle with, you are not alone.

In earlier posts in this blog series, I’ve talked about how you can use ITinvolve’s unique crowdsourcing and data federation capabilities to capture both systems-based and tribal knowledge to create a trusted, big-picture view of your knowledge across each of your technology elements, policies, applications, and more.

Next you need to be able to visualize how all the elements of your environment, your policies, and your applications come together to deliver services to your business. ITinvolve provides this capability through what we call Perspectives.

Think of a perspective as a point of view on the objects and relationships necessary to deliver a service offering. Recalling our earlier analogy of the house with the ‘Million Dollar’ view, perhaps you and your spouse as well as your children all consider the view a key attribute when shopping for a home. But maybe you are a car aficionado and want a three-car garage, so from your perspective that is an important part of the home-buying decision. Perhaps your spouse likes to garden, so having a large enough green space is important to their perspective. And maybe your children want to be close to a playground or on a cul-de-sac where they can play freely without traffic, so that’s part of their perspective on which house is right for them.

In IT, we have the same situation. Let’s take the example of a business application that supports Marketing. The application administrator’s perspective will include things like the application itself, the application server, and the underlying database. Because the application contains prospect and customer data, a security administrator would care about the company’s customer privacy policy and how that governs this application as well as others. And a DBA’s perspective might include the Marketing application’s database, the underlying server, and other databases running on the same server that support other applications.

Each of these is a valuable perspective when making a decision, such as an infrastructure change that will impact the Marketing application. And each of these stakeholders should be brought together to collaborate and provide their risk assessment of the change. This is exactly what ITinvolve does and how we leverage your organization’s collective knowledge to provide impact analysis and proactively engage stakeholders so you can get the right info to the right people to make more accurate and faster decisions.
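One way to picture perspectives and impact analysis is as a small dependency graph plus a set of objects each role cares about. The Python sketch below is only an illustration of that idea under my own assumptions; it is not how ITinvolve is implemented, and every name in it (`DEPENDS_ON`, `PERSPECTIVES`, `impacted_by`, `stakeholders_for`, and the billing items) is hypothetical.

```python
from collections import deque

# Hypothetical relationships: each key depends on the items in its list.
DEPENDS_ON = {
    "marketing_app": ["app_server", "marketing_db", "privacy_policy"],
    "marketing_db": ["db_server"],
    "billing_db": ["db_server"],
    "billing_app": ["billing_db"],
}

# Hypothetical perspectives: the objects each role cares about.
PERSPECTIVES = {
    "app_admin": {"marketing_app", "app_server", "marketing_db"},
    "security_admin": {"privacy_policy", "marketing_app"},
    "dba": {"marketing_db", "db_server", "billing_db"},
}


def impacted_by(changed: str) -> set[str]:
    """Return the changed item plus everything that depends on it,
    directly or through any number of intermediate relationships."""
    # Invert the graph so we can walk from an item to its dependents.
    dependents: dict[str, set[str]] = {}
    for item, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(item)

    seen = {changed}
    queue = deque([changed])
    while queue:  # breadth-first walk over dependents
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def stakeholders_for(changed: str) -> list[str]:
    """Roles whose perspective overlaps anything the change touches."""
    impacted = impacted_by(changed)
    return sorted(
        role for role, objects in PERSPECTIVES.items() if objects & impacted
    )


print(stakeholders_for("db_server"))       # ['app_admin', 'dba', 'security_admin']
print(stakeholders_for("privacy_policy"))  # ['app_admin', 'security_admin']
```

In this toy model, a change to the database server reaches the Marketing application two hops away, so all three roles would be asked to weigh in, while a privacy policy change involves only the roles whose perspectives include the policy or the application it governs.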

Check out this 3-minute video to see how it works. If you’re interested in learning more about how you can get the right info to the right people at the right time, sign up for a free trial.

Matt Selheimer
VP, Marketing