
The challenge of patching

A large share of enterprise security breaches could be prevented by keeping systems patched. That is a fact. It seems simple enough, and most people working in Information Security see it that way.

However, when you're working on the IT side of things, it's not that simple. Enterprises are complex organisms with many requirements, most of which come down to responsibilities towards customers. At the end of the day, the business must deliver.

There are quite a few factors that make Patch Tuesday a less straightforward task than it seems.

Service Availability

Not everyone working in enterprise is aware of this: there is a contractual commitment with clients called a Service Level Agreement (SLA). As the name indicates, it defines a legal obligation about the level of service the customer hiring your services can expect, agreed in advance by both parties. If those levels are not met, the provider is penalised.

SLAs define service availability, in other words, system uptime. When redundancy is limited for budget or design reasons and rebooting means taking production offline, you start doing the maths: how much downtime does 98.8% availability per month actually allow, and how many updates can you install within that budget?
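As a rough illustration, the arithmetic looks something like the sketch below; the 98.8% figure is just the example above, and the patch window length is an assumption.

```python
# Minimal sketch: how much downtime a monthly availability SLA actually allows.
def downtime_budget_hours(availability_pct: float, days_in_month: int = 30) -> float:
    """Hours of allowed downtime per month for a given availability percentage."""
    return days_in_month * 24 * (1 - availability_pct / 100)

budget = downtime_budget_hours(98.8)   # roughly 8.6 hours per month
patch_cycle_hours = 1.5                # assumed length of one patch-and-reboot cycle
print(f"Downtime budget: {budget:.1f} h/month")
print(f"Patch cycles that fit: {int(budget // patch_cycle_hours)}")
```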

A lot of customers demand availability reports from vendors to claw back some money here and there. When a large enterprise follows that process systematically, it can add up to millions, so it's fairly important to them.

There's an additional element of complexity that not many people keep in mind: time zones. I've had jobs where we had to calculate the only three-hour window in which to patch servers across 8 different time zones, while taking into account the people working during weekends. That window had to be recalculated every time Daylight Saving Time started or ended in different regions of the world.
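For a feel of that calculation, here's a minimal sketch; the region list and the 09:00-18:00 business-hours assumption are hypothetical, and Python's zoneinfo absorbs the DST shifts that force the recalculation.

```python
# Minimal sketch: for each UTC hour of a given date, count the regions that would
# be in local business hours, to eyeball the least disruptive patch window.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

REGIONS = ["Europe/Madrid", "Europe/London", "America/New_York", "America/Chicago",
           "America/Los_Angeles", "Asia/Tokyo", "Asia/Kolkata", "Australia/Sydney"]

def business_hour_conflicts(day: datetime) -> dict[int, list[str]]:
    """Map each UTC hour to the regions that sit within 09:00-18:00 local time."""
    conflicts = {}
    for hour in range(24):
        utc = day.replace(hour=hour, minute=0, second=0, microsecond=0, tzinfo=timezone.utc)
        conflicts[hour] = [tz for tz in REGIONS
                           if 9 <= utc.astimezone(ZoneInfo(tz)).hour < 18]
    return conflicts

# Pick the UTC hours with the fewest regions at work, then re-run after every DST change.
for hour, busy in business_hour_conflicts(datetime(2024, 3, 31)).items():
    print(f"{hour:02d}:00 UTC -> {len(busy)} regions in business hours")
```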

I've given up weekends and nights to make sure patching was done and everything worked correctly. That was overtime my employer was willing to pay, but not all of them are, which adds an extra limitation.

Resources

Normally, the people who assign the budget are detached enough from the business to see the IT department as a money pit. The infrastructure budget is not an easy number to look at when you're a CEO or a CFO. Add licences, other hardware, salaries and other marginal costs, and it becomes one of the primary non-productive expenses. Of all the items on that list, the easiest ones to stretch are people and redundancy. And if redundancy fails, it takes a toll on the people as well.

This points to a fundamental failure on both sides, IT and the executive branch: they don't speak the same language. Traditionally, IT fails to understand that the rest of the business speaks in money. Your CEO's brain won't react the same way to "we're going to get hacked" as it will to "this is the amount of money you will lose if we get hacked".

The executive branch often fails to understand what those people working with cables in that room with a noisy fan mean to the business.

When you have a very small team overloaded with support and project requests just to keep the business moving, anything considered a nice-to-have gets set aside*. If your team is constantly putting out fires, there are no resources to set up any automation or process improvements. This applies to most of the recurring issues in enterprise IT and Software Development.

Having people willing to work outside hours is a constant debate. Occasional work during weekends and nights comes with the role to some degree, and I'd be suspicious of any IT engineer who claims they have never had to perform a task outside hours because it involved downtime or some risk. Not paying overtime won't tilt that debate in the business's favour.

*anything that doesn't generate revenue or directly affect the customer

What about automation?

Good old SCCM, WSUS, Linux package managers, Azure patching and other automation tools have hugely improved the professional quality of life of millions of admins. They still do to this day and will keep getting better. For example, a large portion of Azure maintenance no longer requires a reboot: the primary host runs its updates while your paused VMs get live-migrated to an already updated host.

I am a big fan of automated patching, but it needs to be designed and implemented very thoroughly to avoid a disaster caused by an update that breaks components on one of your servers. That means design and testing, which involves a testing pool containing all of your server types that gets updated a week prior to the rest of the servers, and it may require help from QA if your applications are at risk of failing.
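A minimal sketch of what that staggered rollout can look like, assuming hypothetical ring names and deferral offsets counted from Patch Tuesday (the second Tuesday of the month):

```python
# Minimal sketch: derive per-ring patching dates from Patch Tuesday.
# Ring names and offsets are hypothetical; adjust them to your environment.
from datetime import date, timedelta

RINGS = {
    "test-pool": 0,    # one of each server type, patched straight away
    "non-prod": 3,     # a few days later, if the test pool survived
    "production": 7,   # a week after the test pool, as described above
}

def patch_tuesday(year: int, month: int) -> date:
    """Second Tuesday of the given month."""
    first = date(year, month, 1)
    first_tuesday = first + timedelta(days=(1 - first.weekday()) % 7)  # Monday=0
    return first_tuesday + timedelta(days=7)

def ring_schedule(year: int, month: int) -> dict[str, date]:
    base = patch_tuesday(year, month)
    return {ring: base + timedelta(days=offset) for ring, offset in RINGS.items()}

print(ring_schedule(2024, 6))  # test-pool on Patch Tuesday, production a week later
```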

The biggest downfall, again, is the lack of time or resources to implement that automation, which normally points to a management issue (or a lack of management).

Let me give you a piece of advice: back up your patch management system's configuration regularly in case the installation becomes corrupt. Not fun.
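A minimal sketch of that habit, assuming your patching tool can export its configuration to a directory; the paths and retention count here are placeholders.

```python
# Minimal sketch: keep timestamped copies of an exported patching configuration.
# Whatever export mechanism your tool actually offers runs before this step.
import shutil
from datetime import datetime
from pathlib import Path

EXPORT_DIR = Path("/srv/patching/config-export")   # hypothetical export location
BACKUP_ROOT = Path("/backups/patching-config")     # hypothetical backup destination
KEEP = 12                                          # number of backups to retain

def backup_config() -> Path:
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = BACKUP_ROOT / stamp
    shutil.copytree(EXPORT_DIR, target)            # creates BACKUP_ROOT if missing
    # Prune the oldest backups beyond the retention limit.
    for old in sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir())[:-KEEP]:
        shutil.rmtree(old)
    return target

if __name__ == "__main__":
    print(f"Backed up to {backup_config()}")
```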

Automated patching requires human supervision as well: tasks fail, update packages get corrupted, and rules occasionally need to be changed.
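That supervision can start with something as simple as a report that flags what needs a human. The sketch below works against a hypothetical CSV export of patch job results; the column names and thresholds are made up.

```python
# Minimal sketch: flag servers whose last patch job failed or is overdue.
# The CSV layout (server,last_run,status) is hypothetical; map it to the report
# your patching tool actually produces.
import csv
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=40)  # assumed: anything untouched for over a cycle is overdue

def needs_attention(report_path: str) -> list[str]:
    flagged = []
    with open(report_path, newline="") as fh:
        for row in csv.DictReader(fh):
            last_run = datetime.fromisoformat(row["last_run"])
            if row["status"].lower() != "success" or datetime.now() - last_run > MAX_AGE:
                flagged.append(row["server"])
    return flagged

if __name__ == "__main__":
    for server in needs_attention("patch_report.csv"):
        print(f"Check manually: {server}")
```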

Technical constraints

In every company I have worked for, there was at least one old server we were afraid to touch because of how fragile it was, let alone install anything on it. "If we reboot it, we can't guarantee it will come back up" is a sentence I've said more times in my career than I would like.

Everyone in IT has a story about a junior engineer rebooting a machine or a host with years of uptime that never came back online.

The same applies to software. You can have the most up-to-date OS, but if the legacy application running on it is duct-taped together and won't survive a .NET Framework upgrade, you're navigating the same muddy waters.

Doing the risk assessment to work out which update ships the component that will break your software is time-consuming as well.

Software is hard. Persistent legacy software is inevitable: there are managers who don't want to pay for a new licence, customers who don't want to change versions, and users who don't want to learn a new interface. It is a nightmare for everyone, and getting rid of it is a very lengthy and difficult negotiation process.

Last but not least, there's the risk management of having to roll back: what will happen, what could possibly break, and are we comfortable with the rollback process?

And while the policy says everything shall be up to date, the real world moves at a slower pace.

On the bright side, your home computer should be easy enough to patch. So don't forget to install your updates!
