The IT Handbook

Resilient IT departments – are you one of them?


Companies of all sizes rely heavily on IT systems in today’s business world. Even the smallest companies cannot operate without a fully functioning set of IT services, whether that is a single bar code scanner or an entire company email system. The fundamental message is that IT is important in every organisation, and IT Managers face the constant challenge of reviewing potential outage scenarios, assessing the chance of them occurring and putting realistic measures in place to cope with the inevitable situations where IT is not fully functioning.

The most important task of any IT department, whether internal or outsourced, is to establish and manage expectations at a user and stakeholder level. Without realistic expectations, stress and disappointment will spiral, leaving users frustrated and IT departments stressed, worried and frantically running around to fix things in a tight time frame.

Whether formally or informally, a service level agreement (SLA) needs to be documented and agreed with management. Without this, the IT department has nothing to build its resilience strategy around, because all parties concerned are likely to have different views on what is expected and acceptable. The first step is to assess the risk of each IT service malfunctioning or becoming inaccessible, with causes ranging from hard disk failures to power outages and natural disasters. Each risk then needs to be rated for both probability and business impact.

Metaphorical digger affecting your power supply? (Photo credit: realityfanclub)

For example, a natural disaster wiping out access to all systems may well be catastrophic in business terms, but if the probability of it happening is very remote, only basic resilience measures need to be put in place: backup drives (which provide resilience in a number of scenarios) and a contingency plan to get them online are probably all that is required. However, every company is different. Different things hold different values to different businesses, and outside risks such as fire, theft and natural disasters vary from site to site, let alone city to city.
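As a rough illustration of rating risks by probability and business impact, the sketch below groups a small risk register into matrix cells. The three-level scale, the example risks and their ratings are all assumptions made for the example, not figures from any real assessment.

```python
from collections import defaultdict

# Hypothetical three-level scale; a real matrix may use more levels.
LEVELS = ("low", "medium", "high")

# Illustrative risk register: (risk, probability, business impact).
risks = [
    ("Hard disk failure", "high", "medium"),
    ("Power outage", "medium", "high"),
    ("Natural disaster", "low", "high"),
    ("Bar code scanner fault", "medium", "low"),
]

def build_matrix(register):
    """Group risks into cells keyed by (impact, probability)."""
    matrix = defaultdict(list)
    for name, probability, impact in register:
        if probability not in LEVELS or impact not in LEVELS:
            raise ValueError(f"unknown level for {name!r}")
        matrix[(impact, probability)].append(name)
    return matrix

def render(matrix):
    """Return the matrix as text rows, highest impact first."""
    lines = []
    for impact in reversed(LEVELS):
        cells = []
        for probability in LEVELS:
            names = matrix.get((impact, probability), [])
            cells.append(f"{probability} probability: {', '.join(names) or '-'}")
        lines.append(f"{impact} impact | " + " | ".join(cells))
    return "\n".join(lines)

matrix = build_matrix(risks)
print(render(matrix))
```

The natural disaster lands in the high impact, low probability cell, which matches the reasoning above: serious if it happens, but remote enough that basic measures may be sufficient.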

Once each risk is assessed and the impact vs. probability matrix has been drawn up, the next step is to agree a set of acceptable outage time frames and frequencies with management. From here, you will have an SLA on which to base your resilience strategy. Resilience comes in two forms: proactive and reactive. Both are needed to ensure a successful IT department and therefore a successful company.

Proactive measures are ones which are put in place in advance, in case of an incident. The key here is to control the controllables, a phrase that sums up what needs to be done: putting resilience measures in place for what you can control tackles what is reasonably possible. Some would argue they cannot control the metaphorical digger destroying electricity cables, but you can put measures in place should the situation arise.

Reactive measures are those which must be carried out after something has occurred and IT is not functioning correctly. They cannot be performed before a disaster scenario occurs, but they are still planned. Recovering data from disks or purchasing new hardware are reactive measures, yet the steps and methods for them are written in advance.

Risk Matrix (Photo credit: Martin Burns)

The priority is to ensure that the IT services with the highest impact in the risk matrix are dealt with first, in descending order of probability. These are your business-critical IT services (marked in red), most likely email, file servers and, for e-commerce businesses, your website. Anything that the company cannot operate without must not have a single point of failure; this could be electricity, internet cables or even backups.
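The ordering described here, highest impact first and then descending probability, can be sketched as a simple sort. The service names, impact ratings and probability figures below are invented for illustration only.

```python
from dataclasses import dataclass

# Assumed ranking of impact levels; higher numbers are dealt with first.
IMPACT_RANK = {"high": 3, "medium": 2, "low": 1}

@dataclass
class Service:
    name: str
    impact: str          # business impact if the service is lost
    probability: float   # estimated chance of an outage in a year, 0.0 to 1.0

def prioritise(services):
    """Order services by highest impact first, then descending probability."""
    return sorted(
        services,
        key=lambda s: (IMPACT_RANK[s.impact], s.probability),
        reverse=True,
    )

# Hypothetical services and estimates.
services = [
    Service("Office printer", "low", 0.6),
    Service("File server", "high", 0.2),
    Service("Email", "high", 0.4),
    Service("Intranet wiki", "medium", 0.3),
]

ordered = prioritise(services)
for position, service in enumerate(ordered, start=1):
    print(position, service.name, service.impact, service.probability)
```

Note that the office printer comes last despite having the highest outage probability, because impact is considered before probability.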

It is critical to test backups regularly. Imagine the horrifying scenario where your file servers have been stolen and you have just a single backup. You’ve never tested the backup or, worse still, never checked that it has been running on a regular basis. Heads will roll in organisations if you cannot rely on the backup the one time it’s needed. Proactive companies will have regular testing regimes, not only for data backups but for configuration and disk images too. After all, the technical among us will understand how long it takes to build a server and network configuration from scratch. A hint for those that don’t: it’s much quicker to have a backup! In addition to this, it’s very sensible to keep a protected, tested backup set of all configurations and disk images that is proven reliable. Testing and replacing this backup set two to four times a year will make for a faster disaster recovery.
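A minimal sketch of an automated backup check is shown below. It assumes a checksum was recorded when the backup was taken, and it only confirms that the backup file exists, is recent and is not corrupted. It is not a substitute for a full test restore, and the file name and one-day age limit are assumptions for the example.

```python
import hashlib
import tempfile
import time
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_backup(backup_file, recorded_sha256, max_age_days=1, now=None):
    """Return a list of problems found; an empty list means the check passed."""
    backup_file = Path(backup_file)
    if not backup_file.exists():
        return ["backup file is missing"]
    problems = []
    now = time.time() if now is None else now
    # File modification time is used as a stand-in for "when the backup last ran".
    age_days = (now - backup_file.stat().st_mtime) / 86400
    if age_days > max_age_days:
        problems.append(f"backup is {age_days:.1f} days old")
    if sha256_of(backup_file) != recorded_sha256:
        problems.append("backup does not match its recorded checksum")
    return problems

# Demonstration with a throwaway file standing in for a real backup.
with tempfile.TemporaryDirectory() as folder:
    backup = Path(folder) / "fileserver.bak"
    backup.write_bytes(b"example backup contents")
    recorded = sha256_of(backup)
    print(check_backup(backup, recorded))
    backup.write_bytes(b"corrupted")
    print(check_backup(backup, recorded))
```

Running a check like this on a schedule covers the "has it been running?" question; the periodic restore test still has to be done by actually restoring data.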

Once your company has achieved a setup where there are no single points of failure for high and medium impact systems (at a minimum), you are getting towards something acceptable. Of course, each company has different acceptable levels of proactive measures, and in some cases two or three resilience measures are deemed the minimum acceptable level of protection. A local shop, for example, will not require a second site set up as a contingency, but a government office or the Bank of England most certainly would! Some of the companies that were sadly victims of the September 11 attacks on the World Trade Center had second sites for their businesses, which may have seemed extravagant prior to the incident but in hindsight proved powerful and valuable.
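The "no single points of failure for high and medium impact systems" check can be sketched against a simple inventory. The inventory structure, the services and the counts of independent instances below are all hypothetical.

```python
# Hypothetical inventory: for each service, how many independent
# instances exist of each thing it depends on.
inventory = {
    "Email": {
        "impact": "high",
        "depends_on": {"power feed": 2, "internet link": 1, "backup set": 2},
    },
    "File server": {
        "impact": "high",
        "depends_on": {"power feed": 2, "disk array": 2, "backup set": 1},
    },
    "Office printer": {
        "impact": "low",
        "depends_on": {"power feed": 1},
    },
}

def single_points_of_failure(inventory, impacts=("high", "medium")):
    """List (service, dependency) pairs with fewer than two instances."""
    findings = []
    for service, details in inventory.items():
        if details["impact"] not in impacts:
            continue  # lower impact services are outside the minimum standard
        for dependency, count in details["depends_on"].items():
            if count < 2:
                findings.append((service, dependency))
    return findings

for service, dependency in single_points_of_failure(inventory):
    print(f"{service}: single point of failure in {dependency}")
```

A company that requires two or three resilience measures per dependency could raise the threshold from two to a higher number.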

There is a common misconception about resilience: it all seems over the top until you rely on it. Even having multiple backup sets seems over the top to some, but if they’re used once, they’re worth their weight in gold.

Resilience measures are designed to provide an acceptable functional standard, but only as an interim until normal service resumes. It is therefore very important to have a well-written, maintained and understood disaster recovery plan. A disaster recovery plan is designed so that anyone can pick it up and know how to begin putting a recovery operation in place. The first instruction might be as simple as ‘contact the IT manager, or A. N. Other, if a disaster scenario has occurred.’ It could be more complex, and we will cover disaster recovery in detail in a future post.

To summarise, a company with a resilient IT infrastructure is one which is better placed to succeed beyond its competitors. The reality that some stakeholders find hard to accept is that IT issues occur, they’re usually unpredictable and they need to be planned for. Investing in proactive measures means you can stay on top of your game at times when your rivals might not be able to. IT is a core and integral part of every business, and it is the responsibility of IT Managers and CIOs to ensure their department is robust enough to cope with knock-backs.
