IT Infrastructure Operations & Problem Management

What does a standard IT department look like? We are familiar with calling a help desk when there’s a failure, but managing an IT department involves much more than a help desk technician on the phone. By understanding standard IT management and support practices, the modern network architect can build a more effective network infrastructure.

General operations roles are made up of those who manage the daily systems of the network and those who fix problems found on the network.

Managing daily systems of the network:

  • Management of a network infrastructure is the role of Operations, headed by the operations manager.
  • The Facilities team falls under Operations and manages the day-to-day hardware and software in a company’s network infrastructure.
  • Incident Management, also under Operations, manages the resolution of failures across the network when the issues are known.

Fixing problems found on the network:

  • Problem Management resolves unknown issues across the infrastructure.
  • Event Management is the early warning system for the network, designed to catch problems before an actual failure occurs on the system.

Over 18 years ago, there was no automated event management system. The process to review and discover issues was based on watching the autoexec.bat file for errors when the system booted. A troubleshooter would then manually check disk space, memory allocation and log files on the system. The first two hours of a shift were spent reviewing log files and researching (without the internet) possible meanings for an error.

As networked systems became standard, they grew too large to manage manually. Event management systems automate this process. Managed services focus on the event management system, which makes possible the types of Service Level Agreements we see in cloud systems. Working with a facilities team, event management systems help teams manage hundreds or even thousands of servers.
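The manual checks described above (disk space, memory allocation) are the kind of work an event management system automates. As a minimal sketch only, the check might look like the following. The `Metric` type, the metric names, the host names and the threshold values are all hypothetical and chosen for illustration; a real system would collect these readings from agents on each server.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    host: str
    name: str     # hypothetical metric names, e.g. "disk_used_pct"
    value: float  # percentage in use


# Hypothetical warning thresholds; real values depend on the environment.
THRESHOLDS = {"disk_used_pct": 85.0, "memory_used_pct": 90.0}


def check_events(metrics):
    """Return a warning for every metric at or above its threshold."""
    events = []
    for m in metrics:
        limit = THRESHOLDS.get(m.name)
        if limit is not None and m.value >= limit:
            events.append(
                f"WARNING {m.host}: {m.name} at {m.value:.0f}% (limit {limit:.0f}%)"
            )
    return events


# Sample readings (made up for the example).
samples = [
    Metric("srv-01", "disk_used_pct", 91.0),
    Metric("srv-01", "memory_used_pct", 40.0),
    Metric("srv-02", "memory_used_pct", 95.0),
]

for event in check_events(samples):
    print(event)
```

The point of the sketch is that the warning is raised before the disk or memory is exhausted, which is what lets a small team watch a large number of servers.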

Two teams manage failures on the network

The incident management team is responsible for bringing systems back online as quickly as possible. Every minute a system is down is lost profit for the business.

In contrast, the problem management team is responsible for preventing the incidents from recurring.

"How can an incident be resolved if it’s not fixed?"

A typical example:  A failing server is rebooted.

The server runs fine, but after a day or two, it begins to slow and fail. It’s discovered later that the software is not releasing memory after an operation is completed (a memory leak).

Rebooting clears the memory and the system runs fine. Yet inevitably the RAM fills again because the software is still not releasing memory. The system slows down until it once again begins failing.

In this example, the incident management team's job is to reboot the server and get the system up and running. The problem management team's job is to review the memory dump to verify the memory leak.
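The reboot cycle in this example can be modeled with a toy simulation. This is an illustrative sketch, not a diagnostic tool: the `Server` class, the RAM size and the leak rate are invented numbers, and the only point is that a reboot resets the symptom while leaving the cause in place.

```python
class Server:
    """Toy model of a server whose software leaks memory on every operation."""

    def __init__(self, ram_mb=1024, leak_mb_per_op=4):
        self.ram_mb = ram_mb
        self.leak_mb_per_op = leak_mb_per_op
        self.leaked_mb = 0

    def run_operation(self):
        # The bug: memory claimed for the operation is never released.
        self.leaked_mb += self.leak_mb_per_op

    def is_failing(self):
        return self.leaked_mb >= self.ram_mb

    def reboot(self):
        # Clears RAM but leaves the leaking software untouched.
        self.leaked_mb = 0


def operations_until_failure(server):
    """Run operations until the server fails and return how many it took."""
    count = 0
    while not server.is_failing():
        server.run_operation()
        count += 1
    return count


server = Server()
first = operations_until_failure(server)   # incident occurs
server.reboot()                            # incident management: back online
second = operations_until_failure(server)  # the same incident occurs again
print(first, second)
```

Both runs fail after the same number of operations, which is why the incident is "resolved" by the reboot yet the problem is not fixed.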

The Conflict:

A conflict exists because capturing the memory dump takes time and must happen before the system is rebooted. The incident management team presses for a reboot, while the problem management team presses for time to collect the memory dump.

Incident management is not concerned with the root cause of the problem; its goal is to get the system back online. Problem management is focused on stopping the incident from happening again. This conflict of interest keeps both teams in constant tension. It is an expected conflict, and the operations manager becomes the ultimate referee.

Rebooting a server may make the system operational again, but it does not address the root cause of the problem, and eventually the failure will happen again.

Going Deeper with Support:

Often, the incident management team is broken into support levels referred to as tiers. The first tier is a triage level. The tier 1 support technician will try to identify the problem to: 

a)  Fix a known error
      or
b)  Pass the incident to the appropriate tier 2 support team.

The tier 2 teams investigate known issues to find a solution to the problem. The tier 3 support technicians have the deepest level of training in the specific technology and are responsible for the final resolution of all incidents.

Tier 3 support may put together a major incident team. The tier 3 team’s responsibility includes contacting anyone and everyone needed to resolve the incident, including outside vendors and the technical support teams of software and hardware manufacturers. Major incident teams are put together to coordinate, document and manage this final stage of the incident process.
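The tiered escalation path can be sketched as a simple routing function. This is a minimal sketch under stated assumptions: the function name `route_incident`, the symptom strings and the fixes in both dictionaries are hypothetical, and real triage relies on a technician's judgment rather than an exact string match.

```python
# Hypothetical lookup tables: symptom -> documented fix.
TIER1_FIXES = {
    "password expired": "Reset the password",
    "printer offline": "Restart the print spooler",
}
TIER2_FIXES = {
    "mail queue stalled": "Flush the queue and restart the mail service",
}

TIER3_ACTION = "Escalate to tier 3: form a major incident team and engage vendors"


def route_incident(symptom, tier1_fixes=TIER1_FIXES, tier2_fixes=TIER2_FIXES):
    """Return (tier, action) for the lowest tier that can handle the symptom."""
    if symptom in tier1_fixes:
        # Tier 1 triage: fix a known error directly.
        return 1, tier1_fixes[symptom]
    if symptom in tier2_fixes:
        # Tier 2: the appropriate team has a known solution.
        return 2, tier2_fixes[symptom]
    # Nothing known at tiers 1 or 2: final resolution sits with tier 3.
    return 3, TIER3_ACTION


for symptom in ("password expired", "mail queue stalled", "server slows after two days"):
    tier, action = route_incident(symptom)
    print(f"tier {tier}: {action}")
```

The design choice worth noting is that each incident stops at the lowest tier able to resolve it, which keeps the most specialized technicians free for failures nobody has seen before.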

Resolving the incident means bringing the system back online and functional. Once an incident is resolved, the incident team’s job is complete but the problem management team's job has just begun.

Known or Unknown?

Resolved major incidents are discussed by the operations management team to determine whether each one is a known issue. If not, the incident becomes a problem and is passed to the problem management team.

Problem management’s job is to find the root cause for each problem ticket by looking at the hardware, software, drivers and other possible causes. They may bring in the manufacturers who developed the components that failed. Once the root cause is determined, the cause, symptoms, fix and/or workaround are documented. The solution to the problem is placed into the incident team’s database. The incident team can now resolve the known issue without escalating it to the top tiers.
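The hand-off from problem management back to the incident team can be sketched as a small known-error database. This is an assumption-laden illustration: the `KnownError` and `KnownErrorDatabase` names, their fields and the sample entry are hypothetical, and a production system would use a ticketing tool rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KnownError:
    cause: str
    symptoms: List[str]
    fix: Optional[str] = None
    workaround: Optional[str] = None


class KnownErrorDatabase:
    """In-memory stand-in for the incident team's database."""

    def __init__(self):
        self._by_symptom = {}

    def record(self, entry):
        # Problem management documents the cause, symptoms, fix and workaround.
        for symptom in entry.symptoms:
            self._by_symptom[symptom.lower()] = entry

    def lookup(self, symptom):
        # The incident team checks for a known issue before escalating.
        return self._by_symptom.get(symptom.lower())


db = KnownErrorDatabase()
symptom = "Server slows after two days"

print(db.lookup(symptom))  # None: unknown issue, so it must be escalated

db.record(KnownError(
    cause="Memory leak in the application",
    symptoms=[symptom],
    fix="Apply the vendor patch",  # hypothetical fix
    workaround="Schedule a reboot before memory fills",
))

entry = db.lookup("server slows after two days")
print(entry.workaround)
```

After the entry is recorded, the same symptom resolves to a documented workaround at the first lookup instead of climbing the tiers again.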

Finally:

In this way, an IT department maintains a network.

  • Day-to-day management is handled by the facilities team.
  • Incidents are failures that are managed by the incident team.
  • Incidents are managed through three levels of support.
  • Problems are failures without a known cause.
  • Problem management determines the cause and the solution, and records both in the incident support database.
  • Finally, the entire team is managed by the operations manager.

These are the core roles within the network infrastructure operations team.

It can appear complicated, and the conflicts are not always easily resolved. If your teams are not fully in sync, please contact me. I would be happy to talk to you about where your systems and teams are breaking down.
