# Glossary

This section acts as a glossary of Technical Operations terms and provides definitions of:

- key terms guided by Information Technology Infrastructure Library (ITIL) best practices
- Key Performance Indicators (KPIs), that is, metrics that help determine whether specific incident management goals are met

# Key terms

Change management: Change management is the process of recording, approving, executing, closing, and reviewing all changes. A change can be either contractual (such as the initial signing of a contract, an SLA upgrade, a need for new resources, and so on) or operational, arising from a change request.

Escalation: The acknowledgement of the fact that an incident requires additional resources in order to meet service level targets or user expectations, taking into account the criticality, impact, and urgency of the incident.

Helpdesk/Service Desk: The Single Point of Contact between the service provider and users. A typical Service Desk manages incidents and service requests, and also handles communication with the users.

Incident: An unplanned interruption to an IT service, or reduction in the quality of an IT service. The failure of a configuration item that has not yet impacted service is also an incident; for example, the failure of one disk from a mirror set.

Incident management process (IMP): The process of managing the lifecycle of all incidents. The primary purpose of incident management is to restore normal IT service operation as quickly as possible with the support of a whole organization in place.

Incident record/ticket: A record containing the details of an incident. Each incident record (also known as a ticket) documents the lifecycle of a single incident.

Priority: A category used to identify the relative importance of an incident or change. Priority is used to identify required times for actions to be taken.

Release management: Release management is the process of managing, planning, scheduling, implementing, and controlling a software build through different stages and environments, with the goal of delivering features to customers or end users.

Request for Change (RFC): The Request for Change (or simply Change Request) is a formal request for the implementation of a change. The RFC is a precursor to the “Change Record” and contains all information required to approve and execute a change.

Role: A set of responsibilities, activities, and authorities granted to a person or team. Roles are used to assign owners to the various incident management processes, and to define responsibilities for the activities in the detailed process definitions.

Root cause analysis (RCA): RCA is a collective term that describes a wide range of approaches, tools, and techniques used to uncover the causes of incidents. It is performed for every urgent incident, and whenever an incident occurs more than once.

Service Level Agreement (SLA): An agreement between an IT service provider and a customer. The SLA describes the IT service, documents service level targets, and specifies the responsibilities of the IT service provider and the customer.

Severity: A measure of the effect of an incident on business processes.

TAT (Turnaround Time): This is the time taken from when the incident is reported to the time it is resolved and closed. It includes Guaranteed Intervention Time (GIT) and Guaranteed Resolution Time (GRT).

# Key Performance Indicators (KPIs)

Availability rate (Service availability rate): The availability of the whole technical solution to provide the service, measured per DFSP.

Average Incident Closure Duration: Average amount of time between the registration of incidents and their closure.

Average Incident Response Time: The average amount of time (for example, in minutes) between the detection of an incident and the first action taken to repair the incident.

Average Number of Incidents Solved By Service Desk: Average number of incidents solved by Service Desk relative to all open incidents.

Guaranteed Intervention Time (GIT): The time elapsed between the time an incident is reported (for example, an email is sent to the Service Desk tool) and the time that an acknowledgement response is returned to the reporter of the issue.

Guaranteed Resolution Time (GRT): The total time spent by all parties on resolving an issue. (Only time during which the issue status is “In Progress” or “Escalated” counts towards the total. Time spent in the “Pending” or “Closed” status is not taken into account.)
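The GRT rule above can be illustrated with a minimal sketch. It assumes that a ticket exposes its status history as a time-ordered list of `(timestamp, status)` pairs; that data format, the function name, and the sample timestamps are hypothetical and not taken from any particular Service Desk tool.

```python
from datetime import datetime, timedelta

# Statuses whose elapsed time counts towards GRT, per the definition above.
COUNTED_STATUSES = {"In Progress", "Escalated"}

def guaranteed_resolution_time(history, end):
    """Sum the time a ticket spent in counted statuses.

    history: list of (timestamp, status) pairs sorted by timestamp
             (hypothetical format).
    end:     timestamp at which the measurement stops.
    """
    total = timedelta()
    # Pair each status change with the next one; the last status runs until `end`.
    following = history[1:] + [(end, None)]
    for (start, status), (next_start, _) in zip(history, following):
        if status in COUNTED_STATUSES:
            total += next_start - start
    return total

history = [
    (datetime(2024, 1, 1, 9, 0), "In Progress"),
    (datetime(2024, 1, 1, 10, 0), "Pending"),
    (datetime(2024, 1, 1, 12, 0), "Escalated"),
    (datetime(2024, 1, 1, 12, 30), "Closed"),
]
grt = guaranteed_resolution_time(history, end=datetime(2024, 1, 1, 13, 0))
print(grt)  # 1:30:00
```

In this example the two hours spent in “Pending” and the time after closure are excluded, so only the 60 minutes “In Progress” and the 30 minutes “Escalated” are summed.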

Incidents Completed Without Escalation: The percentage (%) of incidents completed within the SLA without any escalation.

Incident Queue Rate: The number of incidents closed, relative to the number of incidents opened in a given time period.

Mean Time Between Failures (MTBF): The average time between repairable failures of a technology product. The metric is used to track both the availability and reliability of an IT service or any other configuration item, to assess whether it can perform its agreed function without interruption. The longer the time between failures, the more reliable the system.
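The definition above does not prescribe a formula. A commonly used one divides the total operational time in an observation window by the number of failures in that window. The sketch below assumes that formula; the function name and the sample figures are illustrative only.

```python
def mean_time_between_failures(observed_hours, downtime_hours, failure_count):
    """Return MTBF in hours: operational time divided by number of failures.

    Assumes the common formula (observed time - downtime) / failures.
    """
    if failure_count == 0:
        raise ValueError("MTBF is undefined when no failures occurred")
    return (observed_hours - downtime_hours) / failure_count

# Illustrative figures: a 30-day window (720 h), 12 h of downtime, 3 failures.
mtbf_hours = mean_time_between_failures(720, 12, 3)
print(mtbf_hours)  # 236.0
```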

Mean Time To Acknowledge (MTTA): The average time it takes from when an alert is triggered to when work begins on the issue. This measures how long it takes an organization to respond to complaints, outages, or incidents across all departments on average. This metric is useful for tracking a team’s responsiveness and an alert system’s effectiveness.
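As an illustration, MTTA can be computed as the mean of the individual acknowledgement delays. The sketch below assumes each alert is available as a pair of timestamps (when it was triggered and when work began); the data format and function name are hypothetical.

```python
from datetime import datetime
from statistics import mean

def mean_time_to_acknowledge(alerts):
    """Average minutes between an alert being triggered and work starting.

    alerts: list of (triggered_at, work_started_at) datetime pairs
            (hypothetical format).
    """
    return mean(
        (started - triggered).total_seconds() / 60
        for triggered, started in alerts
    )

alerts = [
    (datetime(2024, 3, 1, 8, 0), datetime(2024, 3, 1, 8, 5)),     # 5 min
    (datetime(2024, 3, 1, 14, 0), datetime(2024, 3, 1, 14, 15)),  # 15 min
    (datetime(2024, 3, 2, 9, 30), datetime(2024, 3, 2, 9, 40)),   # 10 min
]
mtta_minutes = mean_time_to_acknowledge(alerts)
print(mtta_minutes)  # 10.0
```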

Mean Time To Detect (MTTD) – “Proactive actions”: The time between the onset of a revenue-impacting event and its detection by the technician, who then initiates a specific action to restore the service to its original state. This is not the same as starting the Mean Time To Repair (MTTR) clock (that is, once the technician receives a ticket). The onset of a revenue-impacting event is almost always recorded at a specific time by a specific piece of equipment. The key is to bring the detection tool into the technician’s environment, and then measure the difference between the event’s timestamp and the technician’s first action indicating recognition of the event.

Mean Time To Failure (MTTF): The average time between non-repairable failures of a technology product (mostly hardware).

Mean Time To Repair (MTTR): The average amount of time required to repair a system and restore it to full functionality.

The MTTR clock starts when repairs begin and runs until operations are restored. It covers repair time, the testing period, and the return to normal operating conditions.

Mean Time To Recovery: A measure of the time from the point at which the failure is first discovered until the point at which the service returns to operation. In addition to repair time, the testing period, and the return to normal operating conditions, it therefore captures failure notification time and diagnosis.
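The difference between Mean Time To Repair and Mean Time To Recovery comes down to where the clock starts. The following sketch makes this concrete, assuming each incident record carries three timestamps; the field names (`discovered`, `repair_started`, `restored`) and the sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records:
#   discovered     - failure first discovered
#   repair_started - repairs begin
#   restored       - service back in normal operation
incidents = [
    {"discovered": datetime(2024, 5, 1, 10, 0),
     "repair_started": datetime(2024, 5, 1, 10, 30),
     "restored": datetime(2024, 5, 1, 12, 0)},
    {"discovered": datetime(2024, 5, 3, 22, 0),
     "repair_started": datetime(2024, 5, 3, 22, 10),
     "restored": datetime(2024, 5, 3, 23, 0)},
]

def mean_minutes(records, start_field, end_field="restored"):
    """Average number of minutes between two timestamp fields."""
    return mean(
        (r[end_field] - r[start_field]).total_seconds() / 60 for r in records
    )

# Repair clock: from start of repairs until restoration.
mean_time_to_repair = mean_minutes(incidents, "repair_started")
# Recovery clock: from discovery until restoration (adds notification and diagnosis).
mean_time_to_recovery = mean_minutes(incidents, "discovered")

print(mean_time_to_repair)    # 70.0
print(mean_time_to_recovery)  # 90.0
```

The 20-minute gap between the two results is the average time spent on notification and diagnosis before repairs began.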

Old Incident Backlog: Number of open incidents older than 28 days (or any other given time frame) relative to all open incidents.
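A minimal sketch of this ratio follows. It assumes that the opening date of every currently open incident is known; the function name, the choice to return a fraction rather than a percentage, and the sample dates are illustrative.

```python
from datetime import date, timedelta

def old_incident_backlog(open_incident_dates, today, max_age_days=28):
    """Fraction of open incidents that are older than max_age_days."""
    if not open_incident_dates:
        return 0.0  # Assumption: report 0 when there are no open incidents.
    cutoff = today - timedelta(days=max_age_days)
    old = sum(1 for opened in open_incident_dates if opened < cutoff)
    return old / len(open_incident_dates)

# Opening dates of the currently open incidents (illustrative).
opened_on = [date(2024, 1, 2), date(2024, 1, 20), date(2024, 2, 10), date(2024, 2, 25)]
backlog = old_incident_backlog(opened_on, today=date(2024, 3, 1))
print(backlog)  # 0.5
```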

Percentage Of Incidents Solved Within Deadline/Target: Number of incidents closed within the allowed duration time frame, relative to the number of all incidents closed in a given time period. A duration time frame is applied to each incident when it is received, and sets a limit on the amount of time available to resolve the incident. The applied duration time frame is derived from the agreements made with the customer about resolving incidents.

Percentage Of Incidents Solved Within SLA Time: Total number of incidents resolved within SLA time, divided by the total number of incidents.
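As a sketch of this calculation, the example below assumes each incident is represented by its actual resolution time and its SLA target. The data format is hypothetical, and treating an incident resolved exactly on target as within SLA is an assumption.

```python
from datetime import timedelta

def percent_solved_within_sla(incidents):
    """Percentage of incidents resolved within their SLA target.

    incidents: list of (resolution_time, sla_target) timedelta pairs
               (hypothetical format).
    """
    if not incidents:
        return 0.0
    # Assumption: resolving exactly on target counts as within SLA.
    within = sum(1 for taken, target in incidents if taken <= target)
    return 100 * within / len(incidents)

incidents = [
    (timedelta(hours=2), timedelta(hours=4)),    # within
    (timedelta(hours=5), timedelta(hours=4)),    # breached
    (timedelta(hours=20), timedelta(hours=24)),  # within
    (timedelta(hours=1), timedelta(hours=1)),    # exactly on target
]
pct_within_sla = percent_solved_within_sla(incidents)
print(pct_within_sla)  # 75.0
```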

Percentage Of Outage Due To Incidents: Percentage of outage (unavailability) due to incidents, relative to the service hours.

Percentage Of Overdue Incidents: Number of overdue incidents (not closed and not solved within the established time frame) relative to the number of open (not closed) incidents.

Percentage Of Repeated Incidents: Percentage of incidents that can be classified as a repeat incident, relative to all reported incidents within the measurement period. A repeat incident is an incident that has already occurred (multiple times) in the measurement period.
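How an incident is recognised as a repeat depends on how incidents are classified. The sketch below assumes each reported incident carries a classification key (called a signature here, a hypothetical name) and counts every occurrence after the first of a given signature as a repeat. Other counting rules are possible.

```python
from collections import Counter

def percent_repeated_incidents(signatures):
    """Percentage of reported incidents that repeat an earlier one.

    signatures: one classification key per incident reported in the
                measurement period (hypothetical format).
    """
    if not signatures:
        return 0.0
    counts = Counter(signatures)
    # Assumption: each occurrence after the first of a signature is a repeat.
    repeats = sum(n - 1 for n in counts.values())
    return 100 * repeats / len(signatures)

signatures = ["disk-full", "login-timeout", "disk-full", "cert-expired", "disk-full"]
pct_repeated = percent_repeated_incidents(signatures)
print(pct_repeated)  # 40.0
```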
