PURPOSE OF THE ROLE
This role sits within the Infrastructure Management Department but is responsible for incidents across the entire business, covering incident and problem management for Teraco's operational sites.
OBJECTIVES
MAIN FUNCTIONS OF THE JOB
Problem Management:
Analyse incidents to identify recurring patterns.
Conduct root cause analysis to understand the underlying causes of problems.
Develop and implement corrective actions to address root causes and prevent future incidents.
Work with relevant teams to implement solutions and updates that prevent similar problems.
Ensure response teams are coordinated and effective in investigating and resolving major, complex problems. (The responsible team will assume incident management responsibility for a given event.)
Collaborate with subject matter experts to resolve complex problems, and track the problem lifecycle from identification to resolution.
Track tickets for all corrective actions and validate that the corrective actions are implemented as required.
Maintain a problem knowledge base and documentation to share learnings across the organisation and enable quicker resolution of similar incidents in the future.
Manage problem resolution bridges, provide timely and clear updates to stakeholders, and document critical action items to drive resolutions.
Own and lead a structured Root Cause Analysis (RCA) process to resolve major incidents and problems.
Facilitate root cause and corrective action plan meetings after the correction has been implemented. Ensure that the responsible managers document incident details and post-incident analysis to learn from events, and that incident reports reflect all root causes, corrections and corrective actions.
Drive teams to document and submit incident reports within the agreed OLA and SLA timeframes.
Act as signatory on all incident reports across the business.
In collaboration with the Client Experience Manager, identify improved reporting formats and templates. Drive consistency across Teraco's operational organisation.
Review incident response plans and procedures, and use data and metrics to identify improvement opportunities.
Incident and Problem Management Framework:
Implement a clear and concise Incident and Problem Management framework to ensure incidents are handled in line with established policies and procedures, and to increase the efficiency of incident response.
Establish a range of root cause analysis techniques and, where required, coach leadership in applying them to drive a culture of effective root cause analysis.
Ensure communication plans are in place and ready for activation during major incidents.
Create a communication and escalation framework to ensure stakeholders are kept up to date on incident status and impact. DCO staff will assume incident management responsibility for a given incident; facilitate communication during incidents to ensure a coordinated response.
Collaborate with the Client Experience Manager on client-impacting incidents to ensure clients' interests are central to Teraco's response, and that there is effective communication with clients.
SKILLS REQUIREMENTS
Strong root cause analysis (RCA) methodology (e.g., 5 Whys, Fishbone diagram, Fault Tree Analysis)
Data analysis and pattern recognition for incident trend identification
Excellent written and verbal communication, especially in high-pressure situations
Experience drafting, reviewing, and communicating incident reports
Ability to facilitate and document cross-functional meetings and corrective action plans
Ability to design and implement incident/problem management frameworks
Continuous improvement mindset to enhance processes and reporting
Ability to lead RCA and post-incident review meetings
Ability to drive accountability in corrective action implementation
QUALIFICATIONS AND EXPERIENCE
Bachelor's degree in a relevant field (e.g., IT, Engineering, Business Management, or similar) preferred, or equivalent experience
Certifications (highly beneficial):
ITIL v3/v4 Foundation or Intermediate Level
RCA/Problem Solving training (e.g., Kepner-Tregoe, Six Sigma Yellow/Green Belt)
Familiarity with ISO standards (especially ISO 27001, ISO 50001 or ISO 9001)
Experience Requirements:
5+ years in incident and/or problem management roles, ideally within data centre, IT infrastructure, telecoms, or similar high-availability environments
Experience in managing major incidents and leading post-mortems
Proven track record of implementing effective corrective and preventive action plans
Familiarity with operational workflows in critical facilities (e.g., infrastructure systems, networks)
Experience collaborating with client-facing and technical teams
Background in managing communication during major service disruptions
Experience working within ITIL or other service management frameworks