Job Summary
Key Responsibilities:
Monitoring and Alerting: Implementing and maintaining monitoring systems to track system health and performance, alerting on symptoms rather than just outages.
Incident Response: Responding to and resolving production incidents, troubleshooting across the entire stack, and providing support for product teams.
Automation: Developing and implementing automation to streamline operational tasks, improve efficiency, and reduce manual effort.
Infrastructure Management: Managing and maintaining infrastructure, including platforms
Performance Optimization: Identifying and addressing performance bottlenecks, optimizing existing systems, and contributing to system design and capacity planning.
Collaboration: Working closely with development, operations, and other teams to ensure smooth deployments and efficient operations.
Continuous Improvement: Continuously improving systems and processes through post-incident reviews, documentation, and knowledge sharing.
Proactive Problem Solving: Identifying potential problems before they occur and developing solutions to prevent future issues.
Capacity Planning: Ensuring that systems can handle current and future demands.
Mentoring and Coaching: Sharing knowledge and providing guidance to junior engineers.
Skills and Qualifications: