Details of the offer

Job Description

SRE Manager

A global travel and tourism company with a presence in Malaysia, this organization provides a wide range of travel services, including vacation packages, flights, and accommodation. It focuses on delivering personalized travel experiences and exceptional customer service to leisure travelers.

The Team Manager of Service Reliability Engineering will be responsible for overseeing a team of 3 to 4 SREs, driving the strategic direction for reliability initiatives, and ensuring the stability and efficiency of our IT infrastructure. This role requires a combination of technical expertise, leadership skills, and a proactive approach to problem-solving and continuous improvement.

Responsibilities:

Leadership and Team Management:
Lead, mentor, and develop a high-performing team of Service Reliability Engineers.
Foster a culture of accountability, collaboration, and continuous learning within the team.
Conduct regular performance reviews and provide constructive feedback to team members.

Strategic Planning and Execution:
Develop and implement strategies for improving system reliability, availability, and performance in alignment with business goals.
Prioritize and manage the team's workload, ensuring that key projects and operational tasks are completed on time and to a high standard.
Collaborate with cross-functional teams to align on reliability goals and ensure seamless integration of reliability practices into the development lifecycle.

Operational Oversight:
Oversee the design, deployment, and maintenance of monitoring, alerting, and incident management systems.
Lead the response to major incidents, ensuring that the root causes are identified and resolved effectively.
Ensure that all systems and services adhere to security best practices and compliance requirements.

Technical Expertise and Guidance:
Provide technical leadership and guidance on complex engineering challenges, helping the team design and implement scalable, fault-tolerant systems.
Drive the adoption of automation, CI/CD pipelines, and infrastructure as code to enhance system reliability and operational efficiency.
Stay current with industry trends and emerging technologies, and advocate for their adoption when appropriate.

Continuous Improvement and Innovation:
Implement and refine processes for continuous monitoring, root cause analysis, and post-incident reviews to drive service improvements.
Identify opportunities to optimize system performance and resource utilization, balancing cost and reliability.
Champion innovation within the team, encouraging the exploration and implementation of new tools, technologies, and methodologies.

Stakeholder Communication:
Act as the primary point of contact for reliability-related issues and initiatives, effectively communicating with senior management and other stakeholders.
Prepare and deliver reports on system reliability, performance metrics, and incident resolution to both technical and non-technical audiences.

Requirements:

Technical Skills:
Strong technical background in systems engineering, cloud infrastructure (AWS, Azure, Google Cloud), and networking.
Proficiency with monitoring tools (e.g., Prometheus, Datadog, ELK Stack), incident management systems, and automation technologies.
Experience with DevOps practices, container orchestration (Docker, Kubernetes), and infrastructure as code (Terraform, Ansible).

Experience:
7+ years of experience in IT operations, systems engineering, or a related field, with at least 3 years in a leadership or management role.
Proven track record of leading teams responsible for the reliability, availability, and performance of critical systems.

Education:
Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent work experience.

Optional:
Certifications in relevant technologies (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator).
Experience in managing distributed systems, large-scale infrastructure, and complex IT environments.
Familiarity with ITIL practices and frameworks.


Nominal Salary: To be agreed

Source: Whatjobs_Ppc

