Engineering - Software (Information & Communication Technology)
As a Site Reliability Engineer (SRE), you will support and improve the reliability and performance of critical services, bridging development and operations. You'll contribute to building a stable, scalable infrastructure, emphasizing solid system architecture and SRE best practices such as defining Service Level Objectives (SLOs) and Service Level Indicators (SLIs) and reducing manual operational tasks. The position involves collaborating with cross-functional teams to promote continuous learning and accountability while driving reliability improvements.
Responsibilities:
Architect and deploy robust systems designed for high availability and scalability.
Develop automation scripts and tools to streamline operations and minimize manual intervention.
Set, monitor, and analyze SLOs and SLIs to ensure systems align with business performance standards.
Perform in-depth post-incident reviews to identify root causes and implement improvement measures.
Work with development and operations teams to establish reliable system practices and effective incident management strategies.
Diagnose and resolve issues related to databases, network connectivity, and deployments, including underlying platform issues (e.g., Kubernetes, virtual machines).
Ensure adherence to Service Level Agreements (SLAs), maintaining high standards in service delivery.
Identify and resolve system performance bottlenecks, offering recommendations for optimization.
Qualifications:
3 to 8 years of experience in IT and around 1 year of relevant experience in SRE.
Proficient in languages like Python, Golang, or Java, with a focus on improving operational processes.
Proven experience in system design and architecture, emphasizing reliability and scalability.
Strong knowledge of SRE practices, including SLOs, SLIs, toil reduction, and post-incident analysis.
Familiarity with cloud environments (e.g., AWS, Azure, Google Cloud) and their management.
Solid background in Linux system administration.
Skilled in diagnosing and troubleshooting performance and connectivity issues.
Understanding of networking principles and effective troubleshooting strategies.
Excellent analytical skills and a proactive approach to operational challenges.
Able to work independently while collaborating effectively within a team setting.
Preferred Qualifications:
Experience with monitoring tools and performance optimization.
Skilled in scripting or automating administrative tasks.
Knowledge of networking principles and troubleshooting techniques.
Hands-on experience with cloud platforms (e.g., AWS, Azure, Google Cloud).
Familiar with DevOps frameworks and practices, including CI/CD, infrastructure as code, and containerization.