Job Summary
The company is seeking a passionate and experienced Lead Site Reliability Engineer to join its dynamic Site Reliability Engineering group within Enterprise Infrastructure.
Key Responsibilities
* Define and execute a comprehensive reliability and observability strategy that keeps systems highly available for customers.
* Troubleshoot stack-wide engineering issues across hardware, software, network, applications, and cloud service providers.
* Coach and mentor peer SREs and development teams on building highly available systems.
* Lead production bridges across teams during major incidents and conduct thorough post-mortem reviews.
Requirements
* Bachelor's degree (or higher) in a technology-related field (e.g., Engineering, Computer Science).
* Extensive hands-on experience deploying and supporting highly distributed multi-tiered systems at scale.
* Practical experience with Public Cloud platforms, preferably AWS or Azure.
* Proficiency with EKS, AKS, or Rancher Kubernetes Service for container orchestration.
* Experience with distributed architectures, including microservices, containerized services, and serverless architectures.
* Strong hands-on Kubernetes skills.
* Programming experience in compiled/OOP languages (e.g., C#, Java) and scripting languages (e.g., JavaScript/TypeScript, Python).
* Proven ability to maintain scalability and resiliency in complex environments.
* Familiarity with modern monitoring tools (e.g., Datadog, Prometheus, Splunk).
About the Company
The company values collaboration and continuous improvement. You will work in an environment where your contributions directly affect the reliability of critical systems, and you will have opportunities for professional growth and development in a supportive atmosphere.