Lead Site Reliability Engineer
We are seeking a passionate and experienced Lead Site Reliability Engineer to join our dynamic Site Reliability Engineering group within Enterprise Infrastructure.
About the Role
This is a permanent position based in Galway, with excellent benefits and career progression opportunities. If you thrive in an environment that combines operational excellence with development experience, this could be the opportunity for you.
Key Responsibilities
* Define and execute a comprehensive reliability and observability strategy that keeps our systems highly available for our customers.
* Troubleshoot stack-wide engineering issues across hardware, software, network, applications, and cloud service providers.
* Coach and mentor peer SREs and development teams on building highly available systems.
* Act as an escalation point during major incidents, taking hands-on responsibility for leading production bridges across teams.
* Conduct thorough post-mortem reviews, focusing on deep technical root cause analysis, observability, and automation enhancements.
Requirements
* Bachelor's degree (or higher) in a technology-related field (e.g., Engineering, Computer Science).
* Extensive hands-on experience deploying and supporting highly distributed multi-tiered systems at scale.
* Practical experience with Public Cloud platforms, preferably AWS or Azure.
* Proficiency with Kubernetes platforms such as EKS, AKS, or Rancher for container orchestration.
* Experience with distributed architectures, including microservices, containerized services, and serverless computing.
* Strong hands-on Kubernetes skills.
* Programming experience in compiled/OOP languages (e.g., C#, Java) and scripting languages (e.g., JavaScript/TypeScript, Python).
* Proven ability to maintain scalability and resiliency in complex environments.
* Familiarity with modern monitoring tools (e.g., Datadog, Prometheus, Splunk).
* Demonstrated technical and operational leadership, including the ability to manage production incidents through to resolution.
What We Offer
* Be part of a vibrant team that values collaboration and continuous improvement.
* Work in an environment where your contributions directly impact the reliability of critical systems.
* Enjoy opportunities for professional growth and development in a supportive atmosphere.