Job Description
As a Site Reliability Engineer (SRE), you will play a crucial role in enhancing the reliability, scalability, and performance of our infrastructure.
This position involves applying core SRE principles to ensure system availability, troubleshooting critical issues, and collaborating across teams to optimize production systems.
Key Responsibilities
* Apply SRE principles and practices (SLIs, SLOs, SLAs) to improve system reliability and eliminate toil.
* Build, maintain, and evolve SLO/SLI baselines for networks, systems, and applications.
* Collaborate with product teams for go/no-go planning, validation, and testing of new services/products.
* Analyze data and ensure the integrity of systems to optimize production performance.
* Troubleshoot and resolve business-affecting issues, working closely with internal teams.
* Implement best practices for system reliability and operational workflows.
* Lead incident response, perform root cause analysis (RCA), and contribute to blameless post-mortems.
Qualifications
* 5+ years of experience with cloud/web/CDN infrastructure.
* Proficiency in Python and Go; C/C++ experience is a plus.
* Strong knowledge of Linux systems and network protocols (TCP, UDP, DNS, TLS/SSL, HTTP).
* Experience with Prometheus, Grafana, GitLab, Jenkins, and CI/CD practices.
* Familiarity with big data technologies (Redis, Elasticsearch, Kafka) and container management (Docker, Kubernetes).
* Strong collaboration, communication, and documentation skills.
This is an excellent opportunity to be part of a rapidly growing company and take a key role in scaling and maintaining mission-critical systems.