Job Description
We are seeking a Site Reliability Engineering Manager to lead our Cloud Service Infrastructure team. As a key member of Apple Services Engineering, you will help support and scale cloud services for millions of Apple users.
Key Responsibilities:
* Act as the Service Owner, designing and mapping key performance indicators to achieve the organization's mission
* Lead the definition of requirements, priorities and planning of engineering deliverables
* Implement structured engineering and operations processes
* Lead the team in daily agile SRE practices, ensuring proper team focus on priorities, achievements, and deliverables
* Optimize velocity and efficiency of delivery, and drive continuous improvement
The successful candidate will have a strong understanding of SRE principles and practices, combined with a track record of resolving issues in a live production environment and implementing strategies to minimize such issues, while driving clear action plans for the team.
Requirements
To be considered for this role, you must have:
* Experience with critical, large-scale distributed systems, combining Hardware, Operating Systems and Software
* Experience building and leading engineering teams; ideally SRE or Production Engineering
* Strong emphasis on SRE as an engineering subject area, with proficiency in at least one of the following languages (Golang, Rust, Python, Swift)
* Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements
* Superb interpersonal skills, capable of working with multi-functional technical and business teams and varying levels of management, influencing decision making
* Bachelor's or Master's degree in Computer Science, Computer Engineering, or equivalent experience
Preferred Qualifications
* Experience working with large bare-metal infrastructure and release management
* Experience with large-scale server provisioning, fleet management and maintenance
* Experience with development within the Kubernetes ecosystem, including the operator framework, controllers and CRDs
* Hardware bootstrap and associated security (PXE, BIOS, TPM, secure boot, trusted computing)
* Automating operations processes via services and tools
* Configuration management and fleet orchestration via Puppet, Chef, Ansible, or others
Estimated Salary: $150,000 - $200,000 per year