The Apple Services Engineering team is one of the most exciting examples of Apple’s long-held passion for combining art and technology. Join the Apple Services Engineering Cloud Service Infrastructure team as a Site Reliability Engineering Manager to help support and scale cloud services for millions of Apple users.
We build and support new and existing critical infrastructure systems and frameworks that provide services such as structured and unstructured storage, caching, queueing, and search at hyperscale. These form the platform upon which many iCloud and other backend systems at Apple are built. The team is responsible for the next-generation platform that will power Apple’s infrastructure services. These services operate at extremely large scale and store exabytes of data. The platform will support a variety of services based on open-source software, such as Kubernetes, Cassandra, Zookeeper, Kafka, and Redis, alongside internally developed services.
This is a hands-on role: you will establish SRE practices for a private cloud service and accelerate our ability to reliably and consistently deliver thousands of applications. You will lead a team of Site Reliability Engineers who thrive in a fast-paced workplace, where drive and collaboration are the keys to success!
Description
The Apple Services Engineering Cloud Services SRE organization is looking for a strong, hands-on leader. The leader will lead a platform-focused SRE team and be responsible for the reliability of the platform. The platform serves workloads that provide our organisation and our customers with their favourite applications, services, and tools.
We are domain experts in fleet management, systems, and software engineering. We build automations, instrument reliability tools, and respond to alerts and incidents that may pose a risk to the reliability of the platform. The team’s focus is on infrastructure capabilities and processes, improving the reliability and efficiency of the systems at scale.
Responsibilities include:
* Act as the Service Owner, designing and mapping key performance indicators to achieve the organization’s mission
* Lead the definition of requirements, priorities and planning of engineering deliverables
* Implement structured engineering and operations processes
* Lead the team in daily agile SRE practices, ensuring proper team focus on priorities, achievements, and deliverables
* Optimise velocity and efficiency of delivery, and drive continuous improvement
Success depends on a strong understanding of SRE principles and practices, combined with a track record of resolving issues in a live production environment and implementing strategies to minimize them, while driving clear action plans for the team.
The successful candidate will be highly self-motivated with a passion for excellence, quality, and detail. As a leader, they are responsible for coaching and mentoring their team members, helping them achieve service goals and build career paths in alignment with those goals. It’s imperative for the leader to empower their team by providing appropriate context and timely feedback.
The leader will not only own the service, but will also collaborate with other teams within Apple. They will build trust with stakeholders and partners through diplomacy, discussion, and follow-through. This is a broad, high-visibility, cross-organisation role that involves collaborating with multiple teams. They are expected to invest in and build good relations with key partners. Their collaboration with internal customers, product engineering, and development groups is critical to success.
Minimum Qualifications
* Experience with critical, large-scale distributed systems, combining hardware, operating systems, and software
* Experience building and leading engineering teams, ideally SRE or Production Engineering
* Strong emphasis on SRE as an engineering subject area, with proficiency in at least one of the following languages: Golang, Rust, Python, Swift
* Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements
* Superb interpersonal skills, capable of working with multi-functional technical and business teams and varying levels of management, and of influencing decision making
* Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or equivalent experience
Preferred Qualifications
* Experience working with large bare-metal infrastructure and release management
* Experience with large-scale server provisioning, fleet management, and maintenance
* Experience with development within the Kubernetes ecosystem, including the operator framework, controllers, and CRDs
* Hardware bootstrap and associated security (PXE, BIOS, TPM, secure boot, trusted computing)
* Automating operations processes via services and tools
* Configuration management and fleet orchestration via Puppet, Chef, Ansible, or others