Guardian of Global Digital Infrastructure
About the Role
Imagine building systems that protect Amazon's services across 20+ global marketplaces. That's the scale we operate at. You'll join one of our elite teams, working at the intersection of innovation and reliability, where your code will serve as the backbone of Amazon's operational excellence.
Key Responsibilities
* Design and implement large-scale systems processing petabytes of data daily
* Build and maintain high-quality, thoroughly tested software solutions
* Create tools and mechanisms that help service teams identify and prevent availability risks
* Develop real-time monitoring and analysis capabilities
* Collaborate with teams across Amazon to improve service resilience
* Participate in on-call rotations to support business-critical systems
A Day in the Life
You'll work in an agile environment, designing and implementing solutions that operate at Amazon scale. This could involve:
* Building real-time data processing systems that analyze service health
* Developing mechanisms to surface and prevent reliability risks
* Creating actionable insights that help teams deploy changes safely
* Collaborating with service teams to implement resilience best practices
* Contributing to systems that process and analyze logs from thousands of services
About the Team
You could join one of two specialized teams within Central Reliability & Response Engineering (CRRE):
Operational Intelligence (OI) Team
* Owning Real-Time Log Analysis (RTLA), a critical platform used by thousands of internal customers
* Helping teams monitor and categorize service errors in real-time
* Enabling root cause analysis within minutes of issues occurring
* Processing and analyzing massive amounts of log data daily
Resilience Insights and Safety Engineering (RISE) Team
* Creating tools to help services maintain availability under any conditions
* Developing frameworks for assessing and improving service resilience
* Building systems to ensure safe deployment of code and configuration changes
* Providing actionable insights for improving service reliability
Requirements
* Bachelor's degree or equivalent
* Experience programming with at least one modern language such as Java, C++, or C#, including object-oriented design
* Experience contributing to the architecture and design of new and current systems
Preferred Qualifications
* Experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
* Experience building complex software systems that have been successfully delivered to customers
* Experience using or building tools in the Observability space, such as log analysis, tracing, or monitoring