Guardian of Amazon's Global Digital Infrastructure
Central Reliability & Response Engineering (CRRE) is a team within Amazon's Engine organization that combines engineering and impact at massive scale. We ensure millions of customers can shop, stream, and connect without missing a beat - 24/7, across the globe.
About the Role:
* You will build systems that protect Amazon's services across 20+ global marketplaces.
* You will join one of our elite teams that sits at the intersection of innovation and reliability.
* Your code will serve as the backbone of Amazon's operational excellence.
We are looking for exceptional Software Development Engineers to join our dynamic team in Dublin, Ireland - a tech hub that's home to some of Amazon's most critical reliability and resilience engineering initiatives.
Key Job Responsibilities:
* Design and implement large-scale systems processing petabytes of data daily.
* Build and maintain high-quality, thoroughly tested software solutions.
* Create tools and mechanisms that help service teams identify and prevent availability risks.
* Develop real-time monitoring and analysis capabilities.
* Collaborate with teams across Amazon to improve service resilience.
* Participate in on-call rotations to support business-critical systems.
A Day in the Life:
* Work in an agile environment designing and implementing solutions that operate at Amazon scale.
* Build real-time data processing systems that analyze service health.
* Develop mechanisms to surface and prevent reliability risks.
* Create actionable insights that help teams deploy changes safely.
* Collaborate with service teams to implement resilience best practices.
* Contribute to systems that process and analyze logs from thousands of services.
About the Team:
Our team focuses on two areas: Operational Intelligence (OI) and Resilience Insights and Safety Engineering (RISE).
Operational Intelligence (OI):
* Owns Real-Time Log Analysis (RTLA), a critical platform used by thousands of internal customers.
* Helps teams monitor and categorize service errors in real-time.
* Enables root cause analysis within minutes of issues occurring.
* Processes and analyzes massive amounts of log data daily.
Resilience Insights and Safety Engineering (RISE):
* Creates tools to help services maintain availability under any conditions.
* Develops frameworks for assessing and improving service resilience.
* Builds systems to ensure safe deployment of code and configuration changes.
* Provides actionable insights for improving service reliability.
BASIC QUALIFICATIONS:
* Bachelor's degree or equivalent.
* Experience programming with at least one modern language such as Java, C++, or C#, including object-oriented design.
* Experience contributing to the architecture and design (design patterns, reliability, and scaling) of new and current systems.
PREFERRED QUALIFICATIONS:
* Experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations.
* Experience building complex software systems that have been successfully delivered to customers.
* Experience using or building tools in the Observability space, such as log analysis, tracing, or monitoring.