Job Description:
We are looking for a knowledgeable and experienced software development engineer to help us succeed in our mission to build systems that ensure AWS customers can rely on the highest-availability, lowest-latency cloud platform on the planet.
The team is responsible for designing and implementing systems that automate fault containment, problem diagnosis, and issue resolution across multiple hugely distributed, always-on architectures.
These systems will take metric and dependency data from multiple sources, analyse it, and correlate it with customer impact to determine the root cause of an issue without human intervention.
They will create engagements and facilitate communication and coordination of the response and mitigation.
As a Software Development Engineer at AWS Incident Response Systems, you will work closely with teams across AWS to drive adoption of the software the team has built, and influence systems development practices for new and existing products.
You will define availability goals for service teams across AWS and the strategies that make these goals attainable with minimal effort.
Within your first year on the AWS Incident Response Systems team, you will have met with senior technical leaders from across AWS and designed and implemented at least one new system. You will also have dived deep into the causes of at least one historical event that impacted external customers, and determined how to prevent a similar event from ever happening again.
Key job responsibilities include writing well-tested, maintainable code; designing, contributing to, and maintaining systems that solve customer problems; working with teammates to improve code quality, system architecture, and team processes; and learning about the incident management processes supported by the team's system to identify improvement opportunities.