At the heart of high availability for Amazon Web Services, AWS Incident Response plays a crucial role. Our team works to make customer-impacting events shorter and less frequent by providing large-scale event and incident management. Automated tooling quickly identifies issues, helping to mitigate their impact. Engineers spend most of their time on projects that improve this tooling and automation.
As a Support Engineer on our team, you will lead projects and build processes to reduce issue duration, frequency, and impact within AWS and Amazon infrastructure. You'll also direct the resolution of high-visibility incidents, working over conference calls with teams across the globe. Using data from these incidents, you'll drive improvements into our automation, tooling, and processes to prevent future issues or shorten their duration.
Key Responsibilities
1. Critical Issue Resolution and Call Management: Act as the primary point of contact in a team rotation for customer-impacting issues. Monitor performance graphs, drive resolution calls with service team members, and page additional engineers as needed until the root cause is identified. The rotation may include some weekends and holidays.
2. Root Cause Analysis and Prevention: Identify and analyze recurring platform issues, leading projects to address root causes and implement long-term preventative measures.
3. Automation and Efficiency Projects: Apply scripting and automation skills to projects that improve team efficiency and operational excellence, reducing manual work and streamlining incident resolution processes.
4. Documentation and SOP Development: Design, create, and review documentation, including new standard operating procedures, to improve knowledge sharing and incident response speed.
5. Mentorship and Knowledge Sharing: Provide mentorship to peers in technical troubleshooting and incident management best practices.
6. Global Project Leadership: Lead cross-functional, global project teams to implement operational improvements and automation initiatives.
A Day in the Life
As a Support Engineer, you have full visibility into all AWS services, which gives you limitless opportunities to learn. You work closely with internal AWS teams and gain insight into all AWS products and services.
When on call, we provide incident management capabilities through conference calls and automation, supporting internal AWS teams during the response, diagnosis, and mitigation of large-scale events.
When not on call, we build processes and automation to help AWS experience fewer, shorter, and smaller customer-impacting incidents.
About the Team
The AWS Incident Response (AIR) team is Amazon's central defense against large-scale incidents and drives operational excellence across all of Amazon's businesses. Our key offering to Amazon is best-in-class Incident Management. Our engineers are front-and-center in driving down event duration through experience in operational excellence, current best practices, and incident management tooling.
AWS values diverse experiences. Even if you don't meet all of the qualifications and skills listed in the job description, we encourage you to apply. If your career is just starting, hasn't followed a traditional path, or includes alternative experiences, don't let it stop you from applying.
Why AWS?
Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that's why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses.
Inclusive Team Culture
Here at AWS, it's in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empowers us to be proud of our differences. Ongoing events and learning experiences inspire us to never stop embracing our uniqueness.
We're continuously raising our performance bar as we strive to become Earth's Best Employer. That's why you'll find endless knowledge-sharing, mentorship, and other career-advancing resources here to help you develop into a better-rounded professional.
Work/Life Balance
We value work-life harmony. Achieving success at work should never require sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there's nothing we can't achieve in the cloud.
Requirements
* Technical Troubleshooting and Debugging: Proven experience in troubleshooting and resolving complex technical systems issues.
* Analytical Documentation Skills: Experience in documenting technical findings and analysis.
* Scripting Knowledge: Practical programming ability with at least one scripting language (e.g., Python, Shell Script, PowerShell, or Ruby) to automate routine tasks and improve efficiency.
* Technical Support Background: 3+ years of experience in technical support, incident response, or a related field.
* Advanced Monitoring and Observability Skills: Experience with monitoring tools (e.g., CloudWatch, Datadog, Prometheus) for proactively identifying and resolving performance issues.
* Expertise in Incident Management and Call Facilitation: Demonstrated experience managing high-stakes, multi-participant incident calls, with the ability to communicate clearly and organize on-call team members effectively.
* CI/CD and Process Automation: Familiarity with CI/CD pipelines and automation best practices to continuously improve the team's deployment and incident management workflows.
* Collaboration and Cross-Team Communication: Strong skills in collaborating across technical teams, documenting incidents, and sharing findings with both technical and non-technical stakeholders to foster operational transparency.