Job Summary
This is a Support Engineer role on the AWS Incident Response team. The successful candidate will lead projects and build processes to reduce the duration, frequency, and impact of issues within the AWS and Amazon infrastructure.
About Us
AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure. We're the people who keep the cloud running.
Key Job Responsibilities
* Critical Issue Resolution and Call Management: Act as the primary point of contact in a team rotation for customer-impacting issues.
* Root Cause Analysis and Prevention: Identify and analyze recurring platform issues, leading projects to address root causes and implement long-term preventative measures.
* Automation and Efficiency Projects: Apply scripting and automation skills to projects that improve team efficiency and operational excellence, reducing manual work and streamlining incident resolution processes.
* Documentation and SOP Development: Design, create, and review documentation, including new standard operating procedures, to improve knowledge sharing and incident response speed.
* Mentorship and Knowledge Sharing: Provide mentorship to peers in technical troubleshooting and incident management best practices.
* Global Project Leadership: Lead cross-functional, global project teams to implement operational improvements and automation initiatives.
Basic Qualifications
* Technical Troubleshooting and Debugging: Proven experience troubleshooting and resolving issues in complex technical systems.
* Analytical Documentation Skills: Experience in documenting technical findings and analysis.
* Scripting Knowledge: Practical programming ability with at least one scripting language (e.g., Python, Shell Script, PowerShell, Ruby) to automate routine tasks and improve efficiency.
* Technical Support Background: 3+ years of experience in technical support, incident response, or a related field.
Preferred Qualifications
* Advanced Monitoring and Observability Skills: Experience with monitoring tools (e.g., CloudWatch, Datadog, Prometheus) for proactively identifying and resolving performance issues.
* Expertise in Incident Management and Call Facilitation: Demonstrated experience managing high-stakes, multi-participant incident calls, with the ability to communicate clearly and coordinate on-call team members effectively.
* CI/CD and Process Automation: Familiarity with CI/CD pipelines and automation best practices to continuously improve the team's deployment and incident management workflows.
* Collaboration and Cross-Team Communication: Strong skills in collaborating across technical teams, documenting incidents, and sharing findings with both technical and non-technical stakeholders to foster operational transparency.