Incident Management Engineer, AWS Incident Detection and Response
Description
About the Role
AWS Support is looking for a leader with a strong background in incident management and customer ownership to be there during the moments that matter for our most critical customers. In this role, you will join our team as a Major Incident Manager, providing incident response and account ownership.
Key Responsibilities:
* Drive the resolution of large-scale, customer-impacting incidents as part of a team rotation.
* Drive critical, complex, and sometimes technically challenging customer escalations in collaboration with Engineering Teams.
* Provide critical incident response/management (including leading calls with internal/external participants) for customers' critical workloads.
* Contribute to Problem Records for customers.
* Conduct continuous real-time proactive monitoring of customer metrics.
* Prioritize, manage, and own emerging and developing customer issues from start to finish.
* Monitor and manage communications during high-impact events via relevant channels.
* Collaborate with key stakeholders across AWS to improve the customer experience and develop mechanisms that support operational excellence.
* Lead projects and virtual teams to drive operational improvements.
* Create and review documentation; design/influence new standard operating procedures.
* Identify and troubleshoot recurring platform issues and own projects to drive improvements.
* Mentor peers in your areas of technical and operational strength.
* Perform other duties as required by the organization.
BASIC QUALIFICATIONS
* 1+ year of experience in a similar role.
* 2+ years of experience with virtualization, orchestration, and cloud computing (e.g., hypervisors, VMware, Xen).
* 1+ year of network and operating system support experience.
* Bachelor's degree in computer science or equivalent, or 3+ years of technical support experience.
PREFERRED QUALIFICATIONS
* Experience creating or designing cloud application architectures with a focus on high availability and fault tolerance.
* Experience with data manipulation and/or automation using Python, JavaScript, or shell scripting.
* Effective prioritization and time management skills and an ability to work in ambiguous environments.
* Demonstrated critical thinking and logical problem-solving skills.
* Familiarity with operating or designing distributed architectures, with the ability to correlate system behaviors based on known inter-dependencies.