Tech Ops Engineer - Incident Management, Central Technical Operations Services (C-TOS)
Amazon is seeking an exceptional Systems Engineer to join our world-class Central Technical Operations Services (C-TOS) team as an Incident Manager. As the first line of defense for maintaining high availability on the Amazon Retail Website, our C-TOS group provides critical incident response and management for the entire Amazon ecosystem. When issues arise that could impact our hundreds of millions of customers worldwide, our skilled Incident Managers spring into action to make events shorter, less frequent, and less severe.
This is immensely important, high-stakes work. The Amazon Retail Website is where we directly engage and delight our global customer base - any disruption can have a real impact on real people. That's why our C-TOS Incident Managers are so vital; leveraging deep operational expertise and the latest incident management tools, they work quickly to mitigate customer-impacting events.
This is an excellent opportunity to join one of Amazon's world-class engineering teams, working alongside some of the best and brightest minds in technology. Our engineers are encouraged to build solutions that enhance our incident management practice, including tooling and processes, as well as fix software problems - and then share those innovations across the organization. You'll have access to mentoring programs, regular tech talks with technical leaders, and well-defined career paths for motivated engineers who want to contribute to our culture of operational excellence and customer-focused innovation.
The C-TOS team is globally distributed, with groups in Austin, Dublin, and Sydney providing 24/7 coverage. Each group works four 10-hour shifts per week.
Key job responsibilities:
1. Serve as a technical evangelist, leveraging deep expertise to devise innovative solutions to complex business problems.
2. Drive down mean time to resolution for incidents through proactive monitoring, rapid response, and continuous process improvement.
3. Design, implement, and optimize world-class event detection, alerting, and incident management systems.
4. Evolve operations management processes and technologies to accommodate Amazon's rapid growth.
5. Create, review, and continuously improve documentation, procedures, and knowledge resources.
6. Identify and resolve recurring platform issues by collaborating cross-functionally with service owners.
7. Provide exceptional customer service by responding to and resolving requests within defined SLAs.
8. Participate in a global "follow the sun" rotation, ensuring 24/7 coverage including weekends and holidays.
9. Contribute to the interviewing and hiring process to build a world-class Incident Management team.
Minimum qualifications:
1. Bachelor's degree in Computer Science, Engineering, or a related technical field; or at least 7 years of relevant experience in a large-scale online operations environment.
2. Fluent written and verbal communication skills in English, with the ability to effectively collaborate cross-functionally.
3. Proficient in scripting and automation using at least one programming language (e.g. Java, Python, Perl) as well as shell scripting.
4. Strong working knowledge of Linux operating systems and networking fundamentals.
5. Proven track record of driving complex, collaborative projects from conception through successful delivery.
6. Experience with incident management, event detection, and operational excellence in a fast-paced, customer-centric environment.
7. Ability to thrive in a geographically distributed, "follow the sun" coverage model, including off-hours and weekend work as needed.
Preferred qualifications:
1. Experience with distributed systems at scale.
2. Experience with Agile software development practices, including Scrum ceremonies and continuous improvement.
3. Background in architecting and supporting large-scale, distributed systems.
4. Track record of effectively leading and managing cross-functional incident response efforts.
5. Deep understanding of network technologies and troubleshooting to rapidly resolve complex issues.
6. Ability to collaborate closely with customers during high-pressure problem resolution, while remaining calm and focused.
7. Excellent prioritization, time management, and organizational skills in a fast-paced environment.
Acknowledgement of country:
In the spirit of reconciliation, Amazon acknowledges the Traditional Custodians of country throughout Australia and their connections to land, sea and community. We pay our respect to their elders past and present and extend that respect to all Aboriginal and Torres Strait Islander peoples today.
IDE statement:
Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer, and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, disability, age, or other legally protected attributes.