Job Requisition ID
24WD83739
Position Overview
Autodesk is seeking a DevOps/Site Reliability Principal Engineer to join the Autodesk Platform Services Team.
The team delivers high-value, exabyte-scale data streaming platforms that power desktop, mobile, and web products, enabling our product teams to build cohesive in-product data experiences and our partners to integrate, search, and expand granular data across all Autodesk products.
Responsibilities
* Lead all aspects of DevOps, CI/CD, Observability, Alerting, and Reliability to create and maintain a reliable, secure, scalable, and highly resilient data streaming platform.
* Lead team-level DevOps and SRE outcomes and initiatives.
* Configure and improve cloud infrastructure for service availability, resiliency, performance, and cost efficiency with increasing load over time.
* Be accountable for SLOs of the services by driving and improving processes including service reviews, fire drills, and HA assessments.
* Create innovative solutions to monitor the health of data streaming apps.
* Keep system updates current for security compliance.
* Engage in technical discussions and technical decision-making.
* Build tools to improve operational efficiency.
* Serve as the primary point of contact responsible for the overall health, performance, and capacity of the data streaming platform across Autodesk.
* Exhibit steadfast leadership and make substantial contributions to large-scale, intricate projects involving collaboration among multiple engineers and cross-functional teams.
* Drive the design, implementation, and management of expanding observability infrastructure while staying up-to-date with new technologies.
* Develop innovative solutions for a resilient data streaming platform at scale.
* Collaborate with software architects, product managers, and software developers to transform high-level SRE and DevOps requirements into incremental enhancements.
* Exhibit ownership of domain/large-scale platforms encompassing end-to-end responsibilities from Engineering Practices, Solutions, Quality, and Deployment to Support.
* Lead sustainable incident response, blameless postmortems, and production improvements resulting in direct business opportunities.
* Automate deployment, scaling, and management of infrastructure using modern DevOps tools and practices.
* Monitor and optimize system performance, troubleshoot issues, and implement solutions.
* Implement and maintain configuration management and infrastructure as code (IaC) using Terraform.
* Define and document best practices across all pillars of DevOps/SRE.
Minimum Qualifications
* BS or MS in Computer Science or related technical field or relevant experience.
* 9+ years of software engineering experience with proven experience in DevOps and SRE accountable for SLOs.
* Hands-on experience working with AWS, specifically S3, Lambda, SQS/SNS, and databases (Aurora, DynamoDB).
* Understanding of, and curiosity about, SRE best practices, architectures, and methods.
* Experience in Continuous Delivery and deployment with Terraform.
* Strong experience in Java, Python, Groovy, and other programming languages.
* Good knowledge of resiliency patterns and cloud security.
* Proficiency in using observability tools such as Grafana, Splunk, Dynatrace, DataDog, OpenTelemetry, or Prometheus.
* Experience with security compliance, such as SOC2.
* Hands-on experience with data streaming, transformation, and ETL technologies.
* Understanding of Apache Flink, Kinesis, Kafka, and Kubernetes.
* Proven capability to lead incident response, drive root cause analysis, and implement preventive measures.
* Expertise in DevOps/SRE practices, including IaC, configuration management, container technologies, microservices, CI/CD processes, etc.
* Strong problem-solving skills and capability to work on complex systems.
* Experience working in an Agile environment.
* Experience working with distributed teams.
Preferred Qualifications
* Passion for running and improving customer-facing systems with a high degree of availability (four nines, 99.99%).
* Experience with databases and database design principles at cloud scale.
* Demonstrated experience leading complex, large-scale cloud and data streaming projects involving multiple teams/functions.
* Passionate about building and motivating engineering teams, with experience driving strategic organizational initiatives.
* Excellent verbal and written communication skills, with experience collaborating in a dispersed, multicultural team to deliver projects and, at times, leading initiatives.
* A perpetual learner who often finds themselves ideating about new and improved ways of doing things and is confident sharing ideas with the rest of the engineering team.
* Mature judgment when making engineering decisions and capable of reliably making calls between elegant and practical solutions.