About the job
We’re looking for an exceptional technical leader to define, implement, and champion the adoption of next-generation Reliability KPIs across Huawei Cloud. In this high-impact role, you'll have the opportunity to shape how we measure and incentivize engineering efforts around reliability, ensuring a clear alignment between customer experience and cloud system performance.
This position requires not only deep technical expertise in cloud reliability but also the ability to influence senior leadership (at the SVP and CTO level) and drive large-scale organizational change. If you're passionate about aligning cloud infrastructure with business objectives through data-driven decision-making, this is the challenge you've been waiting for.
The Cloud Reliability Lab at the Huawei Ireland Research Center has a mission to bring world-class reliability to Huawei Cloud by solving cross-functional problems that span hardware, software, networking, monitoring, and operations. We have teams working in all these areas with a diverse mix of people including industry veterans, academic researchers, and Ph.D. student interns. In your role, you will collaborate with the local research teams, other European research centers, and other engineering teams spread across the globe.
Responsibilities
* Lead the definition and evolution of Reliability KPIs across all Huawei Cloud services, ensuring they correlate with real-world customer experiences and organizational objectives.
* Guide the evolution of existing observability platforms in Huawei Cloud to balance the trade-offs between observability coverage, system performance, and operational cost. Build scalable solutions that keep the observability systems themselves highly available.
* Integrate Critical User Journeys (CUJs) into observability systems to ensure the reliability of the most critical customer-facing workflows.
* Evolve incident management practices to align more closely with Reliability KPIs, driving improvements in incident response processes.
* Work cross-functionally with engineering, product, and operations teams to ensure observability and reliability best practices are embedded throughout the software lifecycle, from development to production.
* Drive large-scale organizational change to improve reliability culture across all engineering teams.
Requirements
* 10+ years of experience in cloud infrastructure, with 5+ years in architect or leadership roles.
* Proven expertise in architecting and defining Reliability KPIs in hyperscale cloud environments, ensuring they map directly to business outcomes (customer experience, service uptime, cost optimization).
* Deep understanding of distributed systems development, maintenance, debugging, and the trade-offs involved in building observability solutions at scale.
* Experience in driving organizational change around observability, reliability practices, and incident management.
* Exceptional communication skills, with the ability to align teams around a shared vision for cloud reliability and observability, while also influencing senior leaders to prioritize reliability in business and operational decision-making.
* Optional: Familiarity with the open-source observability tooling ecosystem (Prometheus, Grafana, Thanos, ClickHouse, OpenTelemetry, etc.).