About Huawei
Huawei is a leading global provider of information and communications technology (ICT) infrastructure and smart devices. With integrated solutions across four key domains – telecom networks, IT, smart devices, and cloud services – we are committed to bringing digital to every person, home and organization for a fully connected, intelligent world.
At Huawei, innovation focuses on customer needs. We invest heavily in basic research, concentrating on technological breakthroughs that drive the world forward. We have more than 180,000 employees, and we operate in more than 170 countries and regions.
About the IRC
The mission of the Huawei Ireland Research Centre (IRC) is to position Huawei as a recognized technology leader and a global provider of information and communications technology (ICT) solutions. To achieve this, we are building an industry-recognized, multidisciplinary Research Centre of experts with a focus on medium-term to long-term issues. The IRC will work closely with Huawei customers within an open, innovative ecosystem to address real-world issues. The IRC will also engage with key European universities to build a basic research capability that supports Huawei technical projects.
About the job
Are you a researcher or engineer interested in the challenges of planet-scale cloud infrastructure? We are looking for people who are passionate about working on problems that lie at the intersection of academic research and practical industry implementation. This is a chance to be part of the strategic team that will tackle Cloud Hardware Reliability challenges. This team improves the reliability of cloud infrastructure by architecting and designing new features for future servers.
The Cloud Reliability Lab at the Huawei Ireland Research Centre has a mission to bring world-class reliability to Huawei Cloud by solving cross-functional problems that span Hardware, Software, Networking and Operations. We have teams working in all these areas with a diverse mix of talented people, including industry experts, academic researchers, and Ph.D. interns. In this role, you will collaborate with the local research teams, other European research centres, and engineering teams spread across the globe.
Responsibilities
* Understand and investigate planet-scale technical problems, for example by defining new hardware reliability functionality for globally distributed data centers.
* Architect and help design RAS (reliability, availability, and serviceability) telemetry for datacenter platforms, and develop manageability solutions to monitor and maintain system health.
* Present findings and solutions tailored to the needs of key stakeholders, including engineering teams, senior management, customers, and external partners.
* Gather insights from the cutting edge of industry and academia on GPU hardware development. Help translate customer requirements, feedback, and market dynamics into potential feature requests that ensure a high-reliability fleet.
Requirements
* Ph.D. or Master’s degree in Electrical Engineering, Computer Engineering, or a related field
* Expertise in datacenter telemetry, including scalable solutions for cloud-scale telemetry
* Expertise in server management at scale, including BMC (baseboard management controller) management
* Proven knowledge of server telemetry and manageability protocols such as SMBus, I2C, I3C, Redfish, and SPI