About The Role
At Qualcomm, we offer flexible work options tailored to our employees' needs. These include a combination of working from home and working in our brand-new, state-of-the-art office in Penrose Dock, Cork.
Well-being and work-life balance are fundamental to us as an employer. We recognize that employees have missed spending quality time with loved ones and extended family.
We are seeking a highly skilled Technical Support Engineer specializing in Machine Learning (ML) operations, Kubernetes, container technologies, and Run:AI.
Key Responsibilities
* Kubernetes Orchestration & Resource Management: Serve as the subject matter expert for Kubernetes and container orchestration. Guide customers through the design and deployment of Kubernetes clusters tailored for AI/ML use cases, helping them effectively manage workloads through Run:AI.
* Cluster Monitoring & Optimization: Monitor and tune Kubernetes clusters so they remain optimized for AI/ML workloads. Support customers in managing Kubernetes autoscaling, resource quotas, and performance monitoring of distributed ML models running on Kubernetes clusters via the Run:AI platform (see the sketch after this list).
* GPU Troubleshooting & Incident Response: Diagnose and resolve complex issues involving GPU driver and software dependencies, Nvidia toolkit errors, or GPU component failures.
* Run:AI Platform Support: Provide expert support for the Run:AI platform, assisting customers with the deployment, configuration, and management of Kubernetes clusters that handle AI/ML workloads.
* Workload Optimization on Kubernetes: Assist customers in optimizing dynamic resource allocation for their AI/ML workloads using the Run:AI scheduler in conjunction with Kubernetes' native tools.
* Kubernetes Troubleshooting & Incident Response: Diagnose and resolve complex issues related to Kubernetes cluster management, ensuring smooth operation across the entire Kubernetes environment.
* Integration Support: Help customers integrate Run:AI into their existing Kubernetes-based ML infrastructure.
* Security and Best Practices in Kubernetes: Advise customers on security best practices for Kubernetes clusters handling sensitive ML workloads.
* Collaboration with HQ: Work closely with the engineering and product teams at HQ, providing feedback on Kubernetes-related issues.
* Training & Documentation: Develop training materials and deliver technical workshops on using Run:AI in Kubernetes environments.
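For illustration only: a minimal sketch of the kind of Kubernetes resource management this role supports, using the official Kubernetes Python client to set a GPU resource quota on a team namespace. The namespace, quota name, and limit values are hypothetical, and Run:AI's own scheduler and tooling are not shown.

```python
# Minimal sketch, assuming the official `kubernetes` Python client and a valid
# kubeconfig. Namespace, quota name, and limits are illustrative placeholders.
from kubernetes import client, config


def apply_gpu_quota(namespace: str = "ml-team-a", gpu_limit: str = "4") -> None:
    """Create a ResourceQuota capping GPU, CPU, and memory requests in a namespace."""
    config.load_kube_config()  # use the current kubeconfig context
    core = client.CoreV1Api()

    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="gpu-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.nvidia.com/gpu": gpu_limit,  # cap GPUs requested by pods
                "requests.cpu": "32",
                "requests.memory": "128Gi",
            }
        ),
    )
    core.create_namespaced_resource_quota(namespace=namespace, body=quota)


if __name__ == "__main__":
    apply_gpu_quota()
```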
Requirements
* 3+ years of experience in technical support roles with strong expertise in Kubernetes administration, container orchestration, and AI/ML workload management.
* 1+ year of experience in general GPU administration.
* In-depth knowledge of Kubernetes (CKA or CKAD certification highly preferred).
* Proficiency in Kubernetes resource management.
* Experience with configuration management tools (Puppet, Chef, Ansible) and Kubernetes management platforms such as Rancher is a plus.
* Experience with Run:AI platform or similar tools for ML workload optimization.
* Hands-on experience with Docker and containerized environments for AI/ML operations.
* Strong understanding of ML frameworks (e.g., TensorFlow, PyTorch).
* Excellent analytical, communication, and problem-solving skills.
* Ability to manage priorities in a fast-paced environment and collaborate within a matrix organization.