About the Role
We are seeking passionate, talented, and inventive engineers to play a pivotal role in the development and maintenance of industry-leading multi-modal and multi-lingual large language models (LLMs).
The Artificial General Intelligence (AGI) team's mission is to leverage our hyper-scalable, general-purpose large model training and inference systems to develop and deploy cutting-edge sensory AI foundational models that revolutionize machine perception, interpretation, and interaction with humans and with the physical world.
Our Culture
We believe in the importance of sharing learning experiences from the front line with the development teams. Our culture empowers Amazonians to deliver the best results for our customers. We have a strong focus on process and methodical improvement, making it an ideal environment for those who love mastering a domain and going deep, juggling multiple tasks, or keeping their head down and coding to support the team.
Key Responsibilities
* Provide support for cluster and node management, ensuring smooth operation of LLM infrastructure.
* Continuously improve and automate our cluster, capacity, and maintenance upgrades.
* Develop automation tools for improving operational excellence.
* Work on operations- and maintenance-driven coding projects, primarily in Ruby, Rails, Java, Python, or shell scripts, along with AWS and web technologies.
* Drive company-wide campaigns with Support and Engineering teams through to closure.
* Participate in design and code reviews and identify bottlenecks.
* Troubleshoot and research root causes thoroughly and resolve defects.
BASIC QUALIFICATIONS
* 3+ years of administrative experience in networking, storage systems, and operating systems, along with hands-on systems engineering experience.
* Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, Rust.
* Experience with Linux/Unix.
* Experience with CI/CD pipelines and build processes.
PREFERRED QUALIFICATIONS
* Experience with distributed systems at scale.