Responsibilities:
1. Triage and investigate live incidents
2. Execute technical return-to-service actions in a fast-paced distributed systems environment, specifically microservices, to quickly restore service and protect player experience
3. Monitor the health of Riot’s distributed services using observability tools, and identify gaps in alerting, runbook steps, processes, or tools
4. Execute and maintain runbooks to keep documentation up to date
5. Onboard new team members
6. Provide support and coordination during major launches, events, and release deployments
7. Contribute, with some guidance, to project work developing automation scripts, utilities, and new processes that continuously improve the incident management process
8. Document incident response details as needed to identify problems and improve overall incident management and response
9. Participate in post-incident RCA meetings as required
Required Qualifications:
10. Computer Science/IT Systems/Information Technology diploma or equivalent
11. 2+ years in a Service Reliability Administration or equivalent role (System Analyst, System Administrator/Engineer, Live Operations, Network Administrator, NOC Engineer, etc.)
12. Experience with incident management and a good understanding of ITIL processes
13. Familiarity with the core concepts of operating systems, networking, SDLC and Agile methodologies
14. Good troubleshooting skills, including triaging incidents in a high-capacity, high-availability, and highly distributed environment
15. Experience with the following tools/platforms:
16. Monitoring solutions, e.g., Datadog, New Relic, Nagios, Elasticsearch, Grafana
17. Event management tools, e.g., BigPanda, Moogsoft
18. ITIL-based ticketing systems, e.g., ServiceNow, Jira
Desired Qualifications:
19. Computer Science/IT Systems/Information Technology degree or equivalent
20. Understanding of relational databases such as MySQL and of CI/CD pipelines, especially Jenkins
21. Experience working on deployments in a live environment
22. Experience working in container-based ecosystems such as Docker and with a container scheduler such as Kubernetes, Amazon EKS/ECS, or GKE
23. AWS Cloud Services experience, certification, or training, or equivalent; Linux+, Network+, or equivalent certifications
24. Experience building automation scripts, utilities, or jobs using Python, Go, PowerShell, JavaScript, or Bash
25. Familiarity with Site Reliability Engineering (SRE) principles and best practices
For this role, you'll find success through craft expertise, a collaborative spirit, and decision-making that prioritizes your fellow Rioters, who are the customers of your work. Being a dedicated fan of games is not necessary for this position!
Our Perks:
We offer medical and dental plans that cover you, your spouse/domestic partner, and children. Life insurance and short-term and long-term disability coverage are also available. Riot will support your retirement benefits with access to a company pension, provide a travel allowance, and double down on your donations of time and money to non-profit charitable organizations. Balance between work and personal life is encouraged with open paid time off, access to a spectrum of wellbeing activities, and a play fund so you can broaden and deepen your personal relationship with games.
Let's Thrive Together:
Because together we are better
It's our policy to provide equal employment opportunities for all applicants and members of Riot Games, Inc. We know that fresh and varied perspectives will make us better at what we do, so however you identify and whatever background you bring with you, we’re excited to hear from you. Don’t be discouraged if you feel you don’t fully meet every single one of the requirements for a particular role; there’s always room for growth at Riot. If you spot a role that will make you want to jump out of bed in the morning, we are waiting to hear from you!