At Guidewire, we make software that offers Property and Casualty (P&C) insurance companies the tools to take care of their customers when they need it the most, whether that’s a time of crisis, a natural disaster, an accident, or exposure to cyber risks. We build the core applications that insurance companies use to sell and underwrite policies, settle claims, and bill their customers. We also have a portfolio of innovative products serving the needs of P&C insurance companies in areas such as data management, digital online portals, and predictive analytics. We run these products on the Guidewire Cloud Platform, and we help hundreds of insurance providers all over the world to handle billions of dollars of business.
We are proud to be voted a Top Cloud Employer on Glassdoor by our own employees and positioned as a market leader by industry experts like Gartner. We have a fun work environment and a culture that lives by our core values of integrity, rationality, and collegiality.
We’re searching for people who are as passionate about working together to deliver quality products and support as we are. Join us and enjoy a career where you can make an impact. You’ll be inspired by those around you, and you’ll be trusted and empowered to go further.
As an Embedded Site Reliability Engineer at Guidewire, you will be a key member of a team dedicated to enhancing the reliability and performance of our cloud-based platform and services. Collaborating closely with platform engineering and product development teams throughout all phases of the software development lifecycle - from design to deployment and operations - you will build, deploy, and maintain highly available, fault-tolerant systems that efficiently serve our customers. Your contributions will directly enhance the reliability of Guidewire's cloud platform and customer-facing products, ensuring they meet essential functional and non-functional requirements, including availability, scalability, and observability.
This role emphasizes collaboration, ownership, and accountability, as you serve as a vital link between product development and operations. You will define and implement Service Level Objectives (SLOs), establish robust monitoring and alerting practices, and mentor peers on all aspects of reliability. If you thrive in a fast-paced environment and are passionate about automating processes, improving system reliability, and tackling complex challenges, we want to hear from you. The ideal candidate embodies the philosophy of "automate everything" and is eager to learn and adapt to new technologies.
Essential Duties and Responsibilities:
* Collaborate with development teams to enhance the reliability and efficiency of the Guidewire Cloud Platform (GWCP) and platform services.
* Partner with our platform engineering teams to support the design and implementation of highly available and fault-tolerant systems, from the early stages of development through to deployment and operations. Serve as the primary SRE liaison and reliability consultant within engineering teams.
* Actively guide and contribute to service and tool development by writing code, automating processes, and improving reliability, while conducting code reviews to ensure best practices for scalability and maintainability.
* Work with teams on infrastructure improvements and system design, optimizing performance and scalability while integrating monitoring and alerting.
* Define and implement SLIs, SLOs, and Error Budgets, ensuring systems adhere to agreed-upon reliability standards.
* Establish and refine observability, monitoring, and alerting practices to ensure systems are operating as expected.
* Ensure services are fully prepared for incident management, leading response efforts and post-incident reviews to identify gaps and drive continuous improvement.
* Advocate for "reliability as a feature", embedding best practices into the development process.
* Conduct production readiness reviews, focusing on performance, monitoring, and fault tolerance, while collaborating with stakeholders on operability requirements.
* Mentor and coach engineers on operational best practices, including capacity planning, disaster recovery, and observability.
* Facilitate blameless postmortems, retrospectives, and technical discussions focused on improving system reliability and availability.
* Create and standardize best practices for monitoring, alerting, incident response, and operational procedures.
* Maintain a centralized repository of operational best practices and documentation, ensuring it is standardized, updated, and accessible for continuous improvement.
* Capture and share lessons learned and optimizations to promote continuous learning across engineering teams.
* Establish efficient feedback loops between product development and SRE, ensuring ongoing knowledge transfer and regular status updates to SRE groups.
* Stay up to date with the latest tools, technologies, and trends in SRE, DevOps, and cloud infrastructure to introduce new ideas and practices.
Required Technical Skills:
* Bachelor’s Degree in Computer Science or a related field, or equivalent demonstrable experience in a relevant technical role.
* Proficiency in software engineering and automation using Bash, Java, Go, and/or Python, including experience with writing unit tests, integration tests, and automated test frameworks to ensure code quality and reliability.
* Strong background in Linux systems engineering and administration.
* Extensive experience working with cloud environments (preferably AWS) for engineering and automation, with multi-cloud experience as a plus.
* Experience developing and/or supporting microservices architecture in production environments at scale.
* Proven expertise in using Infrastructure-as-Code (IaC) tools to automate and manage infrastructure, with experience in technologies like Crossplane, Terraform, or similar.
* Hands-on experience with DevOps/GitOps tools for managing CI/CD pipelines and automating deployments, preferably with tools such as Git, GitHub, ArgoCD, FluxCD, and TeamCity for gate promotion and production readiness.
* In-depth knowledge of container and orchestration technologies such as Docker, Helm, and Kubernetes (EKS), including Kubernetes networking (CNI, Ingress).
* Comprehensive experience with observability tools for logging, metrics, distributed tracing, and performance monitoring, including setting up alerting systems, creating real-time dashboards, and analyzing logs to identify and resolve performance issues.
Desired Technical Skills:
* Familiarity with the Agile software development lifecycle.
* Adept in software development principles such as object-oriented programming, functional programming, and event-driven architectures.
* Proficient in modern software development frameworks and tools for distributed systems and microservices (e.g. Spring Boot, Kubernetes Operators). Expertise with Git and version control strategies for the effective management of large-scale codebases.
* Experience collaborating with or working directly within product development teams on large-scale codebases to embed reliability, scalability, and performance into the software development lifecycle.
* Familiarity with security best practices for cloud environments, including identity and access management (IAM) and data protection strategies.
* Experience with relational databases such as Aurora PostgreSQL and Oracle RDS.
* Strong understanding of Single Sign-On (SSO), SAML, and OAuth (Okta experience is a plus). Experience with X.509 certificates and encryption technologies.
* Advanced knowledge of Web UI design, JSON, and application architecture.
* Familiarity with event-driven and stream-processing systems like Kafka or AWS SQS.
* Understanding of Open Application Model (OAM) systems such as KubeVela or Crossplane.
* Experience managing multi-cluster Kubernetes environments, including workload distribution, scaling, and maintaining consistent configurations across clusters.
Personal Qualities & Soft Skills:
* Exceptional communication skills, capable of clearly articulating technical concepts to diverse audiences, both technical and non-technical.
* Passion for mentoring others and fostering a culture of reliability through cross-team collaboration and knowledge sharing.
* Ability to build relationships and influence stakeholders at all levels, without formal authority, to drive initiatives and foster collaboration.
* Outstanding troubleshooting skills; ability to think critically and display an aptitude for problem solving.
* Strong analytical mind with a penchant for process development and enhancement. A highly positive, can-do attitude and a desire to be a team player.
* Ability to work independently and proactively identify and address challenges while also thriving in collaborative team environments.
* Strong work ethic with a focus on follow-through, consistently meeting commitments and delivering quality results.
Other Requirements:
* Ability to read, write, and speak English.
* We provide 24x7 support to our customers, so we expect you to take turns with your teammates being on call for weekend production emergencies or providing rotating weekend operational support.
* Travel – Expect occasional travel (less than 5%) to other Guidewire offices for training and team meetings.
About Guidewire
Guidewire is the platform P&C