The Squarespace Incident Response & Observability team is looking for a Senior Software Engineer to lead automation and experimentation efforts for detection, monitoring, and mitigation across Squarespace-powered systems. The goal is to protect our customers from product and service degradations, incidents, and outages, and to give our engineers a self-service observability toolkit that provides insight into our tech ecosystem and equips them to detect and triage issues.
Our team's mission is to overhaul our Business Continuity program. That means standardizing processes and workflows, mitigating risks, drawing insights from incident trends that affect business-critical uptime metrics, preparing and communicating incident reports for audiences ranging from individual contributors to C-suite executives, measuring incident frequency and volume, downtime costs, and security threats, and improving incident Service Level Agreement (SLA) metrics to meet our contractual commitments.
You will promote an accountability model for performance, availability, and uptime indicators that helps increase resiliency and strengthen incident response.
This is an opportunity to build real-time dashboards that surface all business-critical uptime signals (SLAs, SLIs, and SLOs), powering the mission to unlock 1 million Monthly Active Sellers.
You will work with a diverse group, including Product Engineering, Infrastructure, Platform Engineering, Product Specialists, Customer Operations, Security, Legal, Enterprise, Data Science, Product Analytics, UX, and organizational leaders.
As a Senior Software Engineer, you are empowered to build the foundational layer, including the design, implementation, and maintenance of the systems and tools that guide and improve incident response and observability at Squarespace.
This is a hybrid role working from our Dublin office 3 days per week. You will report to the Engineering Director.
You'll Get To…
1. Develop incident alerts and observability automation, conduct analysis, create health metrics, lead investigations, and provide advisory support. Automate processes such as system and network log analysis to reassemble and replay incident event history for root cause analysis and impact-cost assessment
2. Design and conduct tabletop exercises to ensure organizational readiness for disaster recovery and business continuity
3. Establish processes, build a catalog of playbooks, and implement a strategy for operational incident response that protects our customers and Squarespace
4. Manage and contribute to efforts to build the next-generation Metrics Platform in the cloud
5. Build and refine our observability tools, which support hundreds of engineers every day
6. Refine the Incident Commander process and incident management training
Who We're Looking For
1. BS in Computer Science or Engineering, or equivalent professional experience
2. 8+ years of demonstrated experience as an engineer
3. Proficiency in at least one general-purpose programming or scripting language (e.g., Golang)
4. In-depth technical understanding to assess incident risks and their significance across the broader tech ecosystem
5. Willingness to participate in a regular on-call rotation
Benefits & Perks
1. Health insurance with 100% covered premiums for you and your dependent children
2. Fertility and adoption benefits
3. Headspace mindfulness app subscription
4. Retirement benefits with employer match
5. Flexible paid time off
6. Up to 20 weeks of paid family leave
7. Equity plan for all employees
8. Tax-saving commuter benefit
9. Education reimbursement
10. Employee donation match to community organizations
11. 6 global Employee Resource Groups (ERGs)
12. Free lunch and snacks
13. Close proximity to cultural landmarks such as Dublin Castle and St. Patrick's Cathedral