Lead the design, implementation, and testing of reliability-focused improvements to systems and processes. Identify and carry out improvements to automation, monitoring/alerting, and infrastructure.
Create, influence, and review ongoing design, architecture, standards, and methods for services and systems.
Write postmortems and lead incident analysis, with a focus on broad patterns and potential fixes.
Triage, mitigate, and resolve common incidents, and coordinate incident response for complex ones.
Work effectively with other Site Reliability Engineers (SREs), developers, and cross-functional teams. Assist in training new team members on operational procedures and best practices.
Minimum qualifications:
Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
3 years of experience with software development in one or more programming languages.
Experience with one or more of the following languages: C, C++, Java, Python, or Go.
Preferred qualifications:
Master’s degree in Computer Science, Engineering, or a related field.
Experience in analyzing and troubleshooting large-scale distributed systems, cloud computing, and large databases.
Knowledge of database internals and Google infrastructure.