Troubleshoot technical issues, evaluate data, and develop recommendations for systems and services within your domain. Respond to tickets within team-defined SLOs, and contribute to systems and services in related domains through bug reports and consultation.
Participate in a limited on-call rotation and in monitoring of production distributed computing infrastructure.
Manage the GCP project environment and VM provisioning service. Ensure workstation health through Terraform debugging and adherence to established turn-up playbooks. Work with customers to define distributed systems requirements and testing procedures, and to propose solutions.
Implement proactive monitoring, canary update processes, and fleet-wide safeguards to prevent systemic failures and accidental destructive actions across the infrastructure. Provide technical troubleshooting for cloud workstations, resolving issues with access, CLI, Puppet configurations, mounts (NFS/Filestore), and disk limitations.
Scale systems sustainably through mechanisms like automation. Evolve systems by pushing for changes that improve reliability and velocity.
Minimum qualifications:
Bachelor's degree in Computer Science or an IT-related field, or equivalent practical experience.
3 years of experience with Linux operating systems internals and administration, technical infrastructure (e.g., deployment, maintenance, troubleshooting), and with reliability of technical infrastructure.
3 years of experience with one or more programming or scripting languages (Go, Python).
Experience managing infrastructure and servers through virtualization, automation and deployments, configuration management, networking, and security.
Preferred qualifications:
3 years of experience in cloud systems design.
Understanding of technology, with the ability to deliver exceptional user experiences.
Excellent communication, problem-solving, and presentation skills.