Own the design, development, and rollout of scalable software components that enable the deployment of AI and ML infrastructure.
Troubleshoot complex distributed system issues across the stack (hardware, kernel, network); build the automation, tooling, and telemetry needed to turn operational findings into permanent software fixes and improved SLOs.
Collaborate closely with Hardware, Networking, Storage, CE, Product, and other partner teams to define requirements and deliver high-quality solutions.
Lead code reviews, drive engineering best practices (testing, release safety), and mentor junior engineers to help grow the technical capability of the team.
Contribute to the team's technical roadmap by identifying infrastructure gaps and proposing architectural improvements to support future growth.
Minimum qualifications:
Bachelor’s degree or equivalent practical experience.
5 years of experience with software development in one or more programming languages.
3 years of experience testing, maintaining, or launching software products, and 1 year of experience with software design and architecture.
3 years of experience with ML infrastructure (e.g., model deployment, model evaluation, optimization, data processing, debugging).
Experience with distributed computing, infrastructure as code, infrastructure as a service, and system design.
Preferred qualifications:
Master's degree or PhD in Computer Science or related technical field.
5 years of experience with data structures and algorithms.
3 years of experience developing large-scale infrastructure, distributed systems, or networks, or experience with compute technologies, storage, or hardware architecture.
Experience as a software engineer.
Experience with GCP or another cloud provider, or with another data center management stack.
Knowledge of three or more of the following areas: APIs and services, distributed systems, tools, testing infrastructure, and monitoring infrastructure.