Role Overview
We are seeking a seasoned Project Director – HPC / AI Infrastructure Deployment to lead large-scale, high-density compute programs involving GPU clusters, HPC workloads, and AI infrastructure. The role demands end-to-end ownership of deploying data center environments with 10+ MW of IT load, delivering high-performance GPU-based compute platforms with modern high-speed networking and storage architectures.
Roles & Responsibilities
Lead and deliver large-scale HPC / AI GPU cluster deployments (e.g., NVIDIA B200 / B300 GPU platforms) within defined timelines and budgets
Drive execution of AI stack deployment (e.g., NVIDIA AI Enterprise, NVAIE) across hybrid, cloud, and on-prem environments
Manage multi-vendor ecosystems including OEMs, SI partners, and hyperscale technology providers
Deploy and scale high-density GPU racks with liquid-cooled or air-cooled thermal strategies
Design and oversee InfiniBand (IB) and high-speed Ethernet networks
Experience with NVIDIA/Mellanox InfiniBand fabrics
Configuration and optimization using UFM (Unified Fabric Manager)
Strong understanding of BCM (Broadcom Ethernet switching) platforms
Architect and implement spine-leaf network topologies for ultra-low-latency AI workloads
Ensure effective integration of storage systems (parallel file systems, NVMe-based storage)
Oversee deployment of Kubernetes-based GPU orchestration platforms
Experience with containerized AI workloads and distributed training clusters
Exposure to NVIDIA AI Enterprise (NVAIE), CUDA, and GPU virtualization frameworks
Manage data center design, build, and repurposing for HPC workloads
Oversee MEP (Mechanical, Electrical, Plumbing) systems implementation
Ensure optimized thermal management (liquid cooling, rear door heat exchangers, immersion cooling where applicable)
Ensure optimized power density (kW/rack) planning
Ensure optimized energy efficiency (PUE optimization)
Establish robust governance frameworks aligned to:
a. HLD/LLD design validation
b. SOP adherence
c. Quality assurance benchmarks
Implement risk mitigation strategies for large-scale deployments (supply chain, OEM dependencies, technology integration risks)
Monitor program milestones and ensure SLA-based deliveries
Drive structured cabling design (fiber-heavy HPC fabric, spine-leaf connectivity)
Qualifications & Experience
B.E/B.Tech in Electrical / Electronics / Computer Science Engineering
15–25 years of experience in data center infrastructure deployment, HPC / AI workload environments, and large-scale IT infrastructure programs
Mandatory / Preferred Certifications
PMP / PRINCE2 (mandatory for program governance)
CDCP / CDCS / CDCPM certifications
Strongly preferred:
NVIDIA AI Infrastructure / DGX / AI Factory certifications
OEM certifications (Dell, HPE, Lenovo HPC systems)