Role Overview
As a Principal AI Engineer at MontyCloud, you will define and drive the technical vision for agentic AI systems powering the next generation of intelligent cloud operations. This role focuses on architecting scalable, production-grade AI systems, establishing engineering standards, mentoring senior engineers, and leading strategic technical initiatives across the organization. You will work at the intersection of AI, cloud infrastructure, and autonomous operations to build systems that are reliable, observable, and capable of operating at enterprise scale.
Key Responsibilities
Technical Leadership & Architecture
Define and own the technical vision for agentic AI systems across the platform
Architect scalable multi-agent systems, orchestration frameworks, MCP server infrastructure, retrieval and memory pipelines, and observability layers
Drive architectural decisions related to MCP and tool ecosystems, AI platform design, and LLMOps infrastructure
Evaluate emerging AI technologies, frameworks, and models to influence engineering and product roadmaps
Create and maintain Architecture Decision Records (ADRs) and technical standards
Engineering & Delivery
Design and develop critical AI platform components and infrastructure
Establish AI engineering best practices and discipline across the organization - design patterns, evaluation practices, prompt engineering, reliability standards, governance, and cost optimization
Lead cross-functional technical initiatives to improve AI system quality, reliability, and scalability
Collaborate with platform, infrastructure, and data engineering teams to embed AI-driven automation into cloud operations workflows
Mentorship & Technical Community
Mentor Lead and Staff AI Engineers through architecture reviews, design discussions, and problem-solving sessions
Conduct rigorous technical reviews of designs, architectures, and major code contributions
Contribute to MontyCloud’s technical brand through technical writing, open-source contributions, or speaking engagements
Innovation & Strategic Impact
Identify opportunities where agentic AI can create significant product or operational improvements
Build prototypes, technical proposals, and proof-of-concepts to validate new ideas
Stay current with advancements in AI research, agentic frameworks, and LLMOps practices
Desired Skills and Requirements
Must Have
Agentic AI & Multi-Agent Systems
Production-grade agentic AI system design and development
Agentic AI System Design & Architecture - Multi-agent architectures and orchestration, agent-to-agent communication, agent memory and planning strategies, tool integration and MCP server design
Agent orchestration frameworks - LangGraph, Strands Agents, CrewAI, AutoGen, or equivalent agentic AI frameworks
LLMOps & AI Platform Engineering
AI Governance & Lifecycle Management - Prompt versioning and governance, evaluation frameworks, regression detection
AI Observability & Monitoring - Output quality monitoring, agent tracing and observability
AI Cost Management - Cost governance for high-scale AI workloads
Cloud & Infrastructure
Cloud AI Platforms & Services - AWS cloud ecosystem, AWS Bedrock, AgentCore
Cloud-Native Infrastructure & Deployment - Cloud-native AI deployments, Kubernetes, Docker
Infrastructure as Code (IaC) - Terraform
Foundation Models & AI Integrations
Foundation model API integration - OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Hugging Face
MCP and AI tool integration architecture
RAG & Knowledge Systems
Retrieval-Augmented Generation (RAG) and Graph-RAG architectures
Embedding strategies, retrieval and reranking systems
Knowledge graph integrations
Technical Leadership and Communication
Cross-team technical influence
Technical communication and documentation
Organization-level engineering ownership
Proactive problem identification and resolution
Good to Have
Domain Experience
AI systems for cloud operations and infrastructure automation
Developer tooling platforms
AI Deployment & Optimization
Serverless AI deployment patterns
AI inference cost optimization
Advanced AI Techniques Exposure
Model fine-tuning and RLHF
Advanced model evaluation techniques
Industry & Community Exposure
Experience at an AI-first or cloud-native product company
Open-source contributions, technical blogs, conference talks, or published research in AI/agentic systems
Experience
12+ years of overall software engineering experience
Prior experience in a Principal Engineer role or equivalent individual contributor (IC) role
Significant recent hands-on experience building and deploying applied AI systems in production environments
Proven track record of leading large-scale technical initiatives across multiple teams or product areas
Demonstrated expertise in architecting enterprise-scale AI platforms and cloud-native AI workloads
Experience mentoring senior engineers and influencing technical strategy at an organizational level
Education
Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, Machine Learning, Engineering, or a related technical discipline
Equivalent practical experience in advanced AI system design and distributed cloud platforms may also be considered