Role Overview
As a Staff AI Engineer at MontyCloud, you will design, build, and operate production-grade agentic AI systems powering intelligent cloud operations at scale. This role focuses on developing scalable AI agents, orchestration pipelines, and cloud-native AI infrastructure while contributing to engineering standards, reliability, and operational excellence across the platform. You will work at the intersection of AI, cloud infrastructure, and autonomous operations to deliver systems that are reliable, observable, secure, and production-ready.
Key Responsibilities
Engineering & Delivery
Architect, build, and operate production-grade AI agents and multi-agent systems for cloud management use cases
Design and own AI inference and orchestration pipelines optimized for scalability, latency, reliability, and cost efficiency
Build safety and reliability guardrails for autonomous AI systems operating on live cloud infrastructure
Develop human-in-the-loop workflows, rollback strategies, scoped permissions, and audit mechanisms for AI-driven operations
Collaborate with platform, infrastructure, and data engineering teams to embed AI-native automation into cloud management workflows
Standards & Technical Quality
Implement observability and monitoring for agentic systems, including agent tracing, MCP interaction auditing, output quality monitoring, and cost governance
Contribute to engineering standards for agentic design patterns and architectures, MCP server and tool design, prompt engineering, RAG and Graph-RAG pipelines, LLMOps practices, and foundation model integrations
Conduct rigorous technical reviews of AI architectures, systems, and features to improve engineering quality and reliability
Document technical decisions, trade-offs, and implementation patterns clearly for broader engineering adoption
Innovation & Opportunity Identification
Identify opportunities where agentic AI can improve product capabilities or operational efficiency
Build proof-of-concepts and prototypes to validate technical feasibility and scalability
Evaluate emerging AI technologies, LLMs, multi-modal models, and agentic frameworks to assess their suitability for adoption
Stay current with advancements in agentic AI, orchestration frameworks, and production AI engineering practices
Desired Skills and Requirements
Must Have
Agentic AI & Multi-Agent Systems
Production-grade agentic AI system design and development
Agentic AI system design & architecture - multi-agent architectures and orchestration, agent-to-agent communication, agent memory and planning strategies, tool integration and MCP server design
Agent orchestration frameworks - LangGraph, Strands Agents, CrewAI, AutoGen, or equivalent agentic AI frameworks
LLMOps & AI Platform Engineering
AI Governance & Lifecycle Management - prompt versioning and governance, evaluation frameworks, regression detection
AI Observability & Monitoring - output quality monitoring, agent tracing and observability
AI Cost Management - Cost governance for high-scale AI workloads
Cloud & Infrastructure
Cloud AI Platforms & Services - AWS cloud ecosystem, Amazon Bedrock, Amazon Bedrock AgentCore
Cloud-Native Infrastructure & Deployment - Cloud-native AI deployments, Kubernetes, Docker
Infrastructure as Code (IaC) - Terraform
Foundation Models & AI Integrations
Foundation model API integration - OpenAI, Anthropic, Amazon Bedrock, Azure OpenAI, Hugging Face
MCP and AI tool integration architecture
RAG & Knowledge Systems
Retrieval-Augmented Generation (RAG) and Graph-RAG architectures
Embedding strategies, retrieval and reranking systems
Knowledge graph integrations
Technical Leadership and Communication
Cross-functional technical collaboration and engineering ownership
Technical communication and documentation
Architecture reviews and technical decision-making
Proactive problem identification and solution development
Good to Have
Domain Experience
AI systems for cloud operations and infrastructure automation
Developer tooling platforms
AI Deployment & Optimization
Serverless AI deployment patterns
AI inference cost optimization
Advanced AI Techniques Exposure
Model fine-tuning and RLHF
Advanced model evaluation techniques
Multi-modal AI systems
Industry & Community Exposure
Experience working in AI-first or cloud-native product companies
Open-source contributions, technical blogs, conference talks, or published research in AI/agentic systems
Experience
8+ years of overall software engineering experience
Hands-on experience building and operating applied AI systems in production environments
Experience designing and deploying agentic AI systems and orchestration pipelines
Proven ability to lead complex technical implementations across engineering teams
Experience deploying and operating AI workloads in cloud-native environments
Strong engineering judgment across scalability, reliability, security, and operational excellence
Education
Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, Machine Learning, Engineering, or a related technical discipline
Equivalent practical experience in advanced AI system design and distributed cloud platforms may also be considered