resu·mail

ML Engineer (Data), Foundational Models

at Sarvam

Bengaluru, India | Mid | Posted 2026-05-21

Don't apply into the void — reach the hiring manager

ResuMail finds the recruiters and hiring managers behind this ML Engineer (Data), Foundational Models role at Sarvam, drafts a personalised outreach email, and schedules the send — so your application actually gets seen.


About this role

About Sarvam

Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Sarvam partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.

About the Role

You will own the data infrastructure that feeds our next family of foundational models. This means building petabyte-scale curation and filtering pipelines, designing the systems that decide what goes into a training run and in what proportion, and treating data quality with the same rigor a research team would treat an architectural choice.

This is not a glue-code role. The data work at a serious pretraining lab is engineering- and research-heavy: deduplication at scale, quality models, contamination detection, mixture design, curriculum and annealing, attribution and debugging. You should care deeply about all of it.

What You’ll Do

  - Design and build large-scale data pipelines for pre-training and post-training (ingestion, parsing, normalization, filtering, deduplication, tokenization, and packing) at petabyte scale.
  - Develop and continually improve quality filtering systems, including model-based quality classifiers and contamination detection.
  - Own data mixture design, curriculum, and annealing strategies in partnership with the research team. The question "what data did this model see, in what proportion, in what order" should always have a precise answer because of work you did.
  - Build the tooling that lets researchers and engineers analyze, slice, attribute, and debug the data.
  - Scale the pipeline to handle multilingual corpora, code, math, multi-source web data, and licensed datasets, while keeping provenance and licensing tracked end-to-end.
  - Partner with the training infrastructure team so that data is never the bottleneck of a production training run.

What We’re Looking For

  - BS or MS in Computer Science or a closely related technical field (or equivalent demonstrated experience).
  - 3+ years of experience building large-scale data systems: petabyte-scale processing, distributed data pipelines, or comparable. Exceptional early-career candidates with a strong systems background will be considered.
  - Hands-on experience with data curation and filtering for LLM training. You should be able to walk through a pre-training corpus you helped build, end to end, and defend the choices that went into it.
  - Deep familiarity with distributed data processing frameworks (Spark, Ray, Beam, Dask, or equivalent) and the storage systems that sit underneath them.
  - Strong Python; comfort with the low-level pieces of the data path (tokenization, sharding, packing, IO patterns) and the performance tradeoffs they imply.
  - Meaningful open-source contributions in the data tooling ecosystem: datasets, dedup libraries, filtering frameworks, or substantive work on widely-used open data releases.

Bonus Points

  - Direct experience building or working with large open pretraining corpora.
  - Work on multilingual data: collection, normalization, quality scoring, and mixing across many languages.
  - Hands-on experience with model-based data quality classifiers, contamination detection, or data attribution research.
  - Familiarity with tokenization research and the practical implications of tokenizer choices on training.
  - First-author papers or technical reports on data curation, quality, or pretraining mixtures.

Why this role?

The frontier is moving towards data being the dominant lever in model quality, and the labs that get this right will define the next generation of models. You will be the person at Sarvam most responsible for that lever.

Why Sarvam?

Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.

  - Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar.
  - High ownership and high impact, from day one.
  - Everything we do is AI-first, from the way we build and ship to the way we think about problems.
  - You can work on problems that could change how an entire country learns, works, and communicates.

If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.

How to get this job at Sarvam

  1. Don't rely on the portal. Cold applications for a role like ML Engineer (Data), Foundational Models land in a pile of hundreds. A direct, personalised message to the hiring manager or a referrer is the fastest way in.
  2. Find the right person. ResuMail surfaces the actual recruiters and hiring managers at Sarvam — not a generic careers inbox.
  3. Send tailored outreach. ResuMail drafts an email personalised to your resume and this role, then paces and schedules sends so you stay out of spam.
  4. Follow up. One polite nudge after 5–7 days roughly doubles reply rates — scheduled for you.

Reach Sarvam's hiring managers today.

Free to start. No credit card. Built for Indian job seekers.
