Requirements

BS, MS, or PhD in Computer Science or Machine Learning.
2+ years of experience in ML evaluations or data curation.
Strong Python and PyTorch development skills.
Experience with LLM agents.
Experience with RL or distributed ML systems.

Skills

Python, PyTorch, LLM, Reinforcement Learning, Machine Learning

About the role

Responsibilities

Design and run evaluations of agentic capabilities, including multi-step reasoning, tool use, long-horizon planning, and safety properties.
Build and harden evaluation harnesses to ensure benchmarks run reliably at scale against training checkpoints.
Source, generate, and curate high-quality agentic training data, such as trajectories and tool-use traces.
Design and scale RL environments and reward signals to improve model performance.
Develop QA frameworks to detect reward hacking, label noise, and data contamination.
Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.
Contribute to technical reports, research publications, and open-source benchmarks and tooling.

Requirements

BS, MS, or PhD in Computer Science, Machine Learning, or a related field.
2+ years of experience with a clear emphasis on ML evaluations or training-data curation.
Strong Python and PyTorch development skills.
Demonstrated experience designing evaluations or curating/generating training datasets.
Hands-on experience using LLM agents in professional or personal projects.

Preferred Qualifications

Experience with reinforcement learning (RL), reward design, or RL environment construction for LLMs.
Background in statistics and experimental design, particularly around signal-to-noise and data contamination.
Experience with large-scale dataset sourcing and managing external data vendors.
Experience building or operating scalable data pipelines and evaluation infrastructure (e.g., Ray).
Experience evaluating or generating data for software-engineering or computer-use agents.
Contributions to published research or open-source ML software.

About the Company

The Institute of Foundation Models (IFM) is a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

Research Scientist, Agentic Data & Benchmarking
