Perks & benefits

Medical Insurance, Health Insurance

Requirements

Bachelor's degree in Computer Science or equivalent
8+ years building production software
Experience evaluating LLM-powered features
Proficiency in Python, Java, or Go
Experience with agentic frameworks
Experience with CI/CD and test infrastructure

Skills

Python, LLM, RAG

About the role

Responsibilities

Define and lead the discipline of testing AI agents, evaluating LLM behavior, and ensuring the reliability of agentic systems in production.
Design and maintain evaluation pipelines for LLM outputs, agent behavior, tool use, and multi-turn interactions.
Build internal developer tooling and testing workflows to accelerate the development of AI features.
Instrument agentic systems for observability, monitoring for behavioral drift, hallucination rates, and policy adherence.
Lead agentic test strategies, including red-teaming, golden dataset construction, and LLM-as-judge pipelines.
Partner with Security, Platform, and Product teams to embed quality gates into agent development workflows.
Mentor senior and mid-level engineers on evaluation design and AI testing best practices.

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent experience.
8+ years of experience building and operating production software systems.
Demonstrated experience evaluating or testing LLM-powered features or autonomous agents in production.
Proficiency in Python, Java, or Go.
Experience with agentic frameworks (e.g., LangChain, LangGraph, CrewAI, or Anthropic SDK).
Experience designing test infrastructure, CI/CD quality gates, or evaluation pipelines at scale.
Experience with AI-assisted development tools (e.g., Claude Code, Cursor).

Preferred Qualifications

Background in identity verification, fraud detection, or regulated industries.
Familiarity with Anthropic's model evaluation methodology or similar research.
Experience with observability tooling (e.g., Datadog, OpenTelemetry) applied to AI workloads.
Proven track record of building developer platforms or tooling adopted widely across organizations.

About the Company

ID.me is a next-generation digital identity wallet that simplifies how individuals securely prove their identity online. With over 152 million users, ID.me provides streamlined identity verification for federal agencies, state governments, healthcare organizations, and hundreds of consumer brands. We are committed to the mission of "No Identity Left Behind," ensuring everyone has access to a secure digital identity.

Staff Software Engineer - AI Agent Evaluations
