Requirements

3–6 years in AI/ML, NLP, or model evaluation
Understanding of LLM architectures and prompt engineering
Hands-on with Ragas, OpenAI Evals, or DeepEval
Proficiency in Python
Experience with LangChain, LangGraph, or LlamaIndex
Experience with vector databases and RAG pipelines

Skills

Python, LLM, NLP, LangChain, RAG

About the role

Responsibilities

Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias
Develop automated systems for benchmarking models on enterprise-relevant tasks
Conduct stress tests, adversarial testing, and edge-case evaluations
Build tools to measure latency, consistency, and error recovery in multi-turn interactions
Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment
Establish real-time monitoring for drift, anomalies, and performance regressions
Partner with ML engineers and product managers to align evaluation with business objectives
Feed evaluation insights into fine-tuning, RLHF/RLAIF pipelines, and model selection

Requirements

3–6 years of experience in AI/ML, NLP, or applied model evaluation
Strong understanding of LLM architectures, prompt engineering, and failure modes
Hands-on experience with evaluation frameworks such as Ragas, OpenAI Evals, or DeepEval
Proficiency in Python and libraries including LangChain, LangGraph, LlamaIndex, or Hugging Face
Experience with vector databases, RAG pipelines, and knowledge graph integration
Familiarity with bias/fairness testing and Responsible AI frameworks

Preferred Qualifications

Experience with reinforcement learning (RLHF, RLAIF) and reward modeling
Exposure to agentic evaluation frameworks and multi-agent stress testing
Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases
Contributions to open-source evaluation libraries or research papers

About the Company

XenonStack is a fast-growing Data and AI Foundry for Agentic Systems. We enable enterprises to gain real-time and intelligent business insights by making AI agents reliable, explainable, and enterprise-ready. Our mission is to accelerate the world’s transition to AI + Human Intelligence through cutting-edge platforms in Vision AI and Inference AI infrastructure.

LLM Reliability & Evaluation Engineer
