Requirements

Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, Mathematics, or a related field
8+ years of large-scale software engineering experience
Proficiency in Python
Advanced proficiency in C++
Experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM)
Hands-on experience with PyTorch

Skills

Python, C++, PyTorch, LLM, Machine Learning

About the role

Responsibilities

Lead and provide technical guidance to a team of software engineers working on complex machine learning integration projects.
Design and implement ML features such as structured outputs, biased sampling, and predicted outputs to improve generative AI performance.
Architect and implement high-throughput, low-latency multimodal inference models supporting image, audio, and video.
Maintain and scale the serving backend to handle high volumes of concurrent requests.
Optimize software to accelerate LLM inference, focusing on latency, throughput, memory usage, and compute efficiency.
Implement detailed observability throughout the stack to scale inference services.
Lead cross-functional initiatives to deliver high-quality inference solutions and manage technical debt.
Build and maintain robust automated test suites to ensure software quality and reliability.

Requirements

Bachelor’s, Master’s, or PhD in Computer Science, Computer Engineering, Mathematics, or a related field.
8+ years of experience in large-scale software engineering, specifically focused on deep learning or related domains.
Advanced proficiency in C++, including multi-threaded programming and performance optimization.
Proficiency in Python for building and maintaining scalable systems.
Hands-on experience with ML frameworks such as PyTorch and a strong understanding of their architectures.
Experience building and scaling large-scale inference systems for LLMs or multimodal models.
Familiarity with LLM serving frameworks like vLLM, SGLang, or TensorRT-LLM.

About the Company

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, delivering industry-leading training and inference speeds. Cerebras Inference offers the fastest Generative AI inference solution in the world, transforming the user experience of AI applications through unprecedented performance and scalability.

Staff Inference ML Runtime Engineer
