
Posted a day ago
Staff Inference ML Runtime Engineer
Cerebras Systems
Requirements
Bachelor's, Master's, or PhD in CS, CE, Math, or a related field; 8+ years of large-scale software engineering experience; proficiency in Python; advanced proficiency in C++; experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM); hands-on experience with PyTorch
Skills
Python, C++, PyTorch, LLM, Machine Learning
About the role
Responsibilities
- Drive and provide technical guidance to a team of software engineers working on complex machine learning integration projects.
- Design and implement ML features such as structured outputs, biased sampling, and predicted outputs to improve generative AI performance.
- Architect and implement high-throughput, low-latency multimodal inference models supporting image, audio, and video.
- Maintain and scale the serving backend to handle high volumes of concurrent requests.
- Optimize software to accelerate LLM inference, focusing on latency, throughput, memory usage, and compute efficiency.
- Implement detailed observability throughout the stack to scale inference services.
- Lead cross-functional initiatives to deliver high-quality inference solutions and manage technical debt.
- Build and maintain robust automated test suites to ensure software quality and reliability.
Requirements
- Bachelor’s, Master’s, or PhD in Computer Science, Computer Engineering, Mathematics, or a related field.
- 8+ years of experience in large-scale software engineering, specifically focused on deep learning or related domains.
- Advanced proficiency in C++, including multi-threaded programming and performance optimization.
- Proficiency in Python for building and maintaining scalable systems.
- Hands-on experience with ML frameworks such as PyTorch and a strong understanding of their architectures.
- Experience building and scaling large-scale inference systems for LLMs or multimodal models.
- Familiarity with LLM serving frameworks like vLLM, SGLang, or TensorRT-LLM.
About the Company
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, delivering industry-leading training and inference speeds. Cerebras Inference offers the fastest Generative AI inference solution in the world, transforming the user experience of AI applications through unprecedented performance and scalability.
Cerebras Systems · Sunnyvale
