Machine Learning Engineer — Inference Optimization


Role Overview

This senior-level Machine Learning Engineer role focuses on optimizing inference performance for large-scale ML models in production. Day-to-day responsibilities include profiling GPU/CPU pipelines, implementing techniques such as quantization and speculative decoding, and building and maintaining inference-serving systems (for example, on Triton or custom runtimes). The hire will own performance-critical systems and work with research and infra teams to improve product reliability and cost efficiency in a fast-moving startup environment.

Perks & Benefits

The role is fully remote and offers competitive compensation plus meaningful equity at Series A. It provides real ownership over performance-critical systems, direct impact on product reliability and unit economics, and close collaboration with research, infra, and product teams. The culture emphasizes engineering quality over hype, with opportunities to grow through hands-on technical work in a setting that involves ambiguity.

Full Job Description

About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do

  • Optimize inference latency, throughput, and cost for large-scale ML models in production

  • Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, I/O)

  • Implement and tune techniques such as:

    • Quantization and reduced-precision inference (fp16, bf16, int8, fp8)

    • KV-cache optimization & reuse

    • Speculative decoding, batching, and streaming

    • Model pruning or architectural simplifications for inference

  • Collaborate with research engineers to productionize new model architectures

  • Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)

  • Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups

  • Improve system reliability, observability, and cost efficiency under real workloads

What We’re Looking For

  • Strong experience in ML inference optimization or high-performance ML systems

  • Solid understanding of deep learning internals (attention, memory layout, compute graphs)

  • Hands-on experience with PyTorch (or similar) and model deployment

  • Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)

  • Experience scaling inference for real users (not just research benchmarks)

  • Comfort with ownership and ambiguity in fast-moving startup environments

Nice to Have

  • Experience with LLM or long-context model inference

  • Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)

  • Experience optimizing across different hardware vendors

  • Open-source contributions in ML systems or inference tooling

  • Background in distributed systems or low-latency services

Why Join Us

  • Real ownership over performance-critical systems

  • Direct impact on product reliability and unit economics

  • Close collaboration with research, infra, and product

  • Competitive compensation + meaningful equity at Series A

  • A team that cares about engineering quality, not hype
