Machine Learning Engineer — Inference Optimization
Role Overview
This senior-level Machine Learning Engineer role focuses on optimizing inference performance for large-scale ML models in production. Day-to-day responsibilities include profiling GPU/CPU pipelines, implementing techniques such as quantization and speculative decoding, and building inference-serving systems (e.g. Triton or custom runtimes). The hire will own performance-critical systems and collaborate with research and infra teams to improve product reliability and cost efficiency in a fast-paced startup environment.
Perks & Benefits
The role is fully remote, likely with flexible time zones given the startup context, and offers competitive compensation plus meaningful equity at Series A. It provides real ownership of systems, direct impact on product and unit economics, and close collaboration with research, infra, and product teams. The culture emphasizes engineering quality over hype, with career growth driven by hands-on technical work and ownership amid ambiguity.
Full Job Description
About the Role
We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
What You’ll Do
Optimize inference latency, throughput, and cost for large-scale ML models in production
Profile GPU/CPU inference pipelines and eliminate bottlenecks (memory, kernels, batching, IO)
Implement and tune techniques such as the following (see the quantization sketch after this list):
Reduced-precision and quantized inference (fp16, bf16, int8, fp8)
KV-cache optimization & reuse
Speculative decoding, batching, and streaming
Model pruning or architectural simplifications for inference
Collaborate with research engineers to productionize new model architectures
Build and maintain inference-serving systems (e.g. Triton Inference Server or custom runtimes and serving stacks)
Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
Improve system reliability, observability, and cost efficiency under real workloads
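For illustration, here is a minimal sketch of the kind of quantization work involved, assuming a PyTorch stack; the toy two-layer model and tensor shapes are placeholders, not part of this posting, and the dynamic int8 quantization shown targets CPU inference (fp16/bf16 on GPU would typically go through torch.autocast or model.half()).

import torch
import torch.nn as nn

# Toy stand-in for a real transformer block (placeholder, not the production model).
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).eval()

# Dynamically quantize Linear weights to int8; activations are quantized on the fly at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    x = torch.randn(1, 4096)
    print(quantized(x).shape)  # torch.Size([1, 4096])

In practice the trade-off to measure is accuracy loss versus latency and memory savings on the target hardware.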
What We’re Looking For
Strong experience in ML inference optimization or high-performance ML systems
Solid understanding of deep learning internals (attention, memory layout, compute graphs)
Hands-on experience with PyTorch (or similar) and model deployment
Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations); see the profiling sketch after this list
Experience scaling inference for real users (not just research benchmarks)
Comfortable working in fast-moving startup environments with ownership and ambiguity
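As a hedged illustration of the profiling side of the role, the sketch below uses torch.profiler on a placeholder linear layer; the model, shapes, and iteration count are assumptions for demonstration only.

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(2048, 2048).eval()
x = torch.randn(8, 2048)

# Profile CUDA kernels only if a GPU is present; otherwise stay CPU-only.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.inference_mode(), profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Surface the hottest ops/kernels by self time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

A table like this is usually the starting point for deciding whether the bottleneck is kernels, memory movement, or batching.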
Nice to Have
Experience with LLM or long-context model inference
Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton); see the runtime sketch after this list
Experience optimizing across different hardware vendors
Open-source contributions in ML systems or inference tooling
Background in distributed systems or low-latency services
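As a small, hedged example of working with one of these frameworks, the sketch below exports a tiny placeholder model to ONNX and runs it through ONNX Runtime, preferring the CUDA execution provider when available; the model and file name are illustrative assumptions, not artifacts from this posting.

import numpy as np
import torch
import onnxruntime as ort

# Export a tiny stand-in model so the script is self-contained (placeholder, not a real workload).
model = torch.nn.Linear(16, 4).eval()
torch.onnx.export(model, torch.randn(1, 16), "tiny_linear.onnx",
                  input_names=["x"], output_names=["y"])

# Use the CUDA execution provider if this ONNX Runtime build supports it, else fall back to CPU.
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider")
             if p in ort.get_available_providers()]
sess = ort.InferenceSession("tiny_linear.onnx", providers=providers)

outputs = sess.run(["y"], {"x": np.random.randn(1, 16).astype(np.float32)})
print(outputs[0].shape)  # (1, 4)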
Why Join Us
Real ownership over performance-critical systems
Direct impact on product reliability and unit economics
Close collaboration with research, infra, and product
Competitive compensation + meaningful equity at Series A
A team that cares about engineering quality, not hype