Machine Learning Engineer — Inference Optimization
Role Overview
This senior-level Machine Learning Engineer role focuses on optimizing inference performance for large-scale ML models in production. Day-to-day responsibilities include profiling GPU/CPU pipelines, implementing techniques such as quantization and speculative decoding, and building inference-serving systems (e.g. Triton or custom runtimes). The hire will own performance-critical systems and collaborate with research and infra teams to improve product reliability and cost efficiency in a fast-paced startup environment.
Perks & Benefits
The role is fully remote, likely with flexible time zones given the startup context, and offers competitive compensation plus meaningful equity at Series A. It provides real ownership of systems, direct impact on product and unit economics, and close collaboration with research, infra, and product teams. The culture emphasizes engineering quality over hype, with career growth driven by hands-on technical work and ownership amid ambiguity.
Full Job Description
About the Role
We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
What You’ll Do
Optimize inference latency, throughput, and cost for large-scale ML models in production
Profile GPU/CPU inference pipelines and eliminate bottlenecks (memory, kernels, batching, IO)
Implement and tune techniques such as the following (see the quantization sketch after this list):
Reduced-precision and quantized inference (fp16, bf16, int8, fp8)
KV-cache optimization & reuse
Speculative decoding, batching, and streaming
Model pruning or architectural simplifications for inference
Collaborate with research engineers to productionize new model architectures
Build and maintain inference-serving systems (e.g. Triton Inference Server or custom runtimes and serving stacks)
Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
Improve system reliability, observability, and cost efficiency under real workloads
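For illustration, here is a minimal sketch of the kind of quantization work involved, assuming a PyTorch stack; the toy two-layer model and tensor shapes are placeholders, not part of this posting, and the dynamic int8 quantization shown targets CPU inference (fp16/bf16 on GPU would typically go through torch.autocast or model.half()).

import torch
import torch.nn as nn

# Toy stand-in for a real transformer block (placeholder, not the production model).
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).eval()

# Dynamically quantize Linear weights to int8; activations are quantized on the fly at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    x = torch.randn(1, 4096)
    print(quantized(x).shape)  # torch.Size([1, 4096])

In practice the trade-off to measure is accuracy loss versus latency and memory savings on the target hardware.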
What We’re Looking For
Strong experience in ML inference optimization or high-performance ML systems
Solid understanding of deep learning internals (attention, memory layout, compute graphs)
Hands-on experience with PyTorch (or similar) and model deployment
Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations); see the profiling sketch after this list
Experience scaling inference for real users (not just research benchmarks)
Comfortable working in fast-moving startup environments with ownership and ambiguity
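As a hedged illustration of the profiling side of the role, the sketch below uses torch.profiler on a placeholder linear layer; the model, shapes, and iteration count are assumptions for demonstration only.

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(2048, 2048).eval()
x = torch.randn(8, 2048)

# Profile CUDA kernels only if a GPU is present; otherwise stay CPU-only.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.inference_mode(), profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Surface the hottest ops/kernels by self time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

A table like this is usually the starting point for deciding whether the bottleneck is kernels, memory movement, or batching.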
Nice to Have
Experience with LLM or long-context model inference
Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton); see the runtime sketch after this list
Experience optimizing across different hardware vendors
Open-source contributions in ML systems or inference tooling
Background in distributed systems or low-latency services
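As a small, hedged example of working with one of these frameworks, the sketch below exports a tiny placeholder model to ONNX and runs it through ONNX Runtime, preferring the CUDA execution provider when available; the model and file name are illustrative assumptions, not artifacts from this posting.

import numpy as np
import torch
import onnxruntime as ort

# Export a tiny stand-in model so the script is self-contained (placeholder, not a real workload).
model = torch.nn.Linear(16, 4).eval()
torch.onnx.export(model, torch.randn(1, 16), "tiny_linear.onnx",
                  input_names=["x"], output_names=["y"])

# Use the CUDA execution provider if this ONNX Runtime build supports it, else fall back to CPU.
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider")
             if p in ort.get_available_providers()]
sess = ort.InferenceSession("tiny_linear.onnx", providers=providers)

outputs = sess.run(["y"], {"x": np.random.randn(1, 16).astype(np.float32)})
print(outputs[0].shape)  # (1, 4)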
Why Join Us
Real ownership over performance-critical systems
Direct impact on product reliability and unit economics
Close collaboration with research, infra, and product
Competitive compensation + meaningful equity at Series A
A team that cares about engineering quality, not hype