Machine Learning Engineer — Training Optimization

This listing is synced directly from the company ATS.

Role Overview

This senior-level Machine Learning Engineer role focuses on optimizing large-scale model training pipelines for speed, stability, and cost, working at the intersection of research and production. Day-to-day responsibilities include improving distributed training strategies, tuning optimizers and precision, and building robust training infrastructure. The hire will have high impact and ownership in a small, highly technical team, directly influencing how fast the company can iterate and scale new models.

Perks & Benefits

The role is fully remote, which likely allows flexible time zones, and offers competitive compensation plus meaningful equity. You'll work on cutting-edge models and training systems at scale at a Series-A company, with real ownership and fast feedback loops. The culture emphasizes engineering quality and research rigor, with opportunities for career growth through high-impact projects.

⚠️ This job was posted over 5 months ago and may no longer be open. We recommend checking the company's site for the latest status.

Full Job Description

About the Role

We’re looking for an ML Engineer focused on training optimization to help us scale and improve large-scale model training. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost—while collaborating closely with researchers pushing model architecture and capability forward.

This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.

What You’ll Do

Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
Improve distributed training strategies (data, model, and pipeline parallelism)
Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
Collaborate with researchers on architecture-aware training strategies
Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
Own training performance metrics and continuously push them forward

What We’re Looking For

Strong experience training large neural networks (LLMs or similarly large models)
Hands-on experience with training optimization (not just model usage)
Solid understanding of:
- Backpropagation, optimization algorithms, and training dynamics
- Distributed systems for ML training
Experience with PyTorch (required)
Comfort working close to hardware (GPUs, memory, networking constraints)
Ability to move fluidly between research ideas and production-ready code

Nice to Have

Experience with large-scale distributed training (multi-node, multi-GPU)
Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
Experience optimizing training on AMD or NVIDIA GPUs
Contributions to open-source ML infrastructure or research codebases
Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Why Join Us

Real ownership at Series-A stage — your work shapes the company’s trajectory
Work on cutting-edge models and training systems at scale
Small, highly technical team with fast feedback loops
Strong emphasis on engineering quality and research rigor
Competitive compensation + meaningful equity

Featherless AI

featherless.ai

Featherless AI specializes in developing lightweight and efficient artificial intelligence solutions tailored for resource-constrained environments. Their typical customers include tech startups, IoT device manufacturers, and enterprises seeking to integrate AI into mobile and edge computing applications. The company's main product is a suite of optimized AI models and tools that reduce computational overhead while maintaining high performance. As a fully remote organization, Featherless AI fosters a distributed work culture that emphasizes asynchronous communication and flexible scheduling to support a global team.

Industry

Artificial Intelligence

Fully remote

23 open positions

About this company (remote work)

Headquarters: Distributed / remote-first
Team style: Async-ish, remote-first

About the job

Posted on: Jan 22, 2026

Location: Remote

Skills

PyTorch, Distributed Training, Training Optimization, DeepSpeed, FSDP, Megatron, GPU Optimization, Model Architecture, Checkpointing, Fault Tolerance
