Staff Machine Learning Systems Engineer (MLOps)
Role Overview
The Staff Machine Learning Systems Engineer at Hims & Hers will design, build, and operate the production infrastructure for AI systems, focusing on Kubernetes, CI/CD pipelines, and observability. This senior-level role requires collaboration with ML and product engineers to ensure reliability and security in a regulated healthcare environment. The hire will have a significant impact on patient outcomes by optimizing AI deployment and infrastructure.
Perks & Benefits
Hims & Hers offers a competitive salary and equity compensation, unlimited PTO, and comprehensive health benefits. The company emphasizes a flexible, remote-friendly work environment and supports mental health with quarterly mental health days. Offsite team retreats and a commitment to diversity and inclusion round out the work experience, with an emphasis on ethics and a strong sense of belonging.
Full Job Description
Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. We are redefining healthcare by putting the customer first and delivering access to care that is affordable, accessible, and personal, from diagnosis to treatment to delivery. No two people are the same, so we provide access to personalized care designed for results. By normalizing health & wellness challenges and innovating on their solutions, we’re making better health outcomes easier to achieve.
Hims & Hers is a public company, traded on the NYSE under the ticker symbol “HIMS.” To learn more about the brand and offerings, you can visit hims.com/about and hims.com/how-it-works. For information on the company’s outstanding benefits, culture, and its talent-first flexible/remote work approach, see below and visit www.hims.com/careers-professionals.
About the Role:
We're hiring a Staff ML Systems Engineer to design, build, and operate the production infrastructure that powers AI across Hims & Hers. This is a deeply technical, hands-on infrastructure role focused on the systems underneath AI — the Kubernetes platform, CI/CD and GitOps pipelines, infrastructure-as-code, inference and model-serving infrastructure, and the observability and tracing stack that keeps AI services reliable, debuggable, and compliant in production.
You won't just deploy models — you'll own the machinery that lets every AI team ship and operate safely. You'll own critical systems like our EKS clusters, deployment and autoscaling infrastructure, IAM and secrets management, LLM tracing/observability pipelines (Langfuse, Datadog, OpenTelemetry), and the developer platform that AI and product engineers rely on daily. You'll partner with ML engineers, product engineers, and clinical teams to ensure our AI systems are reliable, observable, secure, and trustworthy in a regulated healthcare environment.
This role is ideal for someone who thinks in systems and infrastructure, cares deeply about reliability, security, and cost, and wants to define how AI runs in production at a company where it directly impacts patient outcomes.
You Will:
Own and scale the AI compute and deployment platform
Own and evolve our containerized deployment platform and related systems for AI workloads, including general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.
Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably.
Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release.
Drive efficiency and cost management across compute, autoscaling, and inference infrastructure.
Build inference and model-serving infrastructure
Operate and scale inference infrastructure and a multi-provider LLM gateway (e.g. Bedrock, Vertex, and other providers), including credentials, rate limits, and failover.
Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.
Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company.
Own observability, tracing, and reliability
Own the LLM/AI observability and tracing stack — provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse) — so AI behavior is auditable and debuggable in production.
Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders.
Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability.
Scale the AI developer platform and CI/CD
Own and improve the monorepo build system and CI/CD pipelines for AI workloads — including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution.
Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily.
Identify and eliminate platform bottlenecks — reducing CI/CD cycle times, build latency, and deployment friction — to improve developer velocity across the Applied AI organization.
Drive security, compliance, and governance at the systems level
Build IAM, OIDC, and secrets management as first-class infrastructure — scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.
Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first.
Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access.
Set technical direction and raise the bar
Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution.
Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering.
Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems.
You Have:
8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering — with at least 3 years focused on ML/AI systems in production.
Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem — autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.
Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.
Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines.
2+ years of experience operating LLM-based systems in production (LLMOps) — inference routing, serving, tracing, and the reliability patterns needed to run them at scale.
Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines.
Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams.
A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping.
Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives.
Strong collaboration skills across engineering, ML, product, security, and clinical teams.
A deep appreciation for safety, privacy, and security — ideally with experience in a regulated domain such as healthcare, fintech, or life sciences.
Nice to Have:
Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing.
Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance.
Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring.
Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute.
Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD.
Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards.
Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services as a stretch growth area.
Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects.
Why Join Us
At Hims & Hers, you'll be part of a small, high-impact team defining how AI infrastructure runs in production for healthcare. The platform you build — compute, deployment, inference, observability, and security — is the foundation that every AI-powered experience depends on. Reliability, security, and developer velocity aren't afterthoughts here; they're the product.
Join us in building the infrastructure that makes healthcare AI smarter, safer, and more trustworthy.
Our Benefits (there are more but here are some highlights):
Competitive salary & equity compensation for full-time roles
Unlimited PTO, company holidays, and quarterly mental health days
Comprehensive health benefits including medical, dental & vision, and parental leave
Employee Stock Purchase Program (ESPP)
401k benefits with employer matching contribution
Offsite team retreats
We are committed to building a workforce that reflects diverse perspectives and prioritizes ethics, wellness, and a strong sense of belonging. If you're excited about this role, we encourage you to apply—even if you're not sure if your background or experience is a perfect match.
Hims considers all qualified applicants for employment, including applicants with arrest or conviction records, in accordance with the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance, the California Fair Chance Act, and any similar state or local fair chance laws.
It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.
Hims & Hers is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, please contact us at accommodations@forhims.com and describe the needed accommodation. Your privacy is important to us, and any information you share will only be used for the legitimate purpose of considering your request for accommodation. Hims & Hers gives consideration to all qualified applicants without regard to any protected status, including disability. Please do not send resumes to this email address.
To learn more about how we collect, use, retain, and disclose Personal Information, please visit our Global Candidate Privacy Statement.