Site Reliability Engineer, Infra (Americas)

Role Overview

As a Site Reliability Engineer at Resend, you will design and operate systems that keep the platform reliable, scalable, and observable. This senior-level role involves building automation for recovery and self-healing, improving monitoring with tools like Datadog, and collaborating with engineering teams to define SLOs and incident processes. You will have a direct impact on keeping email delivery fast and reliable for more than 15K customers, with thousands of new users onboarding every day.

Perks & Benefits

This is a 100% remote position with flexible working schedules across time zones in the Americas, and you will have the autonomy to ship solutions. The team values honesty, low-ego collaboration, and ownership of problems, and it works with a modern tech stack. The role emphasizes continuous improvement and building reliable systems.

Full Job Description

Resend is building the most accessible email platform for developers. As we’ve grown to over 15K customers and continue to onboard thousands of new users every day, the challenge of maintaining a reliable, scalable, and observable platform has grown with it.

You’ll design and operate the systems that keep Resend fast, reliable, and self-healing. From monitoring pipelines to automation, you’ll help build the foundation that allows every engineer to move confidently and safely.

In this role you will...

  • Evolve and shape our on-call processes — from detection to resolution

  • Build automation for recovery, scaling, and self-healing systems

  • Improve observability across the stack: logs, metrics, traces, and dashboards

  • Define and track SLOs for core systems like email delivery, API latency, and queue performance

  • Collaborate closely with engineering teams to design for reliability, not just react to incidents

  • Codify playbooks, postmortems, and reliability standards

  • Work with infrastructure spanning AWS, queues, databases, and workers

You will be a perfect fit if you...

  • Bring 5+ years of experience in Site Reliability, Platform, or Infrastructure Engineering

  • Build and enhance backend services that drive user-facing features

  • Have deep experience with Node.js and TypeScript (Express, Hono, Next.js)

  • Have infrastructure and reliability skills (Datadog, AWS, Terraform, CDK)

  • Are fluent in writing and speaking English

  • Have strong experience with observability and monitoring tools (Datadog, Grafana, OpenTelemetry)

  • Understand distributed systems: queues, workers, caching, databases, networking

  • Know how to design systems with safety and fail-safe operations in mind

  • Are comfortable working across the stack — from load balancers to delivery pipelines

  • Care deeply about incident management, postmortems, and continuous improvement

What it means to join the team:

  • Autonomy to "just ship it"

  • 100% remote team with flexible working schedules

  • Modern tech stack

  • Honest and low-ego team

  • Ownership of problems and solutions

About Resend

We are building the modern email sending platform for developers. We care deeply about quality, creating for everyone, and building in the open. We started with an open source project in 2022. Now, we onboard nearly 100 paying customers every day and foster a growing developer community.

Our fully remote team of 28 humans spans 7 countries... and counting. We’re backed by a16z, Y Combinator, Basecase, and other top investors.

Read more about how we work, how we hire, and what we value here.