Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Role Overview

This is a Staff Site Reliability Engineer role focused on incident management and reliability improvements. Day to day, you'll spend roughly 75% of your time on hands-on engineering (building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements) and the remaining 25% on teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving incident response practices. You'll work on a global team within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling, helping prevent incidents across a multi-cloud streaming platform.

Perks & Benefits

The role is fully remote and based in Canada. You'll join a global team with follow-the-sun coverage and clean handoffs that keep hours sustainable. The culture is collaborative, with no egos and an emphasis on honest feedback and teamwork, and the role offers scope to drive org-wide process changes and training programs. The listing does not state specific benefits such as health insurance or flexible schedules.

Full Job Description

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage, with clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What You Will Bring:

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
Experience navigating reliability/incident programs at 500+ engineer organizations
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes
Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Privacy Statement

Confluent has been acquired by IBM, is now an IBM subsidiary, and will be integrated into the IBM organization. By proceeding with this application, you understand that Confluent will share your personal information with other IBM affiliates involved in your recruitment process, wherever these are located. More information on how IBM protects your personal information, including the safeguards in case of cross-border data transfer, is available here.

Confluent

confluent.io

Confluent is a data streaming platform that enables organizations to process and analyze real-time data streams. Their primary product, Confluent Platform, is built around Apache Kafka, allowing businesses to build and manage data pipelines effectively. Typical customers include enterprises in various sectors such as finance, retail, and technology that require real-time data processing. The company fosters a remote-friendly culture, allowing employees to work from anywhere while maintaining strong collaboration through digital tools.

Industry: Technology

Remote-first

397 open positions

About this company (remote-wise)

Headquarters: San Francisco, CA
Typical working hours: Roughly US business hours

About the job

Posted on: Jan 23, 2026

Location: Remote

Skills

AWS
GCP
Azure
Kubernetes
Rootly
PagerDuty
CI/CD
Observability
Distributed Systems
Incident Management
