Senior Manager - Incident Response Engineering

Role Overview

As a Senior Manager of Incident Response Engineering at Confluent, you will lead a team of approximately five incident response engineers, ensuring 24/7 coverage across time zones. Your primary responsibilities include managing high-severity incidents, driving postmortem rigor, and implementing AI-driven improvements in incident response processes, ultimately enhancing customer trust and operational efficiency.

Perks & Benefits

This role offers a fully remote work setup, allowing for collaboration across AMER and APAC time zones. Confluent promotes a culture of belonging and values diverse perspectives, providing opportunities for career growth and development in a supportive environment. Additionally, the company emphasizes a customer-first approach and encourages innovation in incident management practices.

Full Job Description

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen at this scale, they demand experienced, decisive, customer-focused leadership. We're investing in a dedicated Incident Response Engineering team: a small, specialized group that owns incident command and response, postmortems, and customer-facing root cause analysis for our most critical incidents.

As Senior Manager of Incident Response Engineering, you will lead a team of ~5 experienced incident response engineers providing 24/7 coverage across time zones. These are senior technical operators: people with deep systems intuition who can take command of complex, ambiguous situations and drive them to resolution. You own the program end-to-end: the people, the process, the tooling, and the outcomes. You are a player-coach: when the highest-severity incidents demand senior leadership, you step in and run them yourself.

This role requires someone who operates with conviction and autonomy. During a major incident, you are the decision-maker. You set the pace, direct the response, and own customer-facing communications quality. Between incidents, you are building the practices, tooling, and AI-driven capabilities that make each response faster, clearer, and more effective than the last. The right person sees incident response not as firefighting, but as a discipline that compounds in quality when led with rigor and intentionality.

This role sits within Cloud Architecture & Reliability (CAR), a horizontal organization that owns reliability standards, tooling, and operational programs across Confluent Engineering.

What You Will Do:

Build and Lead the Team
- Recruit, hire, and develop a team of senior incident response engineers distributed across AMER and APAC time zones
- Design sustainable on-call models with follow-the-sun coverage

Own Incident Response
- Provide incident command for high-severity and critical customer-impacting incidents, with your team as the primary rotation and you as the senior escalation point
- Set and enforce standards for how incidents are run: communications cadence, stakeholder engagement, domain expert coordination, and handoffs
- Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership from detection through resolution

Drive Postmortem Rigor and Customer RCA Quality
- Own postmortem quality end-to-end: facilitation, root cause analysis, corrective action definition, and ensuring follow-through
- Manage the Customer Root Cause Analysis (CRCA) program, ensuring timely, technically accurate, clearly written documents that restore customer trust
- Coordinate upstream technical inputs from engineering teams; synthesize ambiguity into clear, actionable narratives

Advance Incident Response Through AI and Automation
- Drive an AI-centric approach to scaling incident operations, using intelligent tooling to improve triage speed, documentation quality, and pattern detection without sacrificing rigor
- Partner with observability, supportability, and resiliency sub-functions within CAR to provide critical inputs into our platform evolution
- Own and evolve the incident management tooling stack with a bias towards agentic assistance
- Analyze incident data to identify recurring patterns and feed learnings back into engineering practices
- When incident load allows, direct your team's capacity toward runbook improvements, automation, and operational hygiene

Represent Cross-Functionally
- Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents
- Brief engineering leadership and executives during active incidents with clarity and composure
- Be the person engineering teams proactively seek out when operational standards and incident practices need to improve

What You Will Bring:

- 10+ years in SRE, incident management, or reliability engineering, with at least 5 years managing teams in this space
- Proven experience as an incident commander in high-severity, customer-impacting outages at scale. You've personally run incidents that mattered
- Cloud infrastructure experience across at least one of AWS, GCP, or Azure
- Deep understanding of distributed systems failure modes (Kafka/event streaming experience preferred, or demonstrated ability to rapidly master complex systems)
- Strong track record with postmortem facilitation and driving corrective actions to completion
- Excellent written communication with customers regarding root-cause analysis. You are comfortable stating things with conviction to executive audiences
- Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response
- Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment
- Comfort operating with significant autonomy and making high-stakes decisions under pressure

What Gives You an Edge:

- Experience with incident response in a multi-cloud context
- Experience building an incident management function or team from scratch
- Post-incident review methodologies beyond standard "5 whys" (e.g., Learning from Incidents, resilience engineering)
- Demonstrated use of AI-assisted tooling to improve operational quality at scale

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Confluent

confluent.io

Confluent is a data streaming platform that enables organizations to process and analyze real-time data streams. Their primary product, Confluent Platform, is built around Apache Kafka, allowing businesses to build and manage data pipelines effectively. Typical customers include enterprises in various sectors such as finance, retail, and technology that require real-time data processing. The company fosters a remote-friendly culture, allowing employees to work from anywhere while maintaining strong collaboration through digital tools.

Industry

Technology

Remote-first

435 open positions

About this company (remote-wise)

Headquarters: San Francisco, CA
Typical working hours: Roughly US business hours

About the job

Posted on: Mar 4, 2026

Location: Remote

Skills

Incident Management, Cloud Infrastructure, AWS, GCP, Azure, Distributed Systems, Postmortem Facilitation, AI-driven Tools, Technical Communication, Team Leadership
