Senior Manager - Incident Response Engineering

Role Overview

As a Senior Manager of Incident Response Engineering at Confluent, you will lead a team of approximately five incident response engineers, ensuring 24/7 coverage across time zones. Your primary responsibilities include managing high-severity incidents, driving postmortem rigor, and implementing AI-driven improvements in incident response processes, ultimately enhancing customer trust and operational efficiency.

Perks & Benefits

This role offers a fully remote work setup, allowing for collaboration across AMER and APAC time zones. Confluent promotes a culture of belonging and values diverse perspectives, providing opportunities for career growth and development in a supportive environment. Additionally, the company emphasizes a customer-first approach and encourages innovation in incident management practices.

Full Job Description

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen at this scale, they demand experienced, decisive, customer-focused leadership. We're investing in a dedicated Incident Response Engineering team: a small, specialized group that owns incident command and response, postmortems, and customer-facing root cause analysis for our most critical incidents.

As Senior Manager of Incident Response Engineering, you will lead a team of ~5 experienced incident response engineers providing 24/7 coverage across time zones. These are senior technical operators: people with deep systems intuition who can take command of complex, ambiguous situations and drive them to resolution. You own the program end-to-end: the people, the process, the tooling, and the outcomes. You are a player-coach: when the highest-severity incidents demand senior leadership, you step in and run them yourself.

This role requires someone who operates with conviction and autonomy. During a major incident, you are the decision-maker. You set the pace, direct the response, and own customer-facing communications quality. Between incidents, you are building the practices, tooling, and AI-driven capabilities that make each response faster, clearer, and more effective than the last. The right person sees incident response not as firefighting, but as a discipline that compounds in quality when led with rigor and intentionality.

This role sits within Cloud Architecture & Reliability (CAR), a horizontal organization that owns reliability standards, tooling, and operational programs across Confluent Engineering.

What You Will Do:

  • Build and Lead the Team

    - Recruit, hire, and develop a team of senior incident response engineers distributed across AMER and APAC time zones

    - Design sustainable on-call models with follow-the-sun coverage

  • Own Incident Response

    - Provide incident command for high-severity and critical customer-impacting incidents, with your team as the primary rotation and you as the senior escalation point

    - Set and enforce standards for how incidents are run: communications cadence, stakeholder engagement, domain expert coordination, and handoffs

    - Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership from detection through resolution

  • Drive Postmortem Rigor and Customer RCA Quality

    - Own postmortem quality end-to-end: facilitation, root cause analysis, corrective action definition, and ensuring follow-through

    - Manage the Customer Root Cause Analysis (CRCA) program, ensuring timely, technically accurate, clearly written documents that restore customer trust

    - Coordinate upstream technical inputs from engineering teams; synthesize ambiguity into clear, actionable narratives

  • Advance Incident Response Through AI and Automation

    - Drive an AI-centric approach to scaling incident operations, using intelligent tooling to improve triage speed, documentation quality, and pattern detection without sacrificing rigor

    - Partner with the observability, supportability, and resiliency sub-functions within CAR to provide critical inputs into our platform evolution

    - Own and evolve the incident management tooling stack with a bias towards agentic assistance

    - Analyze incident data to identify recurring patterns and feed learnings back into engineering practices

    - When incident load allows, direct your team's capacity toward runbook improvements, automation, and operational hygiene

  • Represent Cross-Functionally

    - Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents

    - Brief engineering leadership and executives during active incidents with clarity and composure

    - Be the person engineering teams proactively seek out when operational standards and incident practices need to improve

What You Will Bring:

  • 10+ years in SRE, incident management, or reliability engineering, with at least 3 years managing teams in this space

  • Proven experience as an incident commander in high-severity, customer-impacting outages at scale. You've personally run incidents that mattered

  • Cloud infrastructure experience with at least one of AWS, GCP, or Azure

  • Deep understanding of distributed systems failure modes (Kafka/event streaming experience preferred, or demonstrated ability to rapidly master complex systems)

  • Strong track record with postmortem facilitation and driving corrective actions to completion

  • Excellent written communication with customers regarding root-cause analysis. You are comfortable stating things with conviction to executive audiences

  • Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response

  • Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment

  • Comfort operating with significant autonomy and making high-stakes decisions under pressure

What Gives You an Edge:

  • Experience with incident response in a multi-cloud context

  • Experience building an incident management function or team from scratch

  • Post-incident review methodologies beyond standard "5 whys" (e.g., Learning from Incidents, resilience engineering)

  • Demonstrated use of AI-assisted tooling to improve operational quality at scale

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.
