Senior Staff Software Engineer
Role Overview
This is a senior-level Staff Software Engineer role on the Reliability Engineering team at NMI, focused on improving platform reliability, performance, and operational maturity. Day-to-day responsibilities include designing and building reliability frameworks, tooling, and standards, partnering with engineering teams to embed reliability into workflows, and contributing hands-on to production codebases. The hire will drive initiatives to shift from reactive incident response to proactive engineering, impacting uptime and operational confidence across the organization.
Perks & Benefits
No specific location is listed; the job board context implies the role is remote, which offers flexibility in work setup. It includes opportunities for technical leadership, mentorship, and influencing cross-team direction, supporting career growth in a product-oriented environment. Participation in on-call rotations is expected, with a stated mandate to reduce operational load over time, suggesting a culture that values work-life balance and continuous improvement.
Full Job Description
NMI is building a mature, product-oriented Reliability Engineering function, and we’re looking for a Staff Software Engineer to play a key role in that evolution.
This role sits on the Reliability Engineering team, which focuses on improving the reliability, performance, and operational maturity of critical platform services. The team’s mission is to move the engineering organization from reactive incident response toward intentional, engineered reliability through strong systems, tooling, and standards.
As a Staff Engineer, you will operate beyond a single service or codebase, designing and building reliability frameworks, platform capabilities, and guardrails that improve uptime, observability, and operational confidence. This is a highly hands-on role with strong expectations around technical leadership, ownership, and delivery.
Key responsibilities:
Design and build reliability-focused frameworks, tooling, and standards that improve platform uptime, performance, and operational confidence.
Drive initiatives that move reliability from reactive response to proactive engineering, emphasizing prevention, early detection, and fast recovery.
Partner with engineering teams to embed reliability into system design, development practices, and deployment workflows.
Establish and evolve observability practices, including metrics, logging, alerting, and dashboards that enable clear operational insight.
Identify systemic risks and failure patterns, and lead efforts to address them through automation, architectural improvements, and process refinement.
Contribute hands-on to production codebases, internal tools, and platform services with a focus on long-term maintainability.
Influence technical direction across teams through design reviews, technical proposals, and clear written communication.
Improve operational maturity through better incident practices, post-incident learning, and continuous improvement loops.
Mentor engineers by modeling strong ownership, technical judgment, and disciplined delivery.
Participate in on-call rotations, with a clear mandate to reduce operational load over time through engineering.
Skills and experience:
8+ years of experience building and operating production-grade software systems in complex environments.
Strong experience