Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Confluent
Spain (Remote) | Competitive | Posted 20 days ago | Remote
English required | Full-time

We're not just building better tech. We're rewriting how data moves and what the world can do with it. With Confluent, data doesn't sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.


About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale: data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence

  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack

  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments

  • Own standards, practices, and continuous improvement of incident response across engineering

  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity

  • Develop and deliver training programs; coach teams through post-mortems

  • Partner with engineering leaders to elevate reliability practices org-wide

What You Will Bring:

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering

  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)

  • Experience navigating reliability/incident programs at 500+ engineer organizations

  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)

  • Strong understanding of distributed systems and failure modes at scale

  • Deep experience with observability: metrics, logging, tracing

  • Kubernetes and container orchestration experience

  • Understanding of CI/CD pipelines and release processes

  • Strong written communication (design docs, runbooks, post-mortems)

  • Experience driving org-wide process and cultural changes

  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Ready to build what's next? Let's get in motion.

Come As You Are

Belonging isn't a perk here. It's the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what's possible.

We're proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Privacy Statement

Confluent is an IBM subsidiary which has been acquired by IBM and will be integrated into the IBM organization. By proceeding with this application, you understand that Confluent will share your personal information with other IBM affiliates involved in your recruitment process, wherever these are located. More information on how IBM protects your personal information, including the safeguards in case of cross-border data transfer, is available here.

Application managed by Confluent