Open source developer collaboration platform

Principal Site Reliability Engineer (Remote - US/Canada/Brazil/Mexico/Colombia/Chile)

Location
Remote
Job Type
Full-time
Experience
6+ years

About the role

Mattermost is an open source platform for secure collaboration across the entire software development lifecycle. Hundreds of thousands of developers around the globe trust Mattermost to increase their productivity by bringing together team communication, task and project management, and workflow orchestration into a unified platform for agile software development.

Founded in 2016, Mattermost’s open source platform powers over 800,000 workspaces worldwide with the support of over 4,000 contributors from across the developer community. The company serves over 800 customers, including the European Parliament, NASA, Nasdaq, Samsung, SAP, the United States Air Force and Wealthfront, and is backed by world-class investors including Battery Ventures, Redpoint, S28 Capital and YC Continuity. To learn more, visit www.mattermost.com.

We value high impact work, ownership, self-awareness and being focused on customer success. If these values match who you are, we hope you'll learn more about working at Mattermost and apply!

We are looking for an engineer with demonstrated experience in software development and infrastructure using Kubernetes. You will ensure the reliability and scalability of Mattermost’s new SaaS offering by building tools and deploying infrastructure and automation in Kubernetes.

Here are some of the challenges and work of the SRE team:

Responsibilities:

  • Build services and tools to ensure the stability of Mattermost’s SaaS offering
  • Define infrastructure in code with IaC tools like Terraform
  • Write thoughtful and high-quality code in Go
  • Follow our engineering best practices, and ensure alignment with our Leadership Principles
  • Provide technical mentorship for fellow engineers
  • Develop services to handle automatic recovery from incidents and disasters
  • Automate incident or disaster simulations to identify blind spots
  • Set technical vision and innovate to be on the forefront of self-healing SaaS services
  • Implement, maintain and tune monitoring and alerting systems
  • Deploy applications to and manage Kubernetes clusters
  • Participate in our on-call rotation to respond to incidents and resolve problems

Required Background/Skills:

  • Bachelor's degree in Computer Science or related fields, or significant professional DevOps or SRE experience
  • 5+ years of previous experience as a developer or SRE with operational responsibilities
  • Proven experience responding on-call to incidents with superior knowledge of incident response processes
  • Strong skills and experience working with Kubernetes inside and out
  • Strong skills and experience working with infrastructure as code tools, such as Terraform
  • Solid programming skills and experience with Go, or the ability to quickly become proficient in it
  • Familiarity with container systems such as Kubernetes & Docker
  • Familiarity with GitOps and Chaos Engineering
  • Ability and willingness to be on-call

Preferences:

  • Experience with distributed application systems using HTTP, WebSockets, RPC, pub/sub, etc. at scale
  • Open source contributions to related projects
  • Knowledge of Grafana and Prometheus suite
  • Comfortable with GitHub, Jira, Jenkins, CircleCI
  • Experience with WebRTC for real-time communication architectures
  • Experience working in open source communities

Why you should join Mattermost

Mattermost is an open source, remote-first enterprise software company headquartered in Palo Alto, California, with engineering teams around the world. Our first product is an open source Slack alternative for high-trust DevOps teams who need to execute modern DevOps workflows while meeting stringent regulatory requirements.

Mattermost
Founded:
Team Size: 150
Location: Palo Alto
Founders
Ian Tien
CEO