Senior Site Reliability Engineer

Date: Jul 10, 2025

Location: Hanoi, VN

Company: Optimizely

At Optimizely, we're on a mission to help people unlock their digital potential. We do that by reinventing how marketing and product teams work to create and optimize digital experiences across all channels. With Optimizely One, our industry-first operating system for marketers, we offer teams flexibility and choice to build their stack their way with our fully SaaS, fully decoupled, and highly composable solution.  

We are proud to help more than 10,000 businesses, including H&M, PayPal, Zoom, and Toyota, enrich their customer lifetime value, increase revenue and grow their brands. Our innovation and excellence have earned us numerous recognitions as a leader by industry analysts such as Gartner, Forrester, and IDC, reinforcing our role as a trailblazer in MarTech. 

At our core, we believe work is about more than just numbers -- it's about the people. Our culture is dynamic and constantly evolving, shaped by every employee, their actions and their stories. With over 1,500 Optimizers spread across 12 global locations, our diverse team embodies the "One Optimizely" spirit, emphasizing collaboration and continuous improvement, while fostering a culture where every voice is heard and valued.

Hiring Manager: Hoan Le (hoan.le@optimizely.com)

Introduction

SREs at Optimizely are focused on making us the most reliable, performant, and trustworthy Digital Experience Optimization platform ever. Our engineering teams have built data pipelines that process 10 billion events daily and applications that support powerful experimentation and collaboration workflows at scale. Our platforms are built on AWS and GCP. We use technologies such as Kafka, Samza, HBase, MySQL, and Postgres. We build and manage our systems using TravisCI, Jenkins, Docker, Kubernetes, Terraform, and Chef. We use a combination of managed and self-hosted approaches. This is a unique opportunity to lead the engineering organization in areas of standardized automated infrastructure and service provisioning and orchestration, service-oriented architectural excellence, and forward-looking planning and execution of large technical projects.

We are looking for a Senior Site Reliability Engineer to help build and scale our CloudOps capabilities. You will be responsible for designing, implementing, and operating critical infrastructure and platform services while collaborating closely with engineering, support, and product teams to improve the reliability, scalability, and performance of our systems.

This is a hands-on technical role where you will be instrumental in shaping the SRE culture, driving automation, and ensuring high availability across all services.

Job Responsibilities

  • Champion a Site Reliability Engineering culture across the organization by sharing best practices, tools, documentation, and code.
  • Identify and automate manual operational tasks using scripting, infrastructure-as-code, and CI/CD pipelines.
  • Build and maintain observability (monitoring, logging, tracing) for all production systems to ensure reliability, availability, and performance.
  • Proactively monitor alerts across all platforms and coordinate with SRE, Operations, Engineering, and Support teams to detect and resolve incidents quickly, minimizing MTTA/MTTR.
  • Lead and manage on-call rotations, driving a blameless incident management and postmortem culture.
  • Collaborate with development teams to define and implement SLOs, SLIs, and error budgets.
  • Ensure uptime SLAs are met through robust automation, testing, monitoring, and operational best practices.
  • Create and maintain runbooks, playbooks, and system documentation to ensure operational readiness and knowledge sharing.

Knowledge and Experience

  • Strong experience in Linux Systems Administration in cloud or virtualized environments
  • Proficiency in infrastructure-as-code tools such as Terraform
  • Hands-on experience with configuration management tools like Ansible or SaltStack
  • Skilled in scripting and automation using Python and Bash
  • Experience deploying and maintaining services in public cloud environments (Azure, AWS, or GCP)
  • Solid understanding of observability tooling, especially Datadog, ELK Stack (Elasticsearch, Logstash, Kibana), or similar
  • Experience building and maintaining CI/CD pipelines (e.g., GitHub Actions, Azure DevOps, Octopus)
  • Familiarity with Kubernetes and Docker; production experience is a strong plus
  • Experience operating and scaling distributed systems across multiple regions
  • Strong communication and collaboration skills; comfortable working across time zones
  • Passion for learning, continuous improvement, and a strong sense of ownership
  • Fluent in English, both written and spoken

Optimizely is committed to a diverse and inclusive workplace. Optimizely is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status.

 
