Site Reliability Engineer II

Date: Apr 3, 2024

Location: Hanoi, VN

Company: Optimizely

At Optimizely, we're on a mission to help people unlock their digital potential. We do that by reinventing how marketing and product teams work to create and optimize digital experiences across all channels. With Optimizely One, our industry-first operating system for marketers, we offer teams flexibility and choice to build their stack their way with our fully SaaS, fully decoupled, and highly composable solution.  

We are proud to help more than 10,000 businesses, including H&M, PayPal, Zoom, and Toyota, enrich their customer lifetime value, increase revenue and grow their brands. Our innovation and excellence have earned us numerous recognitions as a leader by industry analysts such as Gartner, Forrester, and IDC, reinforcing our role as a trailblazer in MarTech. 

At our core, we believe work is about more than just numbers -- it's about the people. Our culture is dynamic and constantly evolving, shaped by every employee, their actions and their stories. With over 1,500 Optimizers spread across 12 global locations, our diverse team embodies the "One Optimizely" spirit, emphasizing collaboration and continuous improvement while fostering a culture where every voice is heard and valued.

Join us and become part of a company that's empowering people to unlock their digital potential! 

Introduction

Reliability Engineering is a rapidly growing part of the organization. We are building our teams, tools and systems as part of our mission to build the leading digital experience platform.

 

We enable Optimizely to go fast by providing real-time feedback on production systems. We work side by side with the product family and platform developers to maintain and improve services and performance. We live the company values (Dependable, Collaborative and Simple) with a strong customer focus and a healthy sense of urgency. We are a heavily data-driven team, using a variety of data collection, enrichment, analytics and visualisation techniques to learn about our complex systems.

 

We also live the 'Play, as a team' value by having a strong focus on sharing learning experiences from the front line with the development teams. So, the options for people in the team are vast. If you like mastering a domain and going deep, we need you. If you can juggle three tasks and coordinate multiple people in the heat of an incident, we need you. If you love the benefits of process and methodical improvement, you will love it here. If you want to keep your head down, headphones on and bash out code to support the team, we have a spot for you too.

 

As an SRE in one of our teams, you will work to enhance the availability, performance and stability of Optimizely services, and to automate away repetitive work.

 

You'll also respond to pings, pages and alerts to investigate issues in our products that you can really sink your teeth into. You'll be working on non-production and production environments, monitoring, data collection and configuration management, as well as disaster recovery planning, capacity engineering, reliability improvement initiatives and platform automation. 

Job Responsibilities

  • Engage in the entire lifecycle of services, from inception through operation and decommissioning
  • Identify areas of improvement within our systems and perform enhancements
  • Reduce the impact of errors and automate repetitive tasks
  • Maintain services by measuring and monitoring availability, latency and overall system health
  • Author and maintain documentation for related processes, procedures and system events
  • Serve as a level 3 support resource for the systems the team is responsible for
  • Troubleshoot and resolve end-user issues independently and efficiently
  • Build a knowledge base around common production support issues
  • Troubleshoot and fix the system when it breaks
  • Drive root cause analysis and the completion of corrective actions to help eliminate service disruptions and, consequently, improve the day-to-day operations of the organization
  • Share the responsibility of being on-call

Knowledge and Experience

  • Expert-level troubleshooting skills across different levels of the stack
  • Scripting and software development in one or more programming languages (PowerShell / Bash / Python)
  • Good understanding of cloud architecture in both Windows- and Linux-based systems
  • At least 2 years of hands-on experience with cloud infrastructure such as Azure or AWS
  • Deep expertise in monitoring distributed systems and application architectures
  • Exposure to and maintenance of CI/CD and orchestration tools at scale (Azure Automation, Octopus Deploy, Salt, Puppet, Chef, etc.)
  • Diagnosing and troubleshooting user-facing service outages
  • Exposure to system- and application-level telemetry for large distributed cloud architectures
  • Diagnosing and resolving problems in high-throughput web applications and network services

Education

Bachelor's degree (computer science or engineering preferred) or equivalent work experience

Competencies

  • Displaying Technical Expertise
  • Critical Thinking
  • Testing and Troubleshooting
  • Demonstrating Initiative
  • Utilizing Feedback

Optimizely is committed to a diverse and inclusive workplace. Optimizely is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status.

 
