Site Reliability Engineer II

Date: Jan 31, 2024

Location: Hanoi, VN

Company: Optimizely

Optimizely is focused on unlocking the boundless potential of our clients and employees. We are a category leader in Digital Experience Platforms (DXP) and have the pleasure of serving over 9,000 brands, from global organizations such as Visa, Sky, Yamaha, and Wall Street Journal to tech innovators like Atlassian, DocuSign, FitBit, and Zillow.

Optimizely fosters an inclusive and diverse culture with a global team of 1500+ people spread across the US, Europe, Dubai, Australia, Singapore, Bangladesh, and Vietnam. Our unique work environment focuses on flexibility, trust, teamwork, diversity, and moving fast.

We genuinely believe that our people make all the difference, and once we find the best talent, we go out of our way to nurture them.  If you are looking to work on the next generation of digital technologies in a fast-paced and growing environment with industry leaders, Optimizely is the place for you!


Reliability Engineering is a rapidly growing part of the organization. We are in the process of building our teams, tools and systems as part of our mission to build the leading digital experience platform.

We enable Optimizely to go fast by providing real-time feedback on production systems. We work side by side with the product family and platform developers to maintain and improve services and performance. We live the company values (Dependable, Collaborative and Simple) with a strong customer focus and a healthy sense of urgency. We are a heavily data-driven team, utilising a variety of data collection, enrichment, analytics and visualisation techniques to learn about our complex systems.

We also live the 'Play, as a team' value by having a strong focus on sharing learning experiences from the front line with the development teams. So, the options for people in the team are vast. If you like mastering a domain and going deep, we need you. If you can juggle three tasks and coordinate multiple people in the heat of an incident, we need you. If you love the benefits of process and methodical improvement, you will love it here. If you want to keep your head down, headphones on and bash out code to support the team, we have a spot for you too.

As an SRE in one of our teams, you will work to enhance the availability, performance and stability of Optimizely services and to automate away repetitive work.

You'll also respond to pings, pages and alerts to investigate issues in our products that you can really sink your teeth into. You'll be working on non-production and production environments, monitoring, data collection and configuration management, as well as disaster recovery planning, capacity engineering, reliability improvement initiatives and platform automation. 

Job Responsibilities
  • Engage in the entire lifecycle of services, from inception through operation and decommissioning
  • Identify areas of improvement within our systems and perform enhancements
  • Reduce the impact of errors and automate repetitive tasks
  • Maintain services by measuring and monitoring availability, latency and overall system health
  • Author and maintain documentation for related processes, procedures and system events
  • Serve as a level 3 support resource for the systems the team is responsible for
  • Troubleshoot and resolve end-user issues independently and efficiently
  • Build knowledge base around common production support issues
  • Troubleshoot and fix the system when it breaks
  • Drive root cause analysis and the completion of corrective actions to help eliminate service disruptions and improve the day-to-day operations of the organization
  • Share the responsibility of being on-call


Knowledge and Experience
  • Expert-level troubleshooting skills across different levels of the stack
  • Scripting and software development in one or more programming languages (PowerShell / Bash / Python)
  • Good understanding of cloud architecture in both Windows- and Linux-based systems
  • At least 2 years of hands-on experience with cloud infrastructure such as Azure or AWS
  • Deep expertise in monitoring distributed systems and application architectures
  • Exposure to and maintenance of CI/CD and orchestration tools at scale (Azure Automation, Octopus Deploy, Salt, Puppet, Chef, etc.)
  • Diagnosing and troubleshooting user-facing service outages
  • Exposure to system- and application-level telemetry for large distributed cloud architectures
  • Diagnosing and resolving problems in high-throughput web applications and network services

Bachelor’s Degree (Computer Science or engineering preferred) or equivalent work experience

About us

  • 5 working days per week with flexible working time and no overtime.
  • Annual luxury Kick-off vacation.
  • International, professional, creative working environment and talented teams.
  • Onsite opportunities in Europe and the US.
  • Cultural, sports, and art clubs and activities, sponsored and/or supported by the Company (e.g., football, gym, swimming, guitar, English...).
  • Powerful workstation: Core i7-9700, 16-32 GB RAM, 2 x QHD 2560x1440 monitors (2K resolution).
  • 100% official salary during the probation period, 13th month salary, annual salary raises.
  • 12 days of annual leave and 3 days of company holidays (New Year's Eve 31/12, Juneteenth 18/6, Work Anniversary).
  • Up to 3 extra paid-leave days per year.
  • A free “Hacking Day” per month for self-studying and researching any IT-related subjects.
  • Social, Health and Unemployment Insurance are based on 100% of gross salary and fully paid by the Company.
  • Extra bonus of $60 per special occasion (Birthday, Labor Day, National Day, Solar New Year, Lunar New Year).
  • Lunch allowance of $30 per month.
  • Baby allowance of $12 per month for a child under 3 years old.
  • AON Premium Healthcare Insurance package for employees and their children up to 18 years old.
  • A daily variety of food, drinks, and seasonal fresh fruit.

And many other benefits. Join us to discover them!