Site Reliability Engineer

Date: Aug 3, 2022

Location: Berlin, DE, 10176

Company: Optimizely

Optimizely is focused on unlocking the digital potential of our clients and our employees. We are the recognized category leader in Digital Experience Platform (DXP) and in email marketing automation software. In the DACH market alone, over 1,000 companies rely on us, including Deutsche Bahn, TUI, Tchibo, myToys and Payback.

 

We live an inclusive and diverse culture with a global team of 1,500+ people across the US, Europe, Australia, Bangladesh, and Vietnam. We blend European and American business culture with an emphasis on teamwork, diversity, and moving fast. Our people make the difference!

 

If you are looking to work on the next generation of digital technologies in a fast-paced, hyper-growth environment, let’s have a conversation! We’re just getting started...

 

Introduction

Reliability Engineering is a rapidly growing part of the organization. We are in the process of building our teams, tools and systems as part of our mission to build the leading digital experience platform.

We enable Optimizely to go fast by providing real-time feedback on production systems. We work side by side with the product family and platform developers to maintain and improve services and performance. We live the company values (Dependable, Collaborative and Simple) with a strong customer focus and a healthy sense of urgency. We are a heavily data-driven team, using a variety of data collection, enrichment, analytics and visualisation techniques to learn about our complex systems.

We also live the 'Play, as a team' value by having a strong focus on sharing learning experiences from the front line with the development teams. So, the options for people in the team are vast. If you like mastering a domain and going deep, we need you. If you can juggle three tasks and coordinate multiple people in the heat of an incident, we need you. If you love the benefits of process and methodical improvement, you will love it here. If you want to keep your head down, headphones on and bash out code to support the team, we have a spot for you too.

As an SRE in one of our teams, you will work to enhance the availability, performance and stability of Optimizely services and to automate away repetitive work.

You'll also respond to pings, pages and alerts to investigate issues in our products that you can really sink your teeth into. You'll be working on non-production and production environments, monitoring, data collection and configuration management, as well as disaster recovery planning, capacity engineering, reliability improvement initiatives and platform automation. 

Job Responsibilities

Engage in the entire lifecycle of services, from inception through operation and decommissioning
Identify areas of improvement within our systems and perform enhancements
Reduce the impact of errors and automate repetitive tasks
Maintain services by measuring and monitoring availability, latency and overall system health
Author and maintain documentation for related processes, procedures and system events
Serve as a level 3 support resource for the systems the team is responsible for
Troubleshoot and resolve end-user issues independently and efficiently
Build knowledge base around common production support issues
Troubleshoot and fix the system when it breaks
Drive root cause analysis and the completion of corrective actions to help eliminate service disruptions and, in turn, improve the day-to-day operations of the organization
Share the responsibility of being on-call

Knowledge and Experience

Expert-level troubleshooting skills across different levels of the stack
Scripting and software development in one or more programming languages (PowerShell / Bash / Python)
Good understanding of cloud architecture in both Windows- and Linux-based systems
At least 2 years of hands-on experience with cloud infrastructure such as Azure or AWS
Deep expertise in monitoring distributed systems and application architectures
Exposure to and maintenance of CI/CD and orchestration tools at scale (Azure Automation, Octopus Deploy, Salt, Puppet, Chef, etc.)
Diagnosing and troubleshooting user-facing service outages
Exposure to system- and application-level telemetry for large distributed cloud architectures
Diagnosing and resolving problems in high-throughput web applications and network services

Education

Bachelor’s degree (Computer Science or Engineering preferred) or equivalent work experience

Competencies

Displaying Technical Expertise
Critical Thinking
Testing and Troubleshooting
Demonstrating Initiative
Utilizing Feedback

Optimizely is committed to a diverse and inclusive workplace. Optimizely is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status.

 
