Site Reliability Engineer General Summary:
Site reliability engineers (SREs) are responsible for improving system reliability and resilience to make it faster and easier to develop and deploy new software capabilities. SREs focus especially on building automation to reduce manual effort and prevent operating incidents.
Principal Duties and Responsibilities:
- Works with stakeholders such as product owners to define service level objectives (SLOs) for system operations. Tracks performance against SLOs in partnership with monitoring teams or other stakeholders, and ensures systems continue to meet SLOs over time.
- Collaborates with software developers, engineers, and operations teams on opportunities to improve performance and stability of applications and systems.
- Creates dashboards and reports to communicate key metrics.
- Performs updates to application software to improve performance, scalability, and stability of systems.
- Designs, codes, tests, and delivers software to automate manual operational work.
- Participates in operational support and on-call rotation shifts for supported systems and products.
- Performs analytics on previous incidents to understand root causes and better predict and prevent future issues.
- Identifies, evaluates, and recommends monitoring tools and diagnostic techniques to improve system observability.
- Remains current on site reliability engineering methods and trends such as observability-driven development and chaos engineering. Drives continuous improvement in software quality and in infrastructure reliability and resilience.
- Oversees, designs, implements, and manages DevOps capabilities using continuous integration/continuous delivery toolsets and automation.
- Understands and implements the governance, assurance and standards activities associated with FHLB policies and procedures.
- Performs other duties as needed to support the team and the business.
Minimum Knowledge, Skills and Abilities Required:
- Knowledge at a level normally acquired through completion of a Bachelor's Degree in Computer Science, Information Technology, or a related field of study, or 4 years of equivalent experience.
- Ability to collaborate in a team environment and to adapt effectively and quickly to a rapidly changing, highly regulated environment.
- Advanced analytical and problem-solving skills to identify, research, and resolve server problems effectively and efficiently.
- Strong verbal and written communication skills.
- 3+ years of experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
- 3+ years of experience with incident and response management.
- Exposure to Agile and DevOps development methodologies.
- Experience with working in cloud ecosystems, preferably Microsoft Azure.
- Exposure to monitoring and observability tools (e.g. Dynatrace, Splunk, CloudWatch, New Relic, ELK, Prometheus, OpenTelemetry).
- Exposure to configuration management systems (e.g. Puppet, Ansible, Chef, Salt, Terraform).
- Exposure to continuous integration/continuous deployment tools (e.g. Git, TeamCity, Jenkins, Artifactory).
- Demonstrates a commitment to diversity and inclusion. Promotes an environment of empathy and respect, ensures the inclusion of all team members, and actively engages in D&I events and learning opportunities.
Working Conditions:
Requires daily interaction with system and networking hardware and software, using PCs and servers for the majority of duties. Exposed to moderate noise when working in the server room. Requires lifting and moving equipment weighing approximately 30 lbs. when moving or installing switches, routers, or related equipment and configurations. Must be able to respond quickly to problems that affect production uptime, occasionally requiring work outside normal Bank hours (i.e. weekends, evenings, or early mornings).
Notation: This position has been identified as "high risk" as outlined in the Bank's Background Check policy. Individuals occupying this position will be required to submit to a background check biennially. Such repeat background checks are considered a "condition of continued employment".