A combined introduction to operating systems and computer networks
Updated Feb 2, 2017
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
Google's Site Reliability Engineering book converted to audio
Control health checks and toggle upstream node status in load balancers with ease.
Dev environment for SRE
[INACTIVE] Terraform provider for Arachnys' Cabot. Create, manage, and manipulate status checks and alerts for services.
The agent for Komlog, a PaaS that helps observability teams better understand their systems.
Endpoint monitoring and DNS failover agent written in Go
Great resources for learning Software and Site Reliability Engineering.
Calculate the tolerable downtime of your service
A collection of templates ported from the SRE Workbook
A party card game for engineers caring about reliability. Based on Cards Against Humanity.
Script to monitor the Azure Traffic Manager service.
Overall map of topics to cover for my “Engineering for Site Reliability” blog series.
Holberton's DevOps/SRE curriculum: projects and code focused on Bash scripting, system design, automation, web infrastructure, web servers, Linux, Vagrant, and Vim. See the README in each project for a more detailed description.
Maia is a CLI that allows you to execute remote commands on multiple machines at once.
External Node Classifier written in Go
A collection of SRE tools
💯 A curated list of site reliability and production engineering resources
The Skinny Distributed Lock Service
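The "Calculate the tolerable downtime of your service" entry in the list above comes down to one piece of arithmetic: an availability SLO of s percent over a period T leaves (1 − s/100) × T of allowed downtime, often called the error budget. The following is a minimal Python sketch, assuming availability is measured as the fraction of wall-clock time the service is up. The function name `tolerable_downtime` is hypothetical and is not taken from that project.

```python
from datetime import timedelta


def tolerable_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows within `period`.

    Assumption (not from the listed project): availability is the fraction
    of wall-clock time the service is up, so
    allowed downtime = (1 - SLO) * period.
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    # The error budget is the fraction of the period the service may be down.
    error_budget = 1.0 - slo_percent / 100.0
    # timedelta * float rounds the result to the nearest microsecond.
    return period * error_budget


if __name__ == "__main__":
    periods = (("30 days", timedelta(days=30)), ("365 days", timedelta(days=365)))
    for slo in (99.0, 99.9, 99.99):
        for label, period in periods:
            print(f"{slo}% over {label}: {tolerable_downtime(slo, period)}")
```

For example, a 99.9% SLO over 30 days allows about 43 minutes 12 seconds of downtime, and 99.99% over 365 days allows about 52.6 minutes.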