Cloud Computing (7): Cloud Operations and DevOps Practices

Fri, 26 May 2023 09:00:00 +0000

In 2017, GitLab lost about six hours of production database data. An exhausted engineer ran rm -rf on the wrong server during an incident. The backup procedures had been silently broken for months; nobody noticed because nobody was restoring from backups. The lesson is not “be careful with rm”. The lesson is that operations is a system: tools, runbooks, monitoring, automation, and the rituals around them. When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.

SRE on Chen Kai Blog
