<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SRE on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/sre/</link><description>Recent content in SRE on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 26 May 2023 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/sre/index.xml" rel="self" type="application/rss+xml"/><item><title>Cloud Computing (7): Cloud Operations and DevOps Practices</title><link>https://www.chenk.top/en/cloud-computing/operations-devops/</link><pubDate>Fri, 26 May 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/operations-devops/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/cloud-computing/operations-devops/illustration_1.png" alt="Cloud Computing (7): Cloud Operations and DevOps Practices — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;p>In 2017 GitLab lost six hours of database state. An engineer, exhausted, ran &lt;code>rm -rf&lt;/code> on the wrong server during an incident. The backup procedures had silently been broken for months; nobody noticed because no one was restoring from backups. The lesson is not &amp;ldquo;be careful with rm&amp;rdquo;. The lesson is that operations is a &lt;em>system&lt;/em> — tools, runbooks, monitoring, automation, and the rituals around them. When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.&lt;/p></description></item></channel></rss>