Worth Reading: Ops is everyone’s job now

Distributed systems are never “up”; they exist in a constant state of partially degraded service. Accept failure, design for resiliency, protect and shrink the critical path. You can’t hold the entire system in your head or reason about it; you will live or die by the thoroughness of your instrumentation and observability tooling. You need robust service registration and discovery, load balancing, and backpressure between every combination of components. You need to learn to integrate third-party services; many core functions will be outsourced to teams or companies that you have no direct visibility into or influence upon. You have to test in production, and you have to do so safely; you cannot spin up a staging copy of a large distributed system. —Charity Majors @ opensource
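
To make the backpressure point a little more concrete, here is a minimal sketch of one common way to apply it between two components: a bounded queue that refuses new work when it is full, so the caller has to slow down, retry later, or shed load. This is not from the quoted piece; `BoundedWorkQueue`, `max_pending`, and `submit` are hypothetical names chosen for illustration, and a real system would likely combine this with timeouts and metrics.

```python
import queue


class BoundedWorkQueue:
    """Hypothetical helper: refuses new work when full instead of buffering without limit."""

    def __init__(self, max_pending: int):
        # queue.Queue with a maxsize raises queue.Full on put_nowait once the limit is hit.
        self._q = queue.Queue(maxsize=max_pending)
        self.rejected = 0  # count of refused submissions, useful as an instrumentation signal

    def submit(self, item) -> bool:
        """Try to enqueue without blocking; return False to push back on the caller."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest pending item (raises queue.Empty if none)."""
        return self._q.get_nowait()


# Usage: a producer offers four items to a queue that holds at most two.
work = BoundedWorkQueue(max_pending=2)
results = [work.submit(n) for n in range(4)]

# Once a consumer drains an item, the producer is accepted again.
first = work.take()
accepted_after_drain = work.submit(99)
```

The design choice here is to fail fast rather than block: the producer learns immediately that the downstream component is saturated, and the `rejected` counter gives the observability tooling something to alert on.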