Site Reliability Engineering

How Google Runs Production Systems

Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (Authors)

Book | Softcover

552 pages

2016
O'Reilly Media (Publisher)
978-1-4919-2912-4 (ISBN)

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world.

You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:
Introduction — Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles — Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices — Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
Management — Explore Google's best practices for training, communication, and meetings that your organization can use

Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google's advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He's also a licensed professional engineer.

Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland's peering hub. He is the author or co-author of a number of technical papers and books, including "IPv6 Network Administration" for O'Reilly, and a number of RFCs. He is currently co-writing a history of the Internet in Ireland, and he is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.

Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and is based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.

Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Introduction
Chapter 1 Introduction
The Sysadmin Approach to Service Management
Google’s Approach to Service Management: Site Reliability Engineering
Tenets of SRE
The End of the Beginning
Chapter 2 The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That “Organizes” the Hardware
Other System Software
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Principles
Chapter 3 Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Motivation for Error Budgets
Chapter 4 Service Level Objectives
Service Level Terminology
Indicators in Practice
Objectives in Practice
Agreements in Practice
Chapter 5 Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
Chapter 6 Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Conclusion
Chapter 7 The Evolution of Automation at Google
The Value of Automation
The Value for Google SRE
The Use Cases for Automation
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
Chapter 8 Release Engineering
The Role of a Release Engineer
Philosophy
Continuous Build and Deployment
Configuration Management
Conclusions
Chapter 9 Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won’t Give Up My Code!
The “Negative Lines of Code” Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
Practices
Chapter 10 Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On…
Chapter 11 Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Feeling Safe
Avoiding Inappropriate Operational Load
Conclusions
Chapter 12 Effective Troubleshooting
Theory
In Practice
Negative Results Are Magic
Case Study
Making Troubleshooting Easier
Conclusion
Chapter 13 Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
All Problems Have Solutions
Learn from the Past. Don’t Repeat It.
Conclusion
Chapter 14 Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Elements of Incident Management Process
A Managed Incident
When to Declare an Incident
In Summary
Chapter 15 Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
Chapter 16 Tracking Outages
Escalator
Outalator
Chapter 17 Testing for Reliability
Types of Software Testing
Creating a Test and Build Environment
Testing at Scale
Conclusion
Chapter 18 Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Intent-Based Capacity Planning
Fostering Software Engineering in SRE
Conclusions
Chapter 19 Load Balancing at the Frontend
Power Isn’t the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
Chapter 20 Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
Limiting the Connections Pool with Subsetting
Load Balancing Policies
Chapter 21 Handling Overload
The Pitfalls of “Queries per Second”
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Load from Connections
Conclusions
Chapter 22 Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Preventing Server Overload
Slow Startup and Cold Caching
Triggering Conditions for Cascading Failures
Testing for Cascading Failures
Immediate Steps to Address Cascading Failures
Closing Remarks
Chapter 23 Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
How Distributed Consensus Works
System Architecture Patterns for Distributed Consensus
Distributed Consensus Performance
Deploying Distributed Consensus-Based Systems
Monitoring Distributed Consensus Systems
Conclusion
Chapter 24 Distributed Periodic Scheduling with Cron
Cron
Cron Jobs and Idempotency
Cron at Large Scale
Building Cron at Google
Summary
Chapter 25 Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Introduction to Google Workflow
Stages of Execution in Workflow
Ensuring Business Continuity
Summary and Concluding Remarks
Chapter 26 Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements
Google SRE Objectives in Maintaining Data Integrity and Availability
How Google SRE Faces the Challenges of Data Integrity
Case Studies
General Principles of SRE as Applied to Data Integrity
Conclusion
Chapter 27 Reliable Product Launches at Scale
Launch Coordination Engineering
Setting Up a Launch Process
Developing a Launch Checklist
Selected Techniques for Reliable Launches
Development of LCE
Conclusion
Management
Chapter 28 Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What?
Initial Learning Experiences: The Case for Structure Over Chaos
Creating Stellar Reverse Engineers and Improvisational Thinkers
Five Practices for Aspiring On-Callers
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
Closing Thoughts
Chapter 29 Dealing with Interrupts
Managing Operational Load
Factors in Determining How Interrupts Are Handled
Imperfect Machines
Chapter 30 Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context
Phase 2: Sharing Context
Phase 3: Driving Change
Conclusion
Chapter 31 Communication and Collaboration in SRE
Communications: Production Meetings
Collaboration within SRE
Case Study of Collaboration in SRE: Viceroy
Collaboration Outside SRE
Case Study: Migrating DFP to F1
Conclusion
Chapter 32 The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
The PRR Model
The SRE Engagement Model
Production Readiness Reviews: Simple PRR Model
Evolving the Simple PRR Model: Early Engagement
Evolving Services Development: Frameworks and SRE Platform
Conclusion
Conclusions
Chapter 33 Lessons Learned from Other Industries
Meet Our Industry Veterans
Preparedness and Disaster Testing
Postmortem Culture
Automating Away Repetitive Work and Operational Overhead
Structured and Rational Decision Making
Conclusions
Chapter 34 Conclusion
Appendix A Availability Table
Appendix B A Collection of Best Practices for Production Services
Fail Sanely
Progressive Rollouts
Define SLOs Like a User
Error Budgets
Monitoring
Postmortems
Capacity Planning
Overloads and Failure
SRE Teams
Appendix C Example Incident State Document
Appendix D Example Postmortem
Lessons Learned
Timeline
Supporting information:
Appendix E Launch Coordination Checklist
Appendix F Example Production Meeting Minutes

Publication date	14 April 2016
Place of publication	Sebastopol
Language	English
Dimensions	150 x 250 mm
Weight	666 g
Binding	Paperback
Subject area	Mathematics / Computer Science ► Computer Science ► Software Development
Subject area	Mathematics / Computer Science ► Computer Science ► Theory / Studies
Keywords	Reliability • Scalability • Software Development • SRE • Reliability (Computer Science)
ISBN-10	1-4919-2912-X / 149192912X
ISBN-13	978-1-4919-2912-4 / 9781491929124
Condition	New