Site Reliability Engineering Quotes

“When standard operating procedures break down, they’ll need to be able to improvise fully.”
― Site Reliability Engineering: How Google Runs Production Systems
“At scale, there will be anomalies that are hard to detect, so they’ll need the ability to think statistically, rather than procedurally, to uncloak problems.”
― Site Reliability Engineering: How Google Runs Production Systems
“In the course of their jobs, they will come across systems they’ve never seen before, so they need to have strong reverse engineering skills.”
― Site Reliability Engineering: How Google Runs Production Systems
“Which backends of this server are considered “in the critical path,” and why? What aspects of this server could be simplified or automated? Where do you think the first bottleneck is in this architecture? If that bottleneck were to be saturated, what steps could you take to alleviate it?”
― Site Reliability Engineering: How Google Runs Production Systems
“Making the jump from a previous company or university, while changing job roles (from traditional software engineer or traditional systems administrator) to this nebulous Site Reliability Engineer role is often enough to knock students’ confidence down several times. For more introspective personalities (especially regarding questions #2 and #3), the uncertainties incurred by nebulous or less-than-clear answers can lead to slower development or retention problems.”
― Site Reliability Engineering: How Google Runs Production Systems
“Hiring SREs well is critical to having a high-functioning reliability organization, as explored in “Hiring Site Reliability Engineers” [Jon15]. Google’s hiring practices have been detailed in texts like Work Rules! [Boc15], but hiring SREs has its own set of particularities. Even by Google’s overall standards, SRE candidates are difficult to find and even harder to interview effectively.”
― Site Reliability Engineering: How Google Runs Production Systems
“Any organization that aspires to be serious about running an effective SRE arm needs to consider training. Teaching SREs how to think in a complicated and fast-changing environment with a well-thought-out and well-executed training program has the promise of instilling best practices within a new hire’s first few weeks or months that otherwise would take months or years to accumulate.”
― Site Reliability Engineering: How Google Runs Production Systems
“The Role of the Launch Coordination Engineer. Our Launch Coordination Engineering team is composed of Launch Coordination Engineers (LCEs), who are either hired directly into this role, or are SREs with hands-on experience running Google services. LCEs are held to the same technical requirements as any other SRE, and are also expected to have strong communication and leadership skills.”
― Site Reliability Engineering: How Google Runs Production Systems
“The launch cycle at Internet companies is markedly different. Launches and rapid iterations are far easier because new features can be rolled out on the server side, rather than requiring software rollout on individual customer workstations.”
― Site Reliability Engineering: How Google Runs Production Systems
“It’s important to establish credibility by delivering some product of value in a reasonable amount of time. Your first round of products should aim for relatively straightforward and achievable targets — ones without controversy or existing solutions.”
― Site Reliability Engineering: How Google Runs Production Systems
“Don’t be afraid to provide white glove customer support for early adopters to help them through the onboarding process. Sometimes automation also entails a host of emotional concerns, such as fear that someone’s job will be replaced by a shell script. By working one-on-one with early users, you can address those fears personally, and demonstrate that rather than owning the toil of performing a tedious task manually, the team instead owns the configurations, processes, and ultimate results of their technical work. Later adopters are convinced by the happy examples of early adopters.”
― Site Reliability Engineering: How Google Runs Production Systems
“To this end, Google always strives to staff its SRE teams with a mix of engineers with traditional software development experience and engineers with systems engineering experience.”
― Site Reliability Engineering: How Google Runs Production Systems
“In order to remain reliable and to avoid scaling the number of SREs supporting a service linearly, the production environment has to run mostly unattended. To remain unattended, the environment must be resilient against minor faults.”
― Site Reliability Engineering: How Google Runs Production Systems
“the SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it.”
― Site Reliability Engineering: How Google Runs Production Systems
“Automation is a force multiplier, not a panacea.”
― Site Reliability Engineering: How Google Runs Production Systems
“100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions).”
― Site Reliability Engineering: How Google Runs Production Systems
“Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.”
― Site Reliability Engineering: How Google Runs Production Systems
“When an engineer with years of familiarity in a problem space begins designing a product, it’s easy to imagine a utopian end-state for the work. However, it’s important to differentiate aspirational goals of the product from minimum success criteria (or Minimum Viable Product). Projects can lose credibility and fail by promising too much,”
― Site Reliability Engineering: How Google Runs Production Systems
“Putting alerts into email and hoping that someone will read all of them and notice the important ones is the moral equivalent of piping them to /dev/null: they will eventually be ignored.”
― Site Reliability Engineering: How Google Runs Production Systems