Kindle Notes & Highlights
by Sam Newman
Read between November 1 - December 24, 2017
With blue/green deployment, we have two copies of our software deployed at a time, but only one version of it is receiving real requests.
With canary releasing, we are verifying our newly deployed software by directing a portion of production traffic against the system to see if it performs as expected.
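As a rough sketch of the traffic-splitting idea (the backend addresses, the 5% fraction, and the routing logic below are my own assumptions, not something the book prescribes), a router might send a small slice of live requests to the new version and leave the rest on the stable one:

```python
import random

# Hypothetical backend addresses, for illustration only.
STABLE_BACKEND = "http://app-v1.internal:8080"
CANARY_BACKEND = "http://app-v2.internal:8080"
CANARY_FRACTION = 0.05  # send roughly 5% of production traffic to the canary


def pick_backend() -> str:
    """Route a small, configurable slice of live traffic to the new version."""
    if random.random() < CANARY_FRACTION:
        return CANARY_BACKEND
    return STABLE_BACKEND


if __name__ == "__main__":
    # Rough demonstration: how traffic would split across 10,000 requests.
    choices = [pick_backend() for _ in range(10_000)]
    print("canary share:", choices.count(CANARY_BACKEND) / len(choices))
```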
Many, if not most, CFRs can really only be met in production.
Due to the time it takes to run performance tests, it isn’t always feasible to run them on every check-in.
make the build go red or green based on the results, with a red (failing) build being a clear call to action.
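One hedged sketch of turning performance results into that red/green signal (the results-file format and the p99 latency budget are illustrative assumptions of mine): a CI step reads the numbers and exits non-zero when a budget is breached, which fails the build and creates the call to action.

```python
import json
import sys

# Hypothetical latency budget; pick whatever threshold matters to you.
MAX_P99_MS = 250


def check(results_path: str) -> int:
    with open(results_path) as f:
        results = json.load(f)  # e.g. {"p99_ms": 310, "error_rate": 0.01}
    if results["p99_ms"] > MAX_P99_MS:
        print(f"FAIL: p99 {results['p99_ms']}ms exceeds {MAX_P99_MS}ms")
        return 1  # non-zero exit code turns the build red
    print("PASS: performance within budget")
    return 0


if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))
```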
monitor the small things, and use aggregation to see the bigger picture.
First, we’ll want to monitor the host itself. CPU, memory
Next, we’ll want to have access to the logs from the server itself.
Finally, we might want to monitor the application itself.
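To make the "monitor the small things, aggregate to see the bigger picture" idea concrete across those three levels, here is a minimal sketch (the hostnames and numbers are invented): per-host readings are still kept and inspectable, but a fleet-level roll-up is what you watch day to day.

```python
from statistics import mean

# Illustrative per-host samples (hostname -> CPU utilisation %); in practice
# these would come from whatever collects your host-level metrics.
cpu_samples = {
    "app-host-1": 41.0,
    "app-host-2": 38.5,
    "app-host-3": 97.2,  # one unhappy host
}

# Monitor the small things...
for host, cpu in cpu_samples.items():
    print(f"{host}: {cpu:.1f}% CPU")

# ...and aggregate to see the bigger picture.
print(f"fleet average: {mean(cpu_samples.values()):.1f}%")
print(f"fleet max:     {max(cpu_samples.values()):.1f}%")
```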
to a single alert
multiplexers, which allow us to run the same commands on multiple hosts. A big monitor and
I would strongly suggest having your services expose basic metrics themselves.
Each service instance should track and expose the health of its downstream dependencies, from the database to other collaborating services.
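A minimal sketch of what exposing basic metrics and downstream health over HTTP could look like (the /health path, the port, and the dependency names are assumptions of mine, not the book's): each check reports up or down, and the endpoint returns a non-200 status when anything is unhealthy so monitoring can pick it up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


# Hypothetical downstream checks; real ones would actually ping the database
# and collaborating services, and ideally record how long each check takes.
def check_database() -> bool:
    return True


def check_recommendation_service() -> bool:
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        report = {
            "database": "up" if check_database() else "down",
            "recommendation-service": "up" if check_recommendation_service() else "down",
        }
        body = json.dumps(report).encode()
        status = 200 if all(v == "up" for v in report.values()) else 503
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8081), HealthHandler).serve_forever()
```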
You definitely want to have all your metrics in one place, and you may want to have a list of standard names for your metrics too;
Alert on the things they need to know right now. Create big visible displays with this information that sit in the corner of the room.
This data can then be dispatched to a variety of systems, like Storm for real-time analysis, Hadoop for offline batch processing, or Kibana for log analysis.
If you build this thinking into everything you do, and plan for failure, you can make different trade-offs.
Understanding cross-functional requirements is all about considering aspects like durability of data, availability of services, throughput, and acceptable latency of services.
Response time/latency
Availability
Durability of data
An essential part of building a resilient system, especially when your functionality is spread over a number of different microservices that may be up or down, is the ability to safely degrade functionality.
What we need to do is understand the impact of each outage, and work out how to properly degrade functionality.
architectural safety measures,
Netflix has made these tools available under an open source license.
Put timeouts on all out-of-process calls, and pick a default timeout for everything. Log when timeouts occur, look at what happens, and change them accordingly.
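A small sketch of that advice, assuming plain stdlib HTTP calls (the 2-second default is an arbitrary starting point, not a recommendation from the book): every out-of-process call gets an explicit timeout, and every timeout is logged so you can see what actually happens and tune the value.

```python
import logging
import socket
import urllib.error
import urllib.request

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("outbound")

DEFAULT_TIMEOUT_SECONDS = 2.0  # a default to apply everywhere, then adjust


def call_downstream(url, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Out-of-process call with an explicit timeout; log when the timeout fires."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        log.warning("timeout after %.1fs calling %s", timeout, url)
        return None
    except urllib.error.URLError as err:
        if isinstance(err.reason, socket.timeout):
            log.warning("timeout after %.1fs calling %s", timeout, url)
        else:
            log.warning("call to %s failed: %s", url, err.reason)
        return None
```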
With a circuit breaker, after a certain number of requests to the downstream resource have failed, the circuit breaker is blown. All further requests fail fast while the circuit breaker is in its blown state. After a certain period of time, the client sends a few requests through to see if the downstream service has recovered, and if it gets enough healthy responses it resets the circuit breaker.
How you implement a circuit breaker depends on what a failed request means,
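A toy sketch of the pattern as just described (the consecutive-failure threshold, the single trial request, and the reset-on-one-success behaviour are simplifications of mine; a production library such as Hystrix does considerably more):

```python
import time


class CircuitBreaker:
    """Blow after N consecutive failures, fail fast while blown, then let a
    trial request through after a cool-down period to probe for recovery."""

    def __init__(self, failure_threshold=5, reset_after_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow a trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()
            raise
        # A healthy response resets the breaker.
        self.failure_count = 0
        self.opened_at = None
        return result
```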
Separation of concerns can also be a way to implement bulkheads.
I’d recommend mandating circuit breakers for all your synchronous downstream calls.
Hystrix allows you, for example, to implement bulkheads that actually reject requests in certain conditions to ensure that resources don’t become even more saturated; this is known as load shedding.
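This is not Hystrix's actual API, but the bulkhead-with-load-shedding idea can be sketched with a bounded pool of permits (the limit of 10 concurrent calls is an arbitrary assumption): once the permits are gone, new work is rejected immediately rather than queued, so a struggling dependency is not saturated further.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to a downstream dependency; shed excess load
    rather than letting callers queue up and exhaust shared resources."""

    def __init__(self, max_concurrent=10):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def submit(self, fn, *args, **kwargs):
        # Non-blocking acquire: if no permit is free, reject straight away.
        if not self._permits.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return fn(*args, **kwargs)
        finally:
            self._permits.release()
```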
Some of the HTTP verbs, such as GET and PUT, are defined in the HTTP specification to be idempotent,
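To illustrate why that matters for retries (the in-memory store and operation names below are invented for the example), a PUT-style update that stores a full representation can be repeated safely, while a POST-style operation that applies a change each time cannot be blindly retried after a timeout:

```python
# A tiny in-memory store, purely to illustrate the difference.
accounts = {"123": {"balance": 100}}


def put_account(account_id, representation):
    """PUT-style update: repeating the same call leaves the same state."""
    accounts[account_id] = dict(representation)


def post_credit(account_id, amount):
    """POST-style operation: each call changes state again, so a blind
    retry after a timeout could apply the credit twice."""
    accounts[account_id]["balance"] += amount


put_account("123", {"balance": 150})
put_account("123", {"balance": 150})  # retried: still {"balance": 150}
post_credit("123", 50)
post_credit("123", 50)                # retried: balance is now 250, not 200
print(accounts["123"])
```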
As for the configuration of a load balancer, treat it as you treat the configuration of your service: make sure it is stored in version control and can be applied automatically.
Writes go to the primary node, but reads are distributed to one or more read replicas.
Reads are comparatively easy to scale. What about writes? One approach is to use sharding.
To pick a very simplistic (and actually bad) example, imagine that customer records A–M go to one database instance, and N–Z to another.
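A more common alternative to that alphabetic split is to hash a sharding key and map it onto the available shards; the sketch below (shard names and the SHA-1 choice are my own assumptions) spreads records far more evenly, though the fixed modulo is exactly why adding shards later requires the kind of rebalancing discussed next.

```python
import hashlib

# Hypothetical shard connection strings; only the key -> shard mapping matters here.
SHARDS = [
    "postgres://customer-shard-0",
    "postgres://customer-shard-1",
    "postgres://customer-shard-2",
]


def shard_for(customer_key: str) -> str:
    """Hash the customer key and take it modulo the shard count."""
    digest = hashlib.sha1(customer_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


print(shard_for("customer-42"))
```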
Often querying across shards is handled by an asynchronous mechanism, using cached results. Mongo uses map/reduce jobs, for example, to perform these queries.
more systems support adding extra shards to a live system, where the rebalancing of data happens in the background; Cassandra, for example, handles this very well.
With proxy caching, a proxy is placed between the client and the server. A great example of this is using a reverse proxy or content delivery network (CDN). With a proxy like Squid or Varnish,
all comes down to knowing what load you need to handle, how fresh your data needs to be, and what your system can do right now.
First, with HTTP, we can use cache-control directives in our responses to clients.
These tell clients if they should cache the resource at all, and if so how long they should cache it for in seconds.
Expires h...
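For the HTTP caching directives, a small sketch (the 300-second TTL and the payload are arbitrary choices of mine): the service sets Cache-Control with a max-age in seconds for clients and intermediaries, and can also send the older-style Expires header with an absolute time.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer


class CatalogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"catalog": "example payload"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Cache-control directive: may be cached, for up to 300 seconds.
        self.send_header("Cache-Control", "public, max-age=300")
        # Older-style alternative: an absolute expiry time.
        expires = datetime.now(timezone.utc) + timedelta(seconds=300)
        self.send_header("Expires", format_datetime(expires, usegmt=True))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CatalogHandler).serve_forever()
```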
ETag is used to determine if the value of a resource has changed.
If we already have the up-to-date version, the service sends us a 304 Not Modified response, telling us we have the latest version. If there is a newer version available, we get a 200 OK.
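A conditional GET along those lines might look like the following client-side sketch (the function name and caching scheme are mine; only the If-None-Match/ETag/304 mechanics come from the passage): send the ETag we last saw, keep our cached copy on a 304, and take the new body and ETag on a 200.

```python
import urllib.error
import urllib.request


def fetch_with_etag(url, cached_body=None, cached_etag=None):
    """Conditional GET: send If-None-Match with the last ETag we saw.
    A 304 means our cached copy is current; a 200 carries a new version."""
    request = urllib.request.Request(url)
    if cached_etag:
        request.add_header("If-None-Match", cached_etag)
    try:
        with urllib.request.urlopen(request) as resp:
            return resp.read(), resp.headers.get("ETag")  # 200 OK: new body + ETag
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return cached_body, cached_etag  # 304: keep what we have
        raise
```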
write-behind cache,
You could also have the scaling triggered by well-known trends.