Kindle Notes & Highlights
by Sam Newman
Read between June 18 - July 12, 2019
book Agile Testing (Addison-Wesley) that helps categorize the different types of tests.
Mocking or Stubbing When I talk about stubbing downstream collaborators, I mean that we create a stub service that responds with canned responses to known requests from the service under test. For example, I might tell my stub points bank that when asked for the balance of customer 123, it should return 15,000. The test doesn’t care if the stub is called 0, 1, or 100 times. A variation on this is to use a mock instead of a stub.
When using a mock, I actually go further and make sure the call was made. If the expected call is not made, the test fails.
I use stubs far more than mocks for service tests.
I rarely use mocks for this sort of testing.
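To make the distinction concrete, here is a minimal sketch (in Python, with a hypothetical points bank client and a service under test that are not from the book) showing a stub that returns a canned balance versus a mock that additionally verifies the call was made:

```python
from unittest.mock import Mock

# Hypothetical collaborator and service under test, for illustration only.
class CustomerService:
    def __init__(self, points_bank):
        self.points_bank = points_bank

    def loyalty_summary(self, customer_id):
        return {"customer": customer_id,
                "points": self.points_bank.balance_for(customer_id)}

# Stub: canned response; the test doesn't care how many times it is called.
stub_points_bank = Mock()
stub_points_bank.balance_for.return_value = 15000
assert CustomerService(stub_points_bank).loyalty_summary(123)["points"] == 15000

# Mock: we go further and make sure the expected call actually happened.
mock_points_bank = Mock()
mock_points_bank.balance_for.return_value = 15000
CustomerService(mock_points_bank).loyalty_summary(123)
mock_points_bank.balance_for.assert_called_once_with(123)
```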
A Smarter Stub Service Normally for stub services I’ve rolled them myself. I’ve used everything from Apache or Nginx to embedded Jetty containers, or even command-line-launched Python web servers, to stand up stub servers for such test cases. I’ve probably reproduced the same work time and time again in creating these stubs.
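As a sketch of what rolling your own can look like, here is a minimal command-line-launched Python stub server that returns a canned balance for customer 123; the route and payload shape are illustrative assumptions, not anything defined in the book:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

# Canned responses keyed by request path (illustrative).
CANNED = {"/points/123/balance": {"customerId": 123, "balance": 15000}}

class StubPointsBank(BaseHTTPRequestHandler):
    def do_GET(self):
        body = CANNED.get(self.path)
        self.send_response(200 if body else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        if body:
            self.wfile.write(json.dumps(body).encode())

if __name__ == "__main__":
    # e.g. python stub_points_bank.py, then point the service under test at port 9001.
    HTTPServer(("localhost", 9001), StubPointsBank).serve_forever()
```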
Separating Deployment from Release One way in which we can catch more problems before they occur is to extend where we run our tests beyond the traditional predeployment steps. Instead, if we can deploy our software, and test it in situ prior to directing production loads against it, we can detect issues specific to a given environment.
Another example of this is what is called blue/green deployment. With blue/green, we have two copies of our software deployed at a time, but only one version of it is receiving real requests.
Canary Releasing With canary releasing, we are verifying our newly deployed software by directing a portion of production traffic against the system to see if it performs as expected.
We can also use techniques like blue/green deployment, where we deploy a new version of our software and test it in situ prior to directing our users to the new version.
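A minimal sketch of the routing idea behind canary releasing, assuming a routing layer we control; the upstream addresses and percentage are illustrative:

```python
import random

# Illustrative upstream addresses; in practice this lives in your router or proxy config.
CURRENT_VERSION = "http://catalog-v1.internal"
CANARY_VERSION = "http://catalog-v2.internal"
CANARY_PERCENTAGE = 5  # start small, increase as confidence in the new version grows

def choose_upstream():
    """Send a small slice of production traffic to the canary, the rest to the current version."""
    if random.uniform(0, 100) < CANARY_PERCENTAGE:
        return CANARY_VERSION
    return CURRENT_VERSION
```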
Summary Bringing this all together, what I have outlined here is a holistic approach to testing that hopefully gives you some general guidance on how to proceed when testing your own systems. To reiterate the basics:
- Optimize for fast feedback, and separate types of tests accordingly.
- Avoid the need for end-to-end tests wherever possible by using consumer-driven contracts.
- Use consumer-driven contracts to provide focus points for conversations between teams.
- Try to understand the trade-off between putting more effort into testing and detecting issues faster in production (optimizing for MTBF
...more
Logs, Logs, and Yet More Logs… Now the number of hosts we are running on is becoming a challenge. SSH-multiplexing to retrieve logs probably isn’t going to cut it now, and there isn’t a screen big enough for you to have terminals open on every host. Instead, we’re looking to use specialized subsystems to grab our logs and make them available centrally. One example of this is logstash, which can parse multiple logfile formats and can send them to downstream systems for further investigation.
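Centralized aggregation is much easier if every service emits logs in a structured, machine-parsable format. A minimal sketch, assuming JSON-per-line output that a tool like logstash could pick up and forward (the service name is illustrative):

```python
import json, logging, sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so aggregators can parse it reliably."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "catalog",          # illustrative service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("catalog")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Fetched album details for album 42")
```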
Synthetic Monitoring We can try to work out if a service is healthy by, for example, deciding what a good CPU level is, or what makes for an acceptable response time. If our monitoring system detects that the actual values fall outside this safe level, we can trigger an alert — something that a tool like Nagios is more than capable of.
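A minimal sketch in this spirit: periodically issue a fake (synthetic) request against an endpoint and alert if the response is slow or failing. The URL, threshold, and alert hook are illustrative assumptions; in practice the alert would feed a tool like Nagios.

```python
import time
import urllib.request

RESPONSE_TIME_THRESHOLD_SECONDS = 2.0   # illustrative "acceptable" response time

def synthetic_check(url, alert):
    """Issue a fake request and alert if the service is slow or unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            elapsed = time.monotonic() - start
            if response.status != 200 or elapsed > RESPONSE_TIME_THRESHOLD_SECONDS:
                alert(f"{url} unhealthy: status={response.status}, took {elapsed:.2f}s")
    except Exception as exc:  # timeouts, connection errors, non-2xx responses, ...
        alert(f"{url} unreachable: {exc}")

# Usage (illustrative): synthetic_check("http://catalog.internal/health", alert=print)
```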
Correlation IDs With a large number of services interacting to provide any given end-user capability, a single initiating call can end up generating multiple more downstream service calls.
One approach that can be useful here is to use correlation IDs. When the first call is made, you generate a GUID for the call. This is then passed along to all subsequent calls, as seen in Figure 8-5, and can be put into your logs in a structured way, much as you’ll already do with components like the log level or date. With the right log aggregation tooling, you’ll then be able to trace that event all the way through your system:
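A minimal sketch of the mechanics being described, assuming HTTP services and a header name of our own choosing (X-Correlation-Id here is an illustrative convention, not a standard from the book): generate a GUID at the edge, pass it on every downstream call, and include it in every structured log line.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # illustrative header name

def handle_inbound_request(headers):
    """Reuse the caller's correlation ID if present, otherwise mint one at the edge."""
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def downstream_headers(correlation_id):
    """Propagate the ID on every subsequent call so the whole chain can be traced."""
    return {CORRELATION_HEADER: correlation_id}

def log(correlation_id, level, message):
    # Put the ID into the logs in a structured way, alongside level and date.
    print(f'{{"correlation_id": "{correlation_id}", "level": "{level}", "msg": "{message}"}}')
```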
One of the real problems with correlation IDs is that you often don’t know you need one until after you already have a problem that could be diagnosed only if you had the ID at the beginning! This is especially problematic, as retrofitting correlation IDs in is very difficult; you need to handle them in a standardized way to be able to easily reconstitute call chains. Although it might seem like additional work up front, I would strongly suggest you consider putting them in as soon as you can, especially if your system will make use of event-driven architecture patterns, which can lead to
...more
The Cascade Cascading failures can be especially perilous. Imagine a situation where the network connection between our music shop website and the catalog service goes down. The services themselves appear healthy, but they can’t talk to each other.
Therefore, monitoring the integration points between systems is key. Each service instance should track and expose the health of its downstream dependencies, from the database to other collaborating services. You should also allow this information to be aggregated to give you a rolled-up picture. You’ll want to see the response time of the downstream calls, and also detect if it is erroring.
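A minimal sketch of tracking and exposing downstream health per dependency, assuming we wrap each outbound call ourselves; the function and report shape are illustrative, not a particular library's API:

```python
import time
from collections import defaultdict

# Per-dependency stats: call count, error count, cumulative response time.
_stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})

def call_downstream(name, fn):
    """Wrap an outbound call, recording response time and errors per dependency."""
    start = time.monotonic()
    try:
        return fn()
    except Exception:
        _stats[name]["errors"] += 1
        raise
    finally:
        _stats[name]["calls"] += 1
        _stats[name]["total_seconds"] += time.monotonic() - start

def health_report():
    """Expose a rolled-up picture, e.g. via a health endpoint, for aggregation."""
    return {
        name: {
            "calls": s["calls"],
            "error_rate": s["errors"] / s["calls"] if s["calls"] else 0.0,
            "avg_response_seconds": s["total_seconds"] / s["calls"] if s["calls"] else 0.0,
        }
        for name, s in _stats.items()
    }
```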
you can use libraries to implement a circuit breaker around network calls to help you handle cascading failures in a more elegant fashion, allowing you to more gracefully degrade your system. Some of these libraries, such as Hystrix for the JVM,
Riemann is an event server that allows for fairly advanced aggregation and routing of events and can form part of such a solution. Suro is Netflix’s data pipeline and operates in a similar space. Suro is explicitly used to handle both metrics associated with user behavior, and more operational data like application logs. This data can then be dispatched to a variety of systems, like Storm for real-time analysis, Hadoop for offline batch processing, or Kibana for log analysis.
- Track inbound response time at a bare minimum. Once you’ve done that, follow with error rates and then start working on application-level metrics.
- Track the health of all downstream responses, at a bare minimum including the response time of downstream calls, and at best tracking error rates. Libraries like Hystrix can help here.
- Standardize on how and where metrics are collected.
- Log into a standard location, in a standard format if possible. Aggregation is a pain if every service uses a different layout!
- Monitor the underlying operating system so you can track down rogue processes and do
...more
hopefully I’ve given you enough to get started with. If you want to know more, I go into some of these ideas and more in my earlier publication Lightweight Systems for Realtime Monitoring (O’Reilly).
Common Single Sign-On Implementations A common approach to authentication and authorization is to use some sort of single sign-on (SSO) solution. SAML, which is the reigning implementation in the enterprise space, and OpenID Connect both provide capabilities in this area.
SAML is a SOAP-based standard, and is known for being fairly complex to work with despite the libraries and tooling available to support it.
OpenID Connect is a standard that has emerged as a specific implementation of OAuth 2.0, based on the way Google and others handle SSO. It uses simpler REST calls, and in my opinion is likely to make inroads into enterprises due to its improved ease of use.
Single Sign-On Gateway Within a microservice setup, each service could decide to handle the redirection to, and handshaking with, the identity provider. Obviously, this could mean a lot of duplicated work. A shared library could help, but we’d have to be careful to avoid the coupling that can come from shared code. This also wouldn’t help if you had multiple different technology stacks.
Rather than having each service manage handshaking with your identity provider, you can use a gateway to act as a proxy, sitting between your services and the outside world (as shown in Figure 9-1). The idea is that we can centralize the behavior for redirecting the user and perform the handshake in only one place.
we still need to solve the problem of how the downstream service receives information about principals, such as their username or what roles they play.
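One common convention, assumed here rather than prescribed by the book, is for the gateway to populate HTTP headers with information about the principal once the handshake has succeeded, so downstream services can read it without repeating the handshake. A minimal sketch, with illustrative header names:

```python
# Sketch of what a gateway might add to a proxied request once the user is
# authenticated; the header names are an illustrative convention, not a standard.
def enrich_proxied_request(headers, principal):
    headers = dict(headers)
    headers["X-Authenticated-User"] = principal["username"]
    headers["X-Authenticated-Roles"] = ",".join(principal["roles"])
    return headers

# Usage (illustrative):
# enrich_proxied_request({}, {"username": "sam", "roles": ["CUSTOMER"]})
```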
One final problem with this approach is that it can lull you into a false sense of security. I like the idea of defense in depth — from network perimeter, to subnet, to firewall, to machine, to operating system, to the underlying hardware. You have the ability to implement security measures at all of these points, some of which we’ll get into shortly. I have seen some people put all their eggs in one basket, relying on the gateway to handle every step for them. And we all know what happens when we have a single point of failure…
This opens up the possibility of allowing access to your services in different ways but keeping the same source of truth — for example, using SAML to authenticate humans for SSO, and using API keys for service-to-service communication,
There is a type of vulnerability called the confused deputy problem, which in the context of service-to-service communication refers to a situation where a malicious party can trick a deputy service into making calls to a downstream service on his behalf that he shouldn’t be able to. For example, as a customer, when I log in to the online shopping system, I can see my account details. What if I could trick the online shopping UI into making a request for someone else’s details, maybe by making a call with my logged-in credentials? In this example, what is to stop me from asking for orders that
...more
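A minimal sketch of one way the downstream service can defend itself against acting as a confused deputy, assuming the authenticated principal is forwarded with the request (the order store and exception are illustrative): verify that the requested resource actually belongs to the caller rather than trusting the upstream UI.

```python
class NotAuthorized(Exception):
    pass

def get_orders(requested_customer_id, authenticated_customer_id, order_store):
    """Verify ownership before returning data, rather than trusting the calling service."""
    if requested_customer_id != authenticated_customer_id:
        raise NotAuthorized("Caller may only fetch their own orders")
    return order_store.orders_for(requested_customer_id)
```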
Intrusion Detection (and Prevention) System Intrusion detection systems (IDS) can monitor networks or hosts for suspicious behavior,
Understanding cross-functional requirements is all about considering aspects like durability of data, availability of services, throughput, and acceptable latency of services.
Knowing how much failure you can tolerate, or how fast your system needs to be, is driven by the users of your system. That in turn will help you understand which techniques will make the most sense for you. That said, your users won’t always be able to articulate what the exact requirements are. So you need to ask questions to help extract the right information, and help them understand the relative costs of providing different levels of service. As I mentioned previously, these cross-functional requirements can vary from service to service, but I would suggest defining some general
...more
Response time/latency How long should various operations take? It can be useful here to measure this with different numbers of users to understand how increasing load will impact the response time. Given the nature of networks, you’ll always have outliers, so setting targets for a given percentile of the responses monitored can be useful. The target should also include the number of concurrent connections/users you will expect your software to handle. So you might say, “We expect the website to have a 90th-percentile response time of 2 seconds when handling 200 concurrent connections per
...more
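A minimal sketch of checking such a percentile target against a set of measured response times; the 90th-percentile and 2-second figures come from the example above, and the nearest-rank percentile calculation is one simple choice among several:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of response times (in seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

def meets_target(response_times, pct=90, target_seconds=2.0):
    """e.g. 90th-percentile response time of 2 seconds under the expected load."""
    return percentile(response_times, pct) <= target_seconds

# Usage: meets_target([0.4, 1.1, 0.9, 2.5, 0.7])
```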
Can you expect a service to be down? Is this considered a 24/7 service? Some people like to look at periods of acceptable downtime when measuring availability, but how useful is this to someone calling your service? I should either be able to rely on your service responding or not. Measuring periods of downtime is really more useful from a historical reporting angle. Durability of data How much data loss is acceptable? How long should data be kept for? This is highly likely to change on a case-by-case basis.
You may decide to make use of performance tests, for example, to ensure your system meets acceptable performance targets, but you’ll want to make sure you are monitoring these stats in production as well!
Degrading Functionality An essential part of building a resilient system, especially when your functionality is spread over a number of different microservices that may be up or down, is the ability to safely degrade functionality. Let’s imagine a standard web page on our ecommerce site. To pull together the various parts of that website, we might need several microservices to play a part. One microservice might display the details about the album being offered for sale. Another might show the price and stock level. And we’ll probably be showing shopping cart contents too, which may be yet
...more
The right thing to do in any situation is often not a technical decision. We might know what is technically possible when the shopping cart is down, but unless we understand the business context we won’t understand what action we should be taking.
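A minimal sketch of what degrading gracefully might look like for the page described above, assuming each fragment comes from a different microservice (the service clients are illustrative) and that the business has decided the page is still worth showing without a working cart:

```python
def render_album_page(catalog, pricing, cart, customer_id, album_id):
    """Assemble the page from several microservices, degrading rather than failing outright."""
    page = {"album": catalog.album_details(album_id)}   # core content: required

    try:
        page["price_and_stock"] = pricing.for_album(album_id)
    except Exception:
        page["price_and_stock"] = None                  # hide price and stock, keep the page

    try:
        page["cart"] = cart.contents_for(customer_id)
    except Exception:
        # Business decision: still show the page, but without the cart widget.
        page["cart"] = {"unavailable": True}

    return page
```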
We ended up implementing three fixes to avoid this happening again: getting our timeouts right, implementing bulkheads to separate out different connection pools, and implementing a circuit breaker to avoid sending calls to an unhealthy system in the first place.
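Bulkheads can be as simple as giving each downstream dependency its own bounded pool, so a slow or broken service can exhaust only its own resources. A minimal sketch using separate thread pools per dependency; the pool names, sizes, and timeout are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream service: a stalled dependency can only
# tie up its own workers, not everyone else's.
BULKHEADS = {
    "catalog": ThreadPoolExecutor(max_workers=10),
    "pricing": ThreadPoolExecutor(max_workers=5),
}

def call_with_bulkhead(dependency, fn, timeout_seconds=1.0):
    """Run the call in the dependency's own pool, bounded by a timeout."""
    future = BULKHEADS[dependency].submit(fn)
    return future.result(timeout=timeout_seconds)
```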
The Antifragile Organization In his book Antifragile (Random House), Nassim Taleb talks about things that actually benefit from failure and disorder. Ariel Tseitlin used this idea to coin the concept of the antifragile organization with regard to how Netflix operates. The scale at which Netflix operates is well known, as is the fact that Netflix is based entirely on the AWS infrastructure. These two factors mean that it has to embrace failure well. Netflix goes beyond that by actually inciting failure to ensure that its systems are tolerant of it. Some organizations would be happy with game
...more
This highlight has been truncated due to consecutive passage length restrictions.
With a circuit breaker, after a certain number of requests to the downstream resource have failed, the circuit breaker is blown. All further requests fail fast while the circuit breaker is in its blown state. After a certain period of time, the client sends a few requests through to see if the downstream service has recovered, and if it gets enough healthy responses it resets the circuit breaker.
Getting the settings right can be a little tricky. You don’t want to blow the circuit breaker too readily, nor do you want to take too long to blow it. Likewise, you really want to make sure that the downstream service is healthy again before sending traffic. As with timeouts, I’d pick some sensible defaults and stick with them everywhere, then change them for specific cases.
While the circuit breaker is blown, you have some options. One is to queue up the requests and retry them later on.
If this call is being made as part of a synchronous call chain, however, it is probably better to fail fast.
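Pulling the description above together, here is a minimal circuit breaker sketch; the thresholds and reset timeout are illustrative, and this simplified version resets on a single healthy trial request rather than requiring several, unlike the behavior described above. Libraries such as Hystrix provide production-grade versions of the same idea.

```python
import time

class CircuitOpenError(Exception):
    """Raised to fail fast while the breaker is blown."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_seconds:
                raise CircuitOpenError("Failing fast: circuit is blown")
            # After the reset timeout, let a trial request through to test recovery.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # blow (or re-blow) the circuit
            raise
        else:
            # A healthy response resets the breaker.
            self.failures = 0
            self.opened_at = None
            return result
```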