Building Microservices: Designing Fine-Grained Systems
Kindle Notes & Highlights
Read between December 23, 2016 and January 2, 2017
42%
But even if we keep the number of hosts small, we still are going to have lots of services. That means multiple deployments to handle, services to monitor, logs to collect. Automation is essential.
Brian
Deployment automation is key, especially as the number of services and hosts proliferates
47%
The reason we want to test a single service by itself is to improve the isolation of the test to make finding and fixing problems faster. To achieve this isolation, we need to stub out all external collaborators so only the service itself is in scope, as Figure 7-5 shows.
Brian
Set of tests that are higher level than unit tests (usually a single method) but completely isolated from other services (what about separation from data stores?)
47%
Our service test suite needs to launch stub services for any downstream collaborators (or ensure they are running), and configure the service under test to connect to the stub services.
47%
When I talk about stubbing downstream collaborators, I mean that we create a stub service that responds with canned responses to known requests from the service under test.
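For what it's worth, such a stub can be a few lines of code. A minimal sketch, with the route, port, and payload invented for illustration:

```python
# Minimal stub of a downstream "points bank" service: canned responses to
# known requests, no real logic. Route, port, and payload are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

CANNED = {
    "/points/customer-123": {"customerId": "customer-123", "balance": 100},
}

class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = CANNED.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        payload = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # The service under test is configured to call http://localhost:9999
    # instead of the real collaborator.
    HTTPServer(("localhost", 9999), StubHandler).serve_forever()
```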
47%
Sometimes, though, mocks can be very useful to ensure that the expected side effects happen. For example, I might want to check that when I create a customer, a new points balance is set up for that customer. The balance between stubbing and mocking calls is a delicate one, and is just as fraught in service tests as in unit tests. In general, though, I use stubs far more than mocks for service tests.
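A hypothetical sketch of that exact customer/points example, using Python's unittest.mock (class and method names are invented):

```python
# A mock verifies that the expected side effect happened; a stub would
# only have returned canned data. CustomerService and its collaborator
# are hypothetical names for illustration.
from unittest.mock import Mock

class CustomerService:
    def __init__(self, points_bank):
        self.points_bank = points_bank

    def create_customer(self, customer_id):
        # ... persist the customer, then trigger the expected side effect:
        self.points_bank.create_balance(customer_id, initial=0)

def test_creating_customer_sets_up_points_balance():
    points_bank = Mock()
    CustomerService(points_bank).create_customer("customer-123")
    # The mock lets us assert the interaction actually occurred.
    points_bank.create_balance.assert_called_once_with("customer-123", initial=0)
```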
48%
We can deal with both of these problems elegantly by having multiple pipelines fan in to a single, end-to-end test stage. Here, whenever a new build of one of our services is triggered, we run our end-to-end tests, an example of which we can see in Figure 7-8.
48%
As test scope increases, so too does the number of moving parts. These moving parts can introduce test failures that do not show that the functionality under test is broken, but that some other problem has occurred.
48%
When we detect flaky tests, it is essential that we do our best to remove them. Otherwise, we start to lose faith in a test suite that “always fails like that.” A test suite with flaky tests can become a victim of what Diane Vaughan calls the normalization of deviance — the idea that over time we can become so accustomed to things being wrong that we start to accept them as being normal and not a problem.1
Brian
Yes.
49%
In “Eradicating Non-Determinism in Tests”, Martin Fowler advocates the approach that if you have flaky tests, you should track them down and if you can’t immediately fix them, remove them from the suite so you can treat them.
Brian
Agree with removing these from the test suite. Fowler's post suggests keeping these tests (in "quarantine") but limiting the size of the quarantine to a small number, forcing yourself to fix one if you add beyond that limit. Eventually get to zero, one hopes, but if one pops up, immediately fix or quarantine. Keep tests such that a failure means regression (not just bad luck)
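The quarantine-with-a-cap policy in the note above is easy to enforce mechanically. A sketch using pytest's collection hook (the marker name and cap are arbitrary choices):

```python
# conftest.py sketch: flaky tests get @pytest.mark.quarantine and are
# skipped in the main suite, but the build fails if the quarantine grows
# past a cap -- forcing a fix before more tests can be parked.
import pytest

QUARANTINE_CAP = 5  # arbitrary; tune to taste

def pytest_collection_modifyitems(config, items):
    quarantined = [i for i in items if i.get_closest_marker("quarantine")]
    if len(quarantined) > QUARANTINE_CAP:
        raise pytest.UsageError(
            f"{len(quarantined)} quarantined tests exceeds cap of "
            f"{QUARANTINE_CAP}; fix one before quarantining another"
        )
    for item in quarantined:
        item.add_marker(pytest.mark.skip(reason="quarantined as flaky"))
```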
49%
Sometimes organizations react by having a dedicated team write these tests. This can be disastrous. The team developing the software becomes increasingly distant from the tests for its code.
Brian
I haven't seen this pathology, but my gut sides with the author here
49%
The best balance I have found is to treat the end-to-end test suite as a shared codebase, but with joint ownership. Teams are free to check in to this suite, but the ownership of the health of the suite has to be shared between the teams developing the services themselves. If you want to make extensive use of end-to-end tests with multiple teams I think this approach is essential, and yet I have seen it done very rarely, and never without issue.
Brian
How to deal with end-to-end test ownership: it's hard, but make the suite a first-class shared codebase among the related services covered by the tests.
50%
But what happens with 3, 4, 10, or 20 services? Very quickly these test suites become hugely bloated, and in the worst case can result in Cartesian-like explosion in the scenarios under test. This situation worsens if we fall into the trap of adding a new end-to-end test for every piece of functionality we add.
50%
The best way to counter this is to focus on a small number of core journeys to test for the whole system.
50%
We are trying to ensure that when we deploy a new service to production, our changes won’t break consumers. One way we can do this without requiring testing against the real consumer is by using a consumer-driven contract (CDC).
50%
With CDCs, we are defining the expectations of a consumer on a service (or producer). The expectations of the consumers are captured in code form as tests, which are then run against the producer. If done right, these CDCs should be run as part of the CI build of the producer, ensuring that it never gets deployed if it breaks one of these contracts.
Brian
These contract tests are maintained by clients of the service? This might be intriguing.
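In spirit, a CDC is just a consumer-authored test executed in the producer's CI. A hand-rolled sketch (this is not the Pact API; the endpoint and fields are invented):

```python
# A consumer-driven contract, hand-rolled: the *consumer* team writes this
# test, and it runs in the *producer's* CI against a locally started
# instance of the producer. Endpoint and fields are illustrative.
import requests  # third-party HTTP client

PRODUCER_URL = "http://localhost:8080"  # producer started by the CI build

def test_customer_lookup_contract():
    resp = requests.get(f"{PRODUCER_URL}/customers/customer-123")
    assert resp.status_code == 200
    body = resp.json()
    # The consumer pins only the fields it actually uses, so unrelated
    # producer changes don't break the contract.
    assert isinstance(body["id"], str)
    assert isinstance(body["email"], str)
```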
50%
A good practice here is to have someone from the producer and consumer teams collaborate on creating the tests,
Brian
OK, so they recommend it be a collaboration between the service and client teams. How about maintenance?
50%
Pact is a consumer-driven testing tool that was originally developed in-house at RealEstate.com.au, but is now open source,
50%
Pact works in a very interesting way, as summarized in Figure 7-11. The consumer starts by defining the expectations of the producer using a Ruby DSL. Then, you launch a local mock server, and run this expectation against it to create the Pact specification file. The Pact file is just a formal JSON specification; you could obviously handcode these, but using the language API is much easier. This also gives you a running mock server that can be used for further isolated tests of the consumer.
Brian
Check it out; see if it's still actively maintained. Also: does this require fragile coupling of data in the test env with test expectations? Or maybe you define a stateful scenario that starts with a blank slate? Not quite clear on the role of the mock server when defining expectations. Is this the same mock that the client dev team is using to isolate dev/testing from the actual downstream service? That's a smart pattern, really. It at least ensures that the mock and the real thing agree with one another (even if they both deviate from published APIs).
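For reference, the generated Pact file is plain JSON, roughly this shape (simplified; the real Pact specification includes more fields, such as matching rules and metadata):

```python
# Rough shape of a generated Pact file, written here as a Python dict.
pact = {
    "consumer": {"name": "web-shop"},
    "provider": {"name": "customer-service"},
    "interactions": [
        {
            "description": "a request for customer-123",
            "request": {"method": "GET", "path": "/customers/customer-123"},
            "response": {
                "status": 200,
                "body": {"id": "customer-123", "email": "a@example.com"},
            },
        }
    ],
}
```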
50%
For this to work, the producer codebase needs access to the Pact file.
Brian
Why? Shouldn't these be HTTP calls into a deployed service (even if downstream calls are mocked themselves)? Who should specify what the downstream mock/stub behavior is? Not part of the test specification, I hope.
50%
Pacto, which is also a Ruby tool used for consumer-driven testing. It has the ability to record interactions between client and server to generate the expectations. This makes writing consumer-driven contracts for existing services fairly easy. With Pacto, once generated these expectations are more or less static, whereas with Pact you regenerate the expectations in the consumer with every build.
50%
The fact that you can define expectations for capabilities the producer may not even have yet also better fits into a workflow where the producing service is still being (or has yet to be) developed.
Brian
They mean with Pact you can, but not with Pacto? Can't you do it with either? Or do they mean that if you have to rebuild the spec every time it won't work, but with Pacto you build only once? I like that Pact is also testing the mocks that the consumer is using.
51%
A common example of this is the smoke test suite, a collection of tests designed to be run against newly deployed software to confirm that the deployment worked. These tests help you pick up any local environmental issues. If you’re using a single command-line command to deploy any given microservice (and you should), this command should run the smoke tests automatically.
Brian
Smoke tests: deploy, but test in situ before directing prod traffic at the new deployment
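A sketch of such a single deploy entry point that runs the smoke tests automatically (the script name, service name, and URL are all placeholders):

```python
# deploy.py sketch: one command deploys a microservice and then runs the
# smoke tests in situ, failing loudly before traffic is switched over.
import subprocess, sys
import urllib.request

def deploy(service: str) -> None:
    # push-artifact.sh is a hypothetical deployment script
    subprocess.run(["./push-artifact.sh", service], check=True)

def smoke_test(base_url: str) -> None:
    # Minimal smoke check: the newly deployed instance answers its health
    # endpoint before any production traffic is directed at it.
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        if resp.status != 200:
            raise RuntimeError(f"health check returned {resp.status}")

if __name__ == "__main__":
    deploy("catalog-service")
    smoke_test("http://new-deployment.internal:8080")
    print("deploy OK", file=sys.stderr)
```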
51%
Another example of this is what is called blue/green deployment. With blue/green, we have two copies of our software deployed at a time, but only one version of it is receiving real requests.
51%
It is common to keep the old version around for a short period of time, allowing for a fast fallback if you detect any errors.
Brian
Great idea (if you can afford the hosts): keep the old deployment around till the new one proves healthy in prod. Rollback /might/ be as simple as reverting to the old fleet (no redeploy required). A rolling release is similar but takes longer to deploy (and to revert).
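A toy sketch of the blue/green flip, where rollback is just pointing the router back (fleet names and URLs invented):

```python
# Blue/green sketch: the router holds a single pointer; "release" flips it,
# and rollback is just flipping it back -- no redeploy.
FLEETS = {
    "blue": "http://blue.internal:8080",
    "green": "http://green.internal:8080",
}
live = "blue"  # currently serving production traffic

def release(new_fleet: str) -> None:
    global live
    old = live
    live = new_fleet  # all new requests now hit the new fleet
    print(f"live={live}; keeping {old} warm for fast fallback")

def rollback(old_fleet: str) -> None:
    global live
    live = old_fleet  # instant revert: the old fleet never went away

release("green")
# ... errors detected in production ...
rollback("blue")
```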
51%
With canary releasing, we are verifying our newly deployed software by directing amounts of production traffic against the system to see if it performs as expected.
Brian
Canary releasing: release and verify the results of prod traffic against the prior release (which is actually handling the requests)
52%
When considering canary releasing, you need to decide if you are going to divert a portion of production requests to the canary or just copy production load.
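Both options can be sketched in a few lines: diverting sends a slice of real users to the canary, while copying (shadowing) replays traffic without affecting responses. Weights and names here are illustrative:

```python
# Canary routing sketch: divert a slice of real traffic, or copy it so the
# canary's behavior can be compared but never reaches users.
import random

CANARY_FRACTION = 0.05  # 5% of production requests

def route(request, stable, canary, shadow=False):
    if shadow:
        # Copy mode: the stable response is what the user gets; the
        # canary sees the same request purely for comparison.
        stable_resp = stable(request)
        _ = canary(request)  # fire-and-forget in a real system
        return stable_resp
    # Divert mode: a small random slice of users actually hits the canary.
    handler = canary if random.random() < CANARY_FRACTION else stable
    return handler(request)

# usage inside the router: route(req, stable_handler, canary_handler)
```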
52%
Sometimes expending the same effort into getting better at remediation of a release can be significantly more beneficial than adding more automated functional tests. In the web operations world, this is often referred to as the trade-off between optimizing for mean time between failures (MTBF) and mean time to repair (MTTR).
Brian
The metrics you may ultimately want to optimize; importantly, find the right tradeoff/balance. Automated testing will help up to a point, but beyond that you might reduce the impact of a failure by investing in monitoring and faster rollback.
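A standard back-of-the-envelope model makes the tradeoff concrete: availability ≈ MTBF / (MTBF + MTTR). With an MTBF of 1,000 hours and an MTTR of 1 hour, availability is 1000/1001 ≈ 99.90%. Doubling MTBF to 2,000 hours yields 2000/2001 ≈ 99.95%, while simply halving MTTR to 30 minutes yields 1000/1000.5 ≈ 99.95% as well; the same availability win, often for far less effort.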
53%
Due to the time it takes to run performance tests, it isn’t always feasible to run them on every check-in. It is a common practice to run a subset every day, and a larger set every week. Whatever approach you pick, make sure you run them as regularly as you can. The longer you go without running performance tests, the harder it can be to track down the culprit.
53%
Chapter 8. Monitoring
Brian
This is a good chapter, turns into a good checklist.
56%
One approach that can be useful here is to use correlation IDs. When the first call is made, you generate a GUID for the call. This is then passed along to all subsequent calls, as seen in Figure 8-5, and can be put into your logs in a structured way, much as you’ll already do with components like the log level or date. With the right log aggregation tooling, you’ll then be able to trace that event all the way through your system:
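A minimal sketch of the pattern: mint a GUID at the edge if no inbound ID exists, and emit it as a structured log field (the header name is a common convention, not a standard):

```python
# Correlation ID sketch: generate a GUID at the edge if one isn't already
# present, stash it for the duration of the request, and emit it as a
# structured log field next to level/message.
import json, uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def handle_inbound(headers: dict) -> None:
    # The first service in the chain mints the ID; everyone else reuses it.
    cid = headers.get("X-Correlation-Id") or str(uuid.uuid4())
    correlation_id.set(cid)

def log_event(level: str, message: str) -> None:
    # Structured logging: the correlation ID sits beside the other fields
    # so aggregation tooling can trace one event across every service.
    print(json.dumps({"level": level, "correlation_id": correlation_id.get(),
                      "message": message}))

handle_inbound({})                 # edge service: no inbound ID, mint one
log_event("INFO", "order placed")  # {"level": "INFO", "correlation_id": ...}
```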
56%
Given that you’ll already want log aggregation for other purposes, it feels much simpler to instead make use of data you’re already collecting than have to plumb in additional sources of data.
56%
This is especially problematic, as retrofitting correlation IDs in is very difficult; you need to handle them in a standardized way to be able to easily reconstitute call chains. Although it might seem like additional work up front, I would strongly suggest you consider putting them in as soon as you can, especially if your system will make use of event-driven architecture patterns, which can lead to some odd emergent behavior.
Brian
Add correlation ID tracking as early as possible. Now.
56%
For example, if you are using HTTP as the underlying protocol for communication, just wrap a standard HTTP client library, adding in code to make sure you propagate the correlation IDs in the headers.
Brian
Can consider a really thin shared library for correlation ID tracking. Should definitely have tests for it; maybe the standardized lib can be used for some tests across the board instead? Just make sure the IDs propagate.
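A sketch of that thin wrapper on the outbound side, so propagation can't be forgotten (assumes the ContextVar-style storage from the earlier sketch):

```python
# Thin client-wrapper sketch (the "really thin shared library" from the
# note): every outbound call automatically forwards the current
# correlation ID.
import urllib.request
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def http_get(url: str) -> bytes:
    req = urllib.request.Request(url)
    req.add_header("X-Correlation-Id", correlation_id.get())
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()
```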
56%
Therefore, monitoring the integration points between systems is key. Each service instance should track and expose the health of its downstream dependencies, from the database to other collaborating services. You should also allow this information to be aggregated to give you a rolled-up picture. You’ll want to see the response time of the downstream calls, and also detect if it is erroring.
Brian
Monitor your downstream dependencies as well.
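One way to sketch this: wrap every downstream call to record latency and errors, and expose a rolled-up view (dependency names and the aggregation shape are illustrative):

```python
# Per-dependency health tracking sketch: each downstream call records
# latency and errors; health() exposes the rollup per dependency.
import time
from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})

def call_downstream(name: str, fn):
    start = time.monotonic()
    try:
        return fn()
    except Exception:
        stats[name]["errors"] += 1
        raise
    finally:
        stats[name]["calls"] += 1
        stats[name]["total_ms"] += (time.monotonic() - start) * 1000

def health() -> dict:
    # Rolled-up view: response times and error rates per dependency,
    # ready to be aggregated across service instances.
    return {
        name: {
            "avg_ms": s["total_ms"] / s["calls"] if s["calls"] else 0.0,
            "error_rate": s["errors"] / s["calls"] if s["calls"] else 0.0,
        }
        for name, s in stats.items()
    }
```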
57%
Track inbound response time at a bare minimum. Once you’ve done that, follow with error rates and then start working on application-level metrics.
57%
Track the health of all downstream responses, at a bare minimum including the response time of downstream calls, and at best tracking error rates. Libraries like Hystrix can help here.
57%
Log into a standard location, in a standard format if possible. Aggregation is a pain if every ser...
57%
Ensure your metric storage tool allows for aggregation at a system or service level, and drill down to individual hosts.
57%
Have a single, queryable tool for aggregating and storing logs.
57%
Strongly consider standardizing on the use of correlation IDs.
57%
Investigate the possibility of unifying how you aggregate all of your various metrics by seeing if a tool like Suro or Riemann makes sense for you.
58%
If you go the gateway route, make sure your developers can launch their services behind one without too much work.
Brian
If you use a gateway service for authentication in front of a number of services, make it easy to insert one in the dev environment, to test for problems that might otherwise only come up in prod. Also: be sure to follow defense in depth at each service; don't do mullet-style security (business up front, party in the back) by putting all 'security' into one place.
58%
Do be careful, though. Gateway layers tend to take on more and more functionality, which itself can end up being a giant coupling point. And the more functionality something has, the greater the attack surface.
Brian
Keep authentication/routing gateway layers as thin as you can
59%
These decisions need to be local to the microservice in question. I have seen people use the various attributes supplied by identity providers in horrible ways, using really fine-grained roles like CALL_CENTER_50_DOLLAR_REFUND, where they end up putting information specific to one part of one of our system’s behavior into their directory services. This is a nightmare to maintain and gives very little scope for our services to have their own independent lifecycle, as suddenly a chunk of information about how a service behaves lives elsewhere, perhaps in a system managed by a different part of ...
59%
Self-signed certificates are not easily revokable, and thus require a lot more thought around disaster scenarios. See if you can dodge all this work by avoiding self-signing altogether.
61%
There is a type of vulnerability called the confused deputy problem, which in the context of service-to-service communication refers to a situation where a malicious party can trick a deputy service into making calls to a downstream service on his behalf that he shouldn’t be able to. For example, as a customer, when I log in to the online shopping system, I can see my account details. What if I could trick the online shopping UI into making a request for someone else’s details, maybe by making a call with my logged-in credentials?
61%
Depending on the sensitivity of the operation in question, you might have to choose between implicit trust, verifying the identity of the caller, or asking the caller to provide the credentials of the original principal.
Brian
Can choose implicit trust for authentication but verify authorization? No credential passing needed, just the identity of the original caller. Some risk of breaching the perimeter and then calling willy-nilly to downstream services. Could be devastating depending on the identity that gets compromised. If customers only have access to their own data, we should be fine, but if there are admin accounts, those will be the targets.
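A toy guard against the confused deputy: the downstream service checks the asserted original principal against the resource being requested (the credential format is a stand-in, e.g. a verified JWT subject claim):

```python
# Confused-deputy guard sketch: the downstream service refuses to serve
# data unless the asserted original principal actually owns the resource.
class Forbidden(Exception):
    pass

def get_account_details(account_id: str, original_principal: str) -> dict:
    # Trusting the *calling service* is not enough: check the original
    # caller's identity against the data being requested. (Admin access
    # would need an explicit role check on top of this.)
    if original_principal != account_id:
        raise Forbidden(f"{original_principal} may not read {account_id}")
    return {"accountId": account_id}  # ... fetch real details here

get_account_details("customer-123", original_principal="customer-123")   # ok
# get_account_details("customer-456", original_principal="customer-123") # Forbidden
```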
62%
we need to make sure that our backups are also encrypted. This also means that we need to know which keys are needed to handle which version of data, especially if the keys change. Having clear key management becomes fairly important.
Brian
Good point re: key management, so you know which key to use to decrypt backups. Probably don't want to decrypt/re-encrypt all past backups when you rotate keys (though I guess technically that might be more secure, assuming the re-encrypt step is very safe?)
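A sketch of the key-versioning idea from the note: tag each backup with the ID of the key that encrypted it, so rotation never orphans the archive (encrypt/decrypt here are placeholders, not real crypto):

```python
# Key-management sketch: every encrypted backup carries the ID of the key
# that encrypted it, so old backups stay readable after rotation.
def encrypt(key: bytes, data: bytes) -> bytes:  # placeholder, NOT real crypto
    return data[::-1]

def decrypt(key: bytes, data: bytes) -> bytes:  # placeholder, NOT real crypto
    return data[::-1]

KEYS = {"key-2016-12": b"old-key", "key-2017-01": b"current-key"}
CURRENT_KEY_ID = "key-2017-01"

def write_backup(plaintext: bytes) -> dict:
    return {
        "key_id": CURRENT_KEY_ID,  # records which key decrypts this backup
        "ciphertext": encrypt(KEYS[CURRENT_KEY_ID], plaintext),
    }

def read_backup(backup: dict) -> bytes:
    # Old backups name their own key, so rotating CURRENT_KEY_ID doesn't
    # force a decrypt/re-encrypt sweep over the whole archive.
    return decrypt(KEYS[backup["key_id"]], backup["ciphertext"])

b = write_backup(b"customer data")
assert read_backup(b) == b"customer data"
```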
62%
Sensitive information needs to be culled to ensure we aren’t leaking important data into our logs, which could end up being a great target for attackers.
Brian
Whitelisting what gets logged seems the safest approach, but that may not be easy with free-form logging. Wonder what the best practices are here. Might be some opportunity for a shared library, especially for request/response logging, but perhaps also for converting objects to strings: force each class to specify a safe-to-string method, or whitelist attributes to include in a generic to-string?
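A sketch of the whitelist idea from the note: classes opt fields in to logging, so sensitive attributes never leak by default (all names invented):

```python
# Safe-to-string sketch: subclasses whitelist which attributes may appear
# in logs; everything else is excluded by default.
class SafeLoggable:
    LOG_FIELDS: tuple = ()  # subclasses opt fields *in* to logging

    def to_log(self) -> dict:
        return {f: getattr(self, f) for f in self.LOG_FIELDS}

class Customer(SafeLoggable):
    LOG_FIELDS = ("customer_id",)  # email/card number deliberately absent

    def __init__(self, customer_id, email, card_number):
        self.customer_id = customer_id
        self.email = email
        self.card_number = card_number

c = Customer("customer-123", "a@example.com", "4111-1111-1111-1111")
print(c.to_log())  # {'customer_id': 'customer-123'} -- nothing sensitive
```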
63%
When logging a request from a user, do we need to store the entire IP address forever, or could we replace the last few digits with x? Do we need to store someone’s name, age, gender, and date of birth in order to provide her with product offers, or is her age range and postcode enough information?
Brian
Makes sense
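A tiny sketch of the IP-masking idea from the highlight (IPv4 only, purely illustrative):

```python
# Data-minimization sketch: keep enough of the address to be useful,
# drop enough to reduce what a leak exposes.
def mask_ip(ip: str) -> str:
    octets = ip.split(".")
    return ".".join(octets[:3] + ["x"])  # 203.0.113.42 -> 203.0.113.x

assert mask_ip("203.0.113.42") == "203.0.113.x"
```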