Kindle Notes & Highlights
by Sam Newman
Read between April 25 and November 21, 2021
In the world of the monolithic application, we at least have a very obvious place to start our investigations. Website slow? It’s the monolith. Website giving odd errors? It’s the monolith. CPU at 100%? Monolith. Smell of burning? Well, you get the idea. Having a single point of failure also makes failure investigation somewhat simpler!
monitor the small things, and use aggregation to see the bigger picture.
We may even get advanced and use logrotate to move old logs out of the way and avoid them taking up all our disk space.
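If log rotation is handled inside the application rather than by logrotate itself, the same idea is a few lines in Python's standard library. A minimal sketch (the log file name and logger name are just illustrations):

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Application-level analogue of logrotate: roll the file over at midnight
# and keep only the last seven files so old logs don't fill the disk.
handler = TimedRotatingFileHandler("service.log", when="midnight", backupCount=7)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("catalog-service")  # illustrative service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order 123 accepted")
```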
The answer is collection and central aggregation of as much as we can get our hands on, from logs to application metrics.
The smarter we are in tracking our trends and knowing what to do with them, the more cost effective and responsive our systems can be.
This fake event we created is an example of a synthetic transaction. We used this synthetic transaction to ensure the system was behaving semantically, which is why this technique is often called semantic monitoring.
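A synthetic transaction can be as small as a scheduled job that exercises a real user journey end to end and checks the semantic outcome, not just that something returned a 200. A minimal sketch, assuming a hypothetical /orders endpoint and a customer account reserved for fake orders:

```python
import requests

ORDER_URL = "https://shop.example.com/orders"   # hypothetical endpoint
TEST_CUSTOMER = "synthetic-monitoring-user"     # account reserved for fake orders

def synthetic_order_check() -> bool:
    """Place a fake order and check the system behaved semantically."""
    response = requests.post(
        ORDER_URL,
        json={"customer": TEST_CUSTOMER, "sku": "TEST-SKU", "quantity": 1},
        timeout=5,
    )
    # Not "did the service respond?" but "did the business operation succeed?"
    return response.status_code == 201 and response.json().get("status") == "confirmed"

if __name__ == "__main__":
    if not synthetic_order_check():
        print("ALERT: synthetic order failed - investigate order processing")
```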
What our people want to see and react to right now is different than what they need when drilling down. So, for the type of person who will be looking at this data, consider the following:
- What they need to know right now
- What they might want later
- How they like to consume data
For each service: Track inbound response time at a bare minimum. Once you’ve done that, follow with error rates and then start working on application-level metrics.
Track the health of all downstream responses, at a bare minimum including the response time of downstream calls, and at best tracking error rates. Libraries like Hystrix can help here (a rough sketch of this sort of tracking follows after this list).
Standardize on how and where metrics ...
Log into a standard location, in a standard format if possible. Aggregation is a pain if every ser...
Monitor the underlying operating system so you can track down rogue processes a...
For the system: Aggregate host-level metrics like CPU together with appl...
Ensure your metric storage tool allows for aggregation at a system or service level, and dril...
Ensure your metric storage tool allows you to maintain data long enough to understa...
Have a single, queryable tool for aggregating a...
Strongly consider standardizing on the use of ...
Understand what requires a call to action, and structure alerting and d...
Investigate the possibility of unifying how you aggregate all of your various metrics by seeing if a tool like Sur...
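As flagged above, tracking inbound response times and error rates, and doing the same for downstream calls, can start out very small. A minimal sketch (in a real system these numbers would be pushed to a metrics backend such as Graphite or StatsD rather than kept in memory):

```python
import time
from collections import defaultdict

# In-memory stand-in for a real metrics backend such as StatsD or Graphite.
response_times = defaultdict(list)
error_counts = defaultdict(int)

def timed_call(metric_name, func, *args, **kwargs):
    """Record latency and failures for any inbound or downstream call."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    except Exception:
        error_counts[metric_name] += 1
        raise
    finally:
        response_times[metric_name].append(time.monotonic() - start)

# Example: wrap a downstream call so its latency and failures are tracked.
# stock = timed_call("inventory-service.get_stock", get_stock, sku="ABC-123")
```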
authentication is the process by which we confirm that a party is who she says she is.
Generally, when we’re talking abstractly about who or what is being authenticated, we refer to that party as the principal.
Authorization is the mechanism by which we map from a principal to the action we are allowing her to do.
SAML is a SOAP-based standard, and is known for being fairly complex to work with despite the libraries and tooling available to support it. OpenID Connect is a standard that has emerged as a specific implementation of OAuth 2.0, based on the way Google and others handle SSO. It uses simpler REST calls, and in my opinion is likely to make inroads into enterprises due to its improved ease of use.
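Because OpenID Connect delivers identity as a signed JWT fetched over plain HTTPS calls, verifying a login on the receiving side can be short. A minimal sketch using the PyJWT library; the issuer, client ID, and JWKS URL are illustrative:

```python
import jwt                      # PyJWT
from jwt import PyJWKClient

# Illustrative values: in OpenID Connect these come from the provider's
# discovery document and your client registration.
ISSUER = "https://accounts.example.com"
CLIENT_ID = "my-client-id"
jwks_client = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")

def validate_id_token(id_token: str) -> dict:
    """Check the ID token's signature, audience, and issuer; return its claims."""
    signing_key = jwks_client.get_signing_key_from_jwt(id_token)
    return jwt.decode(
        id_token,
        signing_key.key,
        algorithms=["RS256"],
        audience=CLIENT_ID,
        issuer=ISSUER,
    )
```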
I have seen some people put all their eggs in one basket, relying on the gateway to handle every step for them. And we all know what happens when we have a single point of failure…
the more functionality something has, the greater the attack surface.
favor coarse-grained roles, modeled around how your organization works.
we are building software to match how our organization works. So use your roles in this way too.
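Coarse-grained, organization-shaped roles can be enforced with nothing more than a mapping from role to permitted actions; the role and action names below are purely illustrative:

```python
# Coarse-grained roles modeled on how the organization works,
# rather than fine-grained per-endpoint permissions.
ROLE_PERMISSIONS = {
    "CALL_CENTER": {"view_customer", "create_order"},
    "WAREHOUSE": {"view_order", "mark_order_shipped"},
    "FINANCE": {"view_order", "issue_refund"},
}

def is_authorized(principal_roles: set[str], action: str) -> bool:
    """Authorization: map from a principal's roles to the action being attempted."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in principal_roles)

# A call-center agent may create orders but may not issue refunds.
assert is_authorized({"CALL_CENTER"}, "create_order")
assert not is_authorized({"CALL_CENTER"}, "issue_refund")
```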
should an attacker penetrate your network, you will have little protection against a typical man-in-the-middle attack.
This is by far the most common form of inside-perimeter trust I see in organizations. They may decide to run this traffic over HTTPS, but they don’t do much else. I’m not saying that is a good thing! For most of the organizations I see using this model, I worry that the implicit trust model is not a conscious decision, but more that people are unaware of the risks in the first place.
Self-signed certificates are not easily revocable, and thus require a lot more thought around disaster scenarios. See if you can dodge all this work by avoiding self-signing altogether.
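When service-to-service traffic does run over HTTPS with certificates issued by a properly managed internal CA, the client side need not be much more than pointing at that CA bundle and presenting a client certificate. A rough sketch with the requests library; the URL and file paths are illustrative:

```python
import requests

# Verify the server against the internal CA that issued its certificate
# (rather than trusting a self-signed cert or disabling verification),
# and present a client certificate so the server can authenticate us too.
response = requests.get(
    "https://inventory.internal.example.com/stock/ABC-123",
    verify="/etc/ssl/internal-ca.pem",
    cert=("/etc/ssl/order-service.crt", "/etc/ssl/order-service.key"),
    timeout=5,
)
response.raise_for_status()
```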
A word of warning, though: if you are going to create service accounts, try to keep their use narrow. So consider each microservice having its own set of credentials. This makes revoking/changing access easier if the credentials become compromised, as you only need to revoke the set of credentials that have been affected.
An alternative approach, as used extensively by Amazon’s S3 APIs for AWS and in parts of the OAuth specification, is to use a hash-based message authentication code (HMAC) to sign the request. With HMAC the request body, along with a shared secret key, is hashed, and the resulting hash is sent along with the request. The server then uses its own copy of the secret key and the request body to re-create the hash. If it matches, it allows the request. The nice thing here is that if a man in the middle messes with the request, the hash won’t match and the server knows the request has been tampered with.
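The mechanics fit in a few lines of Python's standard library. A minimal sketch; the secret and example body are illustrative, and in practice the key is distributed out of band and never travels with the request:

```python
import hashlib
import hmac

SHARED_SECRET = b"key-distributed-out-of-band"  # illustrative; keep in a secret store

def sign_request(body: bytes) -> str:
    """Client side: hash the request body with the shared secret."""
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify_request(body: bytes, received_signature: str) -> bool:
    """Server side: recompute the hash; a mismatch means tampering or a bad key."""
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_signature)

body = b'{"amount": 100, "currency": "GBP"}'
signature = sign_request(body)          # sent alongside the request, e.g. in a header
assert verify_request(body, signature)
assert not verify_request(b'{"amount": 999, "currency": "GBP"}', signature)  # tampered
```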
This problem, unfortunately, has no simple answer, because it isn’t a simple problem. Be aware that it exists, though. Depending on the sensitivity of the operation in question, you might have to choose between implicit trust, verifying the identity of the caller, or asking the caller to provide the credentials of the original principal.
The easiest way you can mess up data encryption is to try to implement your own encryption algorithms, or even try to implement someone else’s. Whatever programming language you use, you’ll have access to reviewed, regularly patched implementations of well-regarded encryption algorithms. Use those! And subscribe to the mailing lists/advisory lists for the technology you choose to make sure you are aware of vulnerabilities as they are found so you can keep them patched and up to date.
For passwords, you should consider using a technique called salted password hashing
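A minimal sketch of salted hashing using only the standard library; real systems typically reach for a dedicated library such as bcrypt, scrypt, or Argon2, and the iteration count here is just an illustrative choice:

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # deliberately slow, to make brute-forcing expensive

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); store both. A per-user salt defeats rainbow tables."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored_digest)
```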
we need to be careful about what information we store in our logs! Sensitive information needs to be culled to ensure we aren’t leaking important data into our logs, which could end up being a great target for attackers.
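One way to cull sensitive data is a logging filter that redacts known fields before anything reaches the aggregated logs. A minimal sketch; the field patterns and service name are illustrative:

```python
import logging
import re

# Illustrative patterns for values that must never reach aggregated logs.
SENSITIVE = re.compile(r"(password|card_number|ssn)=\S+", re.IGNORECASE)

class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", str(record.msg))
        return True  # keep the record, just with sensitive values removed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payment-service")
logger.addFilter(RedactingFilter())

# Recorded as: "charge attempted card_number=[REDACTED] amount=10.00"
logger.info("charge attempted card_number=4111111111111111 amount=10.00")
```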
Intrusion detection systems (IDS) can monitor networks or hosts for suspicious behavior, reporting problems when they see them. Intrusion prevention systems (IPS), as well as monitoring for suspicious activity, can step in to stop it from happening.
For the browser, we’ll use a mix of standard HTTP traffic for nonsecure content, to allow it to be cached. For secure, logged-in pages, all secure content will be sent over HTTPS, giving our customers extra protection if they are doing things like running on public WiFi networks.
The data that pertains to an individual, or could be used to derive information about an individual, must be the data we are most careful about.
The German phrase Datensparsamkeit represents this concept. Originating from German privacy legislation, it encapsulates the concept of only storing as much information as is absolutely required to fulfill business operations or satisfy local laws.
don’t write your own crypto.
Reinventing the wheel in many cases is often just a waste of time, but when it comes to security it can be outright dangerous.
Getting people familiar with the OWASP Top Ten list and OWASP’s Security Testing Framework can be a great place to start.
Moore’s law, for example, which states that the density of transistors on integrated circuits doubles every two years, has proved to be uncannily accurate (although some people predict that this trend is already slowing).
Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure.
“If you have four groups working on a compiler, you’ll get a 4-pass compiler.”
Netflix designed the organizational structure for the system architecture it wanted.
This single team finds it easy to communicate about proposed changes and refactorings, and typically has a good sense of ownership.
When the cost of coordinating change increases, one of two things happens. Either people find ways to reduce the coordination/communication costs, or they stop making changes. The latter is exactly how we end up with large, hard-to-maintain codebases.
It is also worth noting at this point that, at least based on the observations of the authors of the Exploring the Duality Between Product and Organizational Architectures report previously referenced, if the organization building the system is more loosely coupled (e.g., consisting of geographically distributed teams), the systems being built tend toward the more modular, and therefore hopefully less coupled. The tendency of a single team that owns many services to lean toward tighter integration is very hard to maintain in a more distributed organization.

