Kindle Notes & Highlights
Read between July 23, 2017 and June 5, 2022
Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.
To be fair, though, not all services need to be that reliable. No one's health, life or livelihood is going to be adversely impacted by a sporadic timeout.

The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing.
I remember when XML was the new hotness and everyone wanted to know if their software interacted with XML.
People seemed crestfallen when I would explain that it was just a file format.
Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using.
I learned how to write native code for Windows and/or the Macintosh at least a half dozen times, since things kept changing but staying the same. Win32 API, MFC, ATL, Macintosh Toolbox, PowerPlant, MacApp... ugh.
And as you learn a half dozen basically identical technologies, you will never, ever be able to keep all the details straight. It will just be one big blur of technology and configuration.
Even worse, some details will remain forever embedded in your skull, no matter how increasingly irrelevant they are. Did you know that when you draw a line between two points, on Windows it will not include the last pixel but on the old Macintosh toolbox it will?
“You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design.
Most of the stuff Google and Amazon do could be done using a relational database too...
But you will usually get someone in a position of authority who makes a broad, general statement, which then gets taken as gospel, and you can either fight against it, or just accept that your tiny low-traffic application has to use a NoSQL data store.
It's often easier to just give in, or somehow misinterpret the decree so you can use whatever you wanted to use in the first place.
However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion.
One of the nice things about having worked for the Goo Factory is that when someone mentions "big data", I can ask how big and then scoff.
That shit never gets old.
(I did not actually work with big data, I'm just an asshole)
Store data so that they, or another application, can find it again later (databases)
Remember the result of an expensive operation, to speed up reads (caches)
The boundary between a database and a cache is very, very fuzzy, and it is common that a database will be used as a cache.
Using a caching technology as your database is less common, but with appropriate backups, it can be a very good idea.
I would refer to master and secondary stores.
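To make that fuzzy boundary concrete, here is a minimal read-through sketch in which the database stays the master store and the cache is a lossy, rebuildable secondary; `cache`, `db`, and their methods are hypothetical stand-ins (think a Memcached client and a SQL client):

```python
import json

class ReadThroughStore:
    """Database as master store, cache as a rebuildable secondary."""

    def __init__(self, cache, db, ttl_seconds=300):
        self.cache = cache   # e.g. a Memcached client (hypothetical interface)
        self.db = db         # e.g. a SQL client (hypothetical interface)
        self.ttl = ttl_seconds

    def get(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            return json.loads(cached)
        value = self.db.fetch(key)   # the master store is the source of truth
        if value is not None:
            # Losing the secondary store costs only latency, never data.
            self.cache.set(key, json.dumps(value), expire=self.ttl)
        return value
```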
When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand.
No, no, just which tools are adequate.
And favor tools that you are already using unless they are inadequate. No one wants to maintain AwesomeCache AND BoringCache, even if AwesomeCache is awesome.
If you are using BoringCache for other applications you maintain, just use BoringCache for the new application unless you have a good reason.
Also, if you are looking for a midsize sedan, you should always test drive a Honda Accord.
increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.
That just seems like an opportunity to build a new tool. Don't give in to the temptation until the third time at least.
You need to learn your use cases.
Working with "generic" code that is really just specific is a nuisance.
For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).
This is a great example. You're thinking "everyone needs to do this! Let's solve it once and for all!"
If everyone needs to do this, and it was easy, it would exist. You don't understand the problem yet...
I remember when I was young and full of hope. I miss hope. Hope was fun.
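For what that stitching actually looks like, here is a minimal sketch of the sync responsibility the quote describes; `db`, `cache`, and `search` are hypothetical clients (a SQL driver, Memcached, Elasticsearch), and the error handling is deliberately naive:

```python
def update_product(db, cache, search, product_id, fields):
    # 1. The master database is updated first; it is the source of truth.
    db.update("products", product_id, fields)

    # 2. Invalidate rather than overwrite the cache: the next read
    #    repopulates it from the database.
    cache.delete(f"product:{product_id}")

    # 3. Keep the full-text index in step with the database.
    search.index("products", product_id, fields)

    # If step 2 or 3 fails after step 1 succeeds, the cache and index
    # drift from the database -- that partial-failure window is why
    # "solve it once and for all" is harder than it looks.
```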
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability”
Well, that's pretty much impossible. If the service you depend on to tell you the length of a book has failed, you can't calculate whether the user is displaying more than N% of the book.
You can, however, fail gracefully and deliberately. Maybe the book length hasn't changed and you can get it from the cache even if it has expired. Maybe you shrug and say that the user can share that additional bit. Maybe you fail.
Maybe you shrug three times and then start failing.
But, you cannot do all your work without all the services you depend upon, and you have to assume that sometimes they will be down.
Most important, however, is that when everything is running again, your service self-repairs -- if you are returning partial information, don't cache it for a long time, etc.
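A minimal sketch of that shrug-then-fail pattern, assuming a hypothetical cache that hands back entries along with their expiry time (all names and numbers here are invented):

```python
import time

STALE_GRACE = 3600   # assumed: serve expired entries up to an hour old
FRESH_TTL = 300      # assumed: normal cache lifetime for good answers

def get_book_length(book_id, cache, length_service):
    entry = cache.get(book_id)   # returns (value, expires_at) or None
    now = time.time()
    if entry is not None and now < entry[1]:
        return entry[0]                       # fresh hit
    try:
        length = length_service.fetch(book_id)
    except Exception:
        if entry is not None and now < entry[1] + STALE_GRACE:
            return entry[0]                   # shrug: stale beats nothing
        return None                           # shrug again, then fail
    # Cache the good answer normally; a partial or degraded answer would
    # get a much shorter TTL so the system self-repairs once the
    # dependency recovers.
    cache.set(book_id, (length, now + FRESH_TTL))
    return length
```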
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Imagine working for a company where the median engineer leaves within two years. Some people don't have to imagine that.
Maybe the right approach is to fix the company so it's not a problem -- give people opportunities for advancement, and make it suck less on a day-to-day basis. But you don't have that option.
And maybe the next best option is to just leave within two years. But, your life is complicated and that isn't an option either...
So, you have to ensure the new batch of young'uns can handle on-call without dragging you in. If you fail, your complicated life will be interrupted and you will need to find another job, and that time between when you realize you failed and when you find that new job is going to suck.
That's why maintainability is important.
Good fucking lord do I miss hope.
It can tolerate the user making mistakes or using the software in unexpected ways.
One of the most fun parts of working on community software is that people will use it in ways you don't expect.
Moving a bunch of big data around? Computers do that shit standing on their heads. That's no fun.
But what if your apparently useless feature, designed to pull bad content out of better content, allows people to play word games for a decade, build friendships, and create a community that you will never understand the importance of? That shit is awesome.
If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space — good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.
The servers don't care. The user requests will plummet to zero.
In fact, aside from the minor issue of the servers being compacted into a singularity, everything would be great -- no failed requests!
Good luck getting the performance metrics outside of the singularity though.
Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
A good approach can be to update your data in the master, and then just trigger your "caches are inconsistent" code.
It regularly runs and verifies your ability to recover from errors. Detecting errors in the general case can still be a problem though.
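A minimal sketch of that approach: after every master write, run the same reconciliation code you would run after a real fault, so the repair path is exercised constantly instead of once a year during an outage (`db` and `cache` are hypothetical clients):

```python
def write(db, cache, key, value):
    db.put(key, value)               # 1. update the master store
    cache.delete(key)                # 2. declare the cached copy suspect
    reconcile_cache(db, cache, key)  # 3. run the repair path on purpose

def reconcile_cache(db, cache, key):
    # The exact same code that runs when monitoring detects real drift,
    # now triggered deliberately on every write.
    cache.set(key, db.get(key))
```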
Set up detailed and clear monitoring, such as performance metrics and error rates.
On the other hand, if you don't have the metrics and alarms, you will sleep through the night no matter what happens.
Do I need to be woken because someone cannot add a note to a highlight? It depends.
Think twice, but be honest with yourself. An additional 200ms to your tp99 can probably wait until morning.
outages of ecommerce sites can have huge costs in terms of lost revenue
Most people come back. Really.
Once you hit a certain level of reliability, it isn't really worth it to mindlessly keep pursuing more reliability. That level will differ for different applications.
If you are an ad-based business serving up niche pornography, people will wait for you to come back online, or put up with a level of flakiness to get their My Little Pony Porn. If your service regulates iron lungs, however, your users may be less forgiving.
Most applications are somewhere between the two.
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees:
The real thing you care about is perceived response time. Dazzle your user with the quick stuff while the slow stuff slowly fills in.
Speeding up the entire page might not be worth it if it slows down the fastest bits.
Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
And if you are making multiple requests to the same service, there is a good chance that your monitoring system will do stupid things that make that hard to measure.
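The arithmetic behind the quote, for the record: with n independent requests, the chance that at least one is slower than the median is 1 - 0.5^n.

```python
# Probability that at least one of n requests is slower than the median.
for n in (1, 2, 5, 10):
    print(n, 1 - 0.5 ** n)
# 1 -> 0.50, 2 -> 0.75, 5 -> ~0.97, 10 -> ~0.999
```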
For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests.
Something to be wary of is tp99s on the general population when you have sparse data.
It's easy to return nothing, and if you are starting a service for a feature, the data may start out sparse. Create another metric for when you are returning data and monitor that, so you aren't surprised as the data fills in.
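A minimal sketch of that two-metric idea, assuming a statsd-style client; `metrics.timing` and the metric names are stand-ins:

```python
def record_latency(metrics, elapsed_ms, returned_data):
    # General population: dominated by cheap empty responses while data
    # is still sparse.
    metrics.timing("response_time", elapsed_ms)
    # Conditioned on actually returning data: the number that will
    # surprise you later if you don't watch it now.
    if returned_data:
        metrics.timing("response_time.with_data", elapsed_ms)
```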
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time.
Generating load is damn hard. It's hard to get a load that matches normal user patterns. Way harder than you expect.
Consider using actual user load, or recording user load and playing it back at N times the rate.
Once your service is running, if you understand the performance characteristics, it can be a good idea to periodically remove servers until your service dies an ignoble death -- it lets you know what will break first under higher load, without a lot of work.
This is only useful if your service can trust its dependencies to scale better than it does. If you have a SQL server, then you have to do the hard work of generating load.
If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements
At a certain point, clients will hit the refresh button, and actually increase the load.
Sometimes I wonder if Goodreads has reached that point.
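A minimal open-loop generator along the lines the two quotes call for: requests go out on a fixed schedule no matter how slow the responses get, so server-side queueing is not hidden by a polite client. `send_request` is a hypothetical stand-in for your actual request code.

```python
import threading
import time

def generate_load(send_request, rate_per_second, duration_seconds):
    interval = 1.0 / rate_per_second
    next_send = time.time()
    deadline = next_send + duration_seconds
    while next_send < deadline:
        # Each request gets its own thread so a slow response never
        # delays the next send -- unlike a closed-loop client that
        # waits for the previous request to complete.
        threading.Thread(target=send_request, daemon=True).start()
        next_send += interval
        time.sleep(max(0.0, next_send - time.time()))
```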
Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow
If your service makes 10 calls to a backend, your tp99 is their tp99.9.
Wait times in your thread pools can also be a big problem, and it is worth measuring.
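A quick simulation of that tp99/tp99.9 claim, with made-up numbers: each backend call takes 10 ms except for a 0.1% chance of 500 ms, and a request fans out to 10 parallel calls, so it is as slow as its slowest call.

```python
import random

def backend_call_ms():
    # 10 ms normally, with a 0.1% chance of hitting the 500 ms tail.
    return 500.0 if random.random() < 0.001 else 10.0

requests = [max(backend_call_ms() for _ in range(10)) for _ in range(100_000)]
slow = sum(1 for r in requests if r > 10.0) / len(requests)
print(f"{slow:.2%} of requests hit the backend's 0.1% tail")  # ~1%, i.e. your tp99
```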
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless — the right way of aggregating response time data is to add the histograms
Good luck with that.
Grass spews out a metric shitload of requests, with the number based on the user's social graph and the number of books in a work. We measure the maximum wait time for our threads per request, and take the tp99 of that across requests on many servers.
It's mathematically meaningless as a measurement, but when it goes up we know we need to adjust our thread pools or add servers.
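For contrast, a sketch of what the book means by adding histograms: each machine keeps per-bucket counts, you sum the counts, and read the percentile off the merged histogram. Averaging each machine's p99 would weight machines rather than requests. The bucket boundaries and counts below are invented.

```python
from collections import Counter

def merge(histograms):
    total = Counter()
    for h in histograms:
        total.update(h)   # sums per-bucket counts across machines
    return total

def percentile(histogram, p):
    # histogram maps bucket upper bound (ms) -> request count
    rank = p * sum(histogram.values())
    seen = 0
    for bucket in sorted(histogram):
        seen += histogram[bucket]
        if seen >= rank:
            return bucket

machine_a = Counter({10: 900, 50: 90, 500: 10})
machine_b = Counter({10: 9000, 50: 900, 500: 100})
print(percentile(merge([machine_a, machine_b]), 0.99))  # 50 (ms)
```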
If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase — or perhaps even more often than that.
That is a mistake for all but the largest services.
Choose tools that scale better than your application will have to -- Dynamo, etc. -- and you shouldn't have to rearchitect that often.
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance — fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
That is only the case if the software is successful.
Consider the likelihood of success, as well as whose money you are spending, when deciding how scalable to go.
Ease of updating — the name is stored in only one place, so it is easy to update across the board if it ever needs to be changed (e.g., change of a city name due to political events)
In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. Those choices are effectively the “access path,” but the big difference is that they are made automatically by the query optimizer, not by the application developer, so we rarely need to think about them.
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON).
Total document size can become an issue pretty quickly. The datastore may not handle large documents, and you may not enjoy parsing that much JSON to get the patient's middle initial.
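One mitigation, if you are on MongoDB: a projection asks the server for just the fields you need, so at least the application isn't parsing the whole document. The collection and field names here are invented.

```python
from pymongo import MongoClient

patients = MongoClient().hospital.patients
doc = patients.find_one(
    {"patient_id": 12345},
    {"name.middle_initial": 1, "_id": 0},  # fetch one field, skip the rest
)
```

The server may still read the full document off disk, which is the deeper version of the problem.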
MapReduce doesn’t have a monopoly on distributed query execution.
You also don't need to distribute MapReduce. A friend of mine worked at a company that wanted to use MapReduce, but didn't want to maintain a large datacenter, so they ran a bunch of VMs on a single machine, and ran MapReduce on those VMs.
This was one of the stupidest solutions I have ever heard of, and when I told this story at the Goo Factory, no one believed me.
Cypher is a declarative query language for property graphs, created for the Neo4j graph database [37]. (It is named after a character in the movie The Matrix and is not related to ciphers in cryptography [38].)
The person who named it should be beaten to death.
As bad as my names are, they are seldom misleading, unless there is an awesome pun.
If you read more about triple-stores, you may get sucked into a maelstrom of articles written about the semantic web.