Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
0%
Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.
Robert Gustavo
To be fair, though, not all services need to be that reliable. No one's health, life or livelihood is going to be adversely impacted by a sporadic timeout.
Corey and 1 other person liked this
Brian
Very true. Make sure your investments in availability are well placed.
0%
The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing.
Robert Gustavo
I remember when XML was the new hotness and everyone wanted to know if their software interacted with XML. People seemed crestfallen when I would explain that it was just a file format.
Jeff
X is for eXtreme!
Cbkb
I'm glad you liked this update. This is a comment on that liking.
0%
Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using.
Robert Gustavo
I learned how to write native code for Windows and/or the Macintosh at least a half dozen times, since things kept changing but staying the same. Win32 API, MFC, ATL, Macintosh Toolbox, PowerPlant, MacApp... ugh. And as you learn a half dozen basically identical technologies, you will never, ever be able to keep all the details straight. It will just be one big blur of technology and configuration. Even worse, some details will remain forever embedded in your skull, no matter how increasingly irrelevant they are. Did you know that when you draw a line between two points, on Windows it will not include the last pixel but on the old Macintosh toolbox it will?
0%
“You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design.
Robert Gustavo
Most of the stuff Google and Amazon do could be done using a relational database too... But you will usually get someone in a position of authority who makes a broad, general statement, which then gets taken as gospel, and you can either fight against it, or just accept that your tiny low-traffic application has to use a NoSQL data store. It's often easier to just give in, or somehow misinterpret the decree so you can use whatever you wanted to use in the first place.
0%
However, it’s also important to choose the right tool for the job,
Robert Gustavo
No, it's important to choose an adequate tool for the job, not the right one. You will never know what pitfalls doing everything in Go or Haskell would cause; you will just lament the lack of coroutines or monads.
0%
However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion.
Robert Gustavo
One of the nice things about having worked for the Goo Factory is that when someone mentions "big data", I can ask how big and then scoff. That shit never gets old. (I did not actually work with big data, I'm just an asshole)
Rachel and 2 other people liked this
1%
Store data so that they, or another application, can find it again later (databases) Remember the result of an expensive operation, to speed up reads (caches)
Robert Gustavo
The boundary between a database and a cache is very, very fuzzy, and it is common for a database to be used as a cache. Using a caching technology as your database is less common, but with appropriate backups it can be a very good idea. I prefer to talk about master and secondary stores.
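A minimal sketch of that fuzzy boundary, assuming a hypothetical read-through setup where the "cache" and the master store are just two stores with different durability (every name below is made up for illustration):

import time

class ReadThroughCache:
    # Read-through cache over a master store; in-memory dicts stand in
    # for the real systems, so this is a sketch, not an implementation.
    def __init__(self, master, ttl_seconds=60):
        self.master = master            # the durable system of record
        self.ttl = ttl_seconds
        self.entries = {}               # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and time.time() < entry[1]:
            return entry[0]             # served from the cache
        value = self.master[key]        # fall through to the master store
        self.entries[key] = (value, time.time() + self.ttl)
        return value

master = {"user:1": "Robert"}           # stand-in for the real database
cache = ReadThroughCache(master)
cache.get("user:1")                     # miss: reads the master, fills the cache
cache.get("user:1")                     # hit: served from memory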
1%
When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand.
Robert Gustavo
No, no, just which tools are adequate. And favor tools that you are already using unless they are inadequate. No one wants to maintain AwesomeCache AND BoringCache, even if AwesomeCache is awesome. If you are using BoringCache for other applications you maintain, just use BoringCache for the new application unless you have a good reason. Also, if you are looking for a midsize sedan, you should always test drive a Honda Accord.
Brian
More interesting is when Use Case R comes up and for whatever reason BoringCache is insufficient. Use Cases A-Q remain on BC and R uses AwesomeCache? You still get proliferation of technologies with t…
1%
increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.
Robert Gustavo
That just seems like an opportunity to build a new tool. Don't give in to the temptation until the third time at least. You need to learn your use cases. Working with "generic" code that is really just specific is a nuisance.
1%
For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).
Robert Gustavo
This is a great example. You're thinking "everyone needs to do this! Let's solve it once and for all!" If everyone needed to do this, and it were easy, it would already exist. You don't understand the problem yet... I remember when I was young and full of hope. I miss hope. Hope was fun.
Peter Christensen
Shudder - keeping data in sync across multiple stores and caches.
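A minimal sketch of what "the application code's responsibility" looks like here, with plain dicts standing in for the database, Memcached, and Elasticsearch (the function and field names are assumptions for illustration):

def save_book(book, db, cache, search_index):
    # Write the system of record first, then update the derived stores.
    # If the process dies between these steps, the cache and index drift
    # from the database -- exactly the hazard the passage describes.
    db[book["id"]] = book                              # 1. main database
    cache.pop(book["id"], None)                        # 2. invalidate the stale cache entry
    search_index[book["title"].lower()] = book["id"]   # 3. update the full-text index

db, cache, search_index = {}, {}, {}
save_book({"id": 42, "title": "Designing Data-Intensive Applications"}, db, cache, search_index)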
1%
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability”
Robert Gustavo
Well, that's pretty much impossible. If the service you depend on to tell you the length of a book has failed, you can't calculate whether the user is displaying more than N% of the book. You can, however, fail gracefully and deliberately. Maybe the book length hasn't changed and you can get it from the cache even if it has expired. Maybe you shrug and say that the user can share that additional bit. Maybe you fail. Maybe you shrug three times and then start failing. But you cannot do all your work without all the services you depend upon, and you have to assume that sometimes they will be down. Most important, however, is that when everything is running again, your service self-repairs -- if you are returning partial information, don't cache it for a long time, etc.
Brian
Yes. Desired level of performance can likely include a degraded experience that still offers the core behaviors. The % of book is an interesting case study though--what is a MUST here?
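A sketch of the "shrug gracefully" strategy above: serve an expired cache entry when the dependency is down, and fail only when there is nothing usable to fall back on (the service, limits, and names are all hypothetical):

import time

STALE_OK_SECONDS = 3600    # how stale a fallback value may be (an assumption)

def book_length(book_id, fetch_remote, cache):
    # cache maps book_id -> (length, stored_at)
    try:
        length = fetch_remote(book_id)           # the dependency that may be down
        cache[book_id] = (length, time.time())
        return length, "fresh"
    except IOError:
        entry = cache.get(book_id)
        if entry and time.time() - entry[1] < STALE_OK_SECONDS:
            return entry[0], "stale"             # degrade deliberately
        raise                                    # shrug a few times, then fail

def down(_book_id):
    raise IOError("length service is down")

cache = {7: (250, time.time() - 120)}            # a two-minute-old entry
book_length(7, down, cache)                      # -> (250, 'stale')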
1%
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Robert Gustavo
Imagine working for a company where the median engineer leaves within two years. Some people don't have to imagine that. Maybe the right approach is to fix the company so it's not a problem -- give people opportunities for advancement, and make it suck less on a day-to-day basis. But you don't have that option. And maybe the next best option is to just leave within two years. But your life is complicated and that isn't an option either... So you have to ensure the new batch of young'uns can handle oncall without dragging you in. If you fail, your complicated life will be interrupted and you will need to find another job, and the time between when you realize you failed and when you find that new job is going to suck. That's why maintainability is important. Good fucking lord do I miss hope.
Corey and 1 other person liked this
1%
It can tolerate the user making mistakes or using the software in unexpected ways.
Robert Gustavo
One of the most fun parts of working on community software is that people will use it in ways you don't expect. Moving a bunch of big data around? Computers do that shit standing on their heads. That's no fun. But what if your apparently useless feature, designed to pull bad content out of better content, allows people to play word games for a decade, build friendships, and create a community that you will never understand the importance of? That shit is awesome.
Brian and 1 other person liked this
1%
If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space — good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.
Robert Gustavo
The servers don't care. The user requests will plummet to zero. In fact, aside from the minor issue of the servers being compacted into a singularity, everything would be great -- no failed requests! Good luck getting the performance metrics out of the singularity, though.
Brian
Actually, interesting thought experiment. How much residual traffic would there be without humans? How long would it take to go to zero? Network might not even notice.
1%
Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
Robert Gustavo
A good approach can be to update your data in the master, and then just trigger your "caches are inconsistent" code. It regularly runs and verifies your ability to recover from errors. Detecting errors in the general case can still be a problem though.
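A sketch of the pattern in that comment, with made-up names: every so often a write deliberately pretends the cache was lost, so the recovery path runs constantly instead of only during rare outages:

import random

def repair(db, cache):
    cache.update(db)               # naive rebuild; exercising it is the point

def write(key, value, db, cache):
    db[key] = value                # update the master first
    cache.pop(key, None)           # normal invalidation
    if random.random() < 0.01:     # deliberately induce a fault ~1% of the time
        cache.clear()              # pretend the whole cache went away
        repair(db, cache)          # ...and prove we can recover from it

db, cache = {}, {}
write("highlight:7", "note", db, cache)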
2%
Set up detailed and clear monitoring, such as performance metrics and error rates.
Robert Gustavo
On the other hand, if you don't have the metrics and alarms, you will sleep through the night no matter what happens. Do I need to be woken because someone cannot add a note to a highlight? It depends. Think twice, but be honest with yourself. An additional 200ms to your tp99 can probably wait until morning.
2%
outages of ecommerce sites can have huge costs in terms of lost revenue
Robert Gustavo
Most people come back. Really. Once you hit a certain level of reliability, it isn't really worth it to mindlessly keep pursuing more reliability. That level will differ for different applications. If you are an ad-based business serving up niche pornography, people will wait for you to come back online, or put up with a level of flakiness to get their My Little Pony porn. If your service regulates iron lungs, however, your users may be less forgiving. Most applications are somewhere between the two.
Brian
Though for a company that loses $10MM in revenue per hour, it's in its interest to invest a large % of that amount to further reduce an hour of outage per year (usually the cost is far less than that). Whe…
2%
Twitter is moving to a hybrid of both approaches.
Robert Gustavo
Pursue moderation in everything, including moderation.
2%
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees:
Robert Gustavo
The real thing you care about is perceived response time. Dazzle your user with the quick stuff while the slow stuff slowly fills in. Speeding up the entire page might not be worth it if it slows down the fastest bits.
Brian
Agree with you. What's a good way to measure that?
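One hedged answer to that question: log time-to-first-content and time-to-complete as separate series and track percentiles of each, since perceived response time follows the former (the field names below are assumptions):

def p95(samples):
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

# one dict per page view; 'first_paint_ms' and 'load_ms' are made-up fields
events = [{"first_paint_ms": 120, "load_ms": 900},
          {"first_paint_ms": 140, "load_ms": 2400}]
perceived = p95(e["first_paint_ms"] for e in events)   # what the user feels
total = p95(e["load_ms"] for e in events)              # what a naive metric shows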
2%
Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
Robert Gustavo
And if you are making multiple requests to the same service, there is a good chance that your monitoring system will do stupid things that make that hard to measure.
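The arithmetic behind that quote, as a quick check: if each request independently has a 50% chance of exceeding the median, the chance that at least one of n requests does is 1 - 0.5**n:

for n in (1, 2, 10, 30):
    p = 1 - 0.5 ** n
    print(f"{n:2d} requests -> {p:.4f} chance at least one is slower than the median")
# 10 requests is already ~0.999; a page with 30 resources is near certainty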
2%
For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests.
Robert Gustavo
Something to be wary of is tp99s on the general population when you have sparse data. It's easy to return nothing, and if you are starting a service for a feature, the data may start out sparse. Create another metric for when you are returning data and monitor that, so you aren't surprised as the data fills in.
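One way to act on that advice, sketched with hypothetical metric names: emit a separate latency series for requests that actually returned data, so cheap empty responses don't mask the expensive ones:

def record_latency(metrics, millis, returned_data):
    # metrics maps a series name to a list of samples, standing in for a
    # real metrics client; both series names are assumptions.
    metrics.setdefault("latency_all_ms", []).append(millis)
    if returned_data:
        metrics.setdefault("latency_nonempty_ms", []).append(millis)

metrics = {}
record_latency(metrics, 3, returned_data=False)    # fast because it did nothing
record_latency(metrics, 480, returned_data=True)   # the kind worth alarming on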
3%
[The highlighted passage was a figure; the image was not captured.]
Robert Gustavo
This is hard. Measuring at the VIP is often effective -- if you have a lot of overflow, add more service boxes.
Brian
Doh, is this tied to an image?
3%
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time.
Robert Gustavo
Generating load is damn hard. It's hard to get a load that matches normal user patterns. Way harder than you expect. Consider using actual user load, or recording user load and playing it back at N times the rate. Once your service is running, if you understand the performance characteristics, it can be a good idea to periodically remove servers until your service dies an ignoble death -- it tells you what is going to break first under higher load, without a lot of work. This is only useful if your service can trust its dependencies to scale better than it does. If you have a SQL server, then you have to do the hard work of generating load.
Brian liked this
Brian
Agree.
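A minimal open-loop sketch of the "play it back at N times the rate" idea: requests go out on the recorded schedule, compressed by a multiplier, in threads that never wait for a response (the URLs and multiplier are assumptions):

import threading, time, urllib.request

RATE_MULTIPLIER = 3                                   # N times the recorded rate
recorded = [(0.0, "http://localhost:8080/book/1"),    # (offset_seconds, url)
            (0.4, "http://localhost:8080/book/2")]    # from a traffic capture

def fire(url):
    try:
        urllib.request.urlopen(url, timeout=5)
    except OSError:
        pass                                          # record the failure; never slow down

start = time.time()
for offset, url in recorded:
    delay = offset / RATE_MULTIPLIER - (time.time() - start)
    if delay > 0:
        time.sleep(delay)                             # pace on the schedule, not on responses
    threading.Thread(target=fire, args=(url,)).start()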
3%
If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements
Robert Gustavo
At a certain point, clients will hit the refresh button, and actually increase the load. Sometimes I wonder if Goodreads has reached that point.
3%
Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow
Robert Gustavo
If your service makes 10 calls to a backend, your tp99 is their tp99.9. Wait times in your thread pools can also be a big problem, and they are worth measuring.
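The arithmetic behind "your tp99 is their tp99.9": the chance that all n backend calls are fast is the per-call fast fraction raised to the nth power:

for backend_fast in (0.99, 0.999):
    for n in (1, 10, 100):
        slow = 1 - backend_fast ** n
        print(f"backend fast {backend_fast}, {n:3d} calls -> "
              f"{slow * 100:.1f}% of requests hit at least one slow call")
# with a 99.9%-fast backend, 10 calls per request puts ~1% of requests in the tail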
3%
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless — the right way of aggregating response time data is to add the histograms
Robert Gustavo
Good luck with that. Grass spews out a metric shitload of requests, with the number based on the user's social graph and the number of books in a work. We measure the maximum wait time for our threads per request, and take the tp99 of that across requests across many servers. It's mathematically meaningless as a measurement, but when it goes up we know we need to adjust our thread pools or add servers.
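A sketch of "add the histograms" with fixed buckets; the bucket bounds here are an assumption, and real systems use fancier structures (HDR histograms, t-digests), but the merge step really is just addition:

BUCKETS = [10, 50, 100, 500, 1000, float("inf")]     # upper bounds in ms (assumption)

def to_histogram(samples_ms):
    counts = [0] * len(BUCKETS)
    for s in samples_ms:
        counts[next(i for i, b in enumerate(BUCKETS) if s <= b)] += 1
    return counts

def merge(histograms):
    return [sum(col) for col in zip(*histograms)]    # adding histograms is just this

def percentile(counts, q):
    target, seen = q * sum(counts), 0
    for upper, count in zip(BUCKETS, counts):
        seen += count
        if seen >= target:
            return upper                             # report the bucket's upper bound

machines = [to_histogram([12, 80, 95]), to_histogram([30, 45, 700])]
percentile(merge(machines), 0.99)                    # merged p99, not an average of p99s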
3%
If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase — or perhaps even more often than that.
Robert Gustavo
That is a mistake for all but the largest services. Choose tools that scale better than your application will have to -- Dynamo, etc. -- and you shouldn't have to rearchitect that often.
3%
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance — fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
Robert Gustavo
That is only the case if the software is successful. Consider the likelihood of success, as well as whose money you are spending, when deciding how far to go on scalability.
Brian and 1 other person liked this
Brian
True. You probably don't care about ongoing maintenance when trying to find product fit. And you're right, when you know customers love it and probability of success is high, makes sense to realize th…
4%
A widespread preference for free and open source software over commercial database products
Robert Gustavo
As a software engineer, you should consider who you want to benefit from the products of your labor.
5%
Ease of updating — the name is stored in only one place, so it is easy to update across the board if it ever needs to be changed (e.g., change of a city name due to political events)
5%
In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. Those choices are effectively the “access path,” but the big difference is that they are made automatically by the query optimizer, not by the application developer, so we rarely need to think about them.
Robert Gustavo
ha ha ha.
5%
If you want to query your data in new ways, you can just declare a new index, and queries will automatically use whichever indexes are most appropriate. You don’t need to change your queries to take advantage of a new index.
Robert Gustavo
Oh, that is rich...
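The book's claim, demonstrated with SQLite from the Python standard library (whether the optimizer's automatic choice is the one you wanted is, presumably, what is being laughed at here):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, city TEXT)")

query = "SELECT * FROM users WHERE city = 'Boise'"
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # SCAN: no index yet

con.execute("CREATE INDEX idx_users_city ON users(city)")     # declare a new index
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # same query: SEARCH ... USING INDEX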
6%
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON).
Robert Gustavo
Total document size can become an issue pretty quickly. The datastore may not handle large documents, and you may not enjoy parsing that much JSON to get the patient's middle initial.
7%
MapReduce doesn’t have a monopoly on distributed query execution.
Robert Gustavo
You also don't need to distribute MapReduce. A friend of mine worked at a company that wanted to use MapReduce, but didn't want to maintain a large datacenter, so they ran a bunch of VMs on a single machine, and ran MapReduce on those VMs. This was one of the stupidest solutions I have ever heard of, and when I told this story at the Goo Factory, no one believed me.
7%
PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results.
Robert Gustavo
Fun Fact: PigeonRank is almost as accurate at describing Google's search as PageRank.
8%
Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
Robert Gustavo
Graphs are pretty awesome for everything but performance -- but you can add caching in front of them for your most painful use cases.
8%
Cypher is a declarative query language for property graphs, created for the Neo4j graph database [37]. (It is named after a character in the movie The Matrix and is not related to ciphers in cryptography [38].)
Robert Gustavo
The person who named it should be beaten to death. As bad as my names are, they are seldom misleading, unless there is an awesome pun.
8%
SQL:1999,
Robert Gustavo
SQL: 1999 was the unfortunate reimagining of Space: 1999, except with SQL instead of Space.
9%
If you read more about triple-stores, you may get sucked into a maelstrom of articles written about the semantic web.
Robert Gustavo
You have been warned.