Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.
I consider this book a mini-encyclopedia of modern data engineering. Like a specialized encyclopedia, it covers a broad field in considerable detail. But it is not a hands-on manual or a cookbook for a particular Big Data, NoSQL, or NewSQL product. What the author does is lay down the principles of current distributed big data systems, and he does a very fine job of it.
If you are after the obscure details of a particular product, or some tutorials and "how-to"s, go elsewhere. But if you want to understand the main principles and issues, as well as the challenges of data-intensive and distributed systems, you've come to the right place.
Martin Kleppmann starts out by solidly giving the reader the conceptual framework in the first chapter: What does reliability mean? How is it defined? What is the difference between a "fault" and a "failure"? How do you describe load on a data-intensive system? How do you talk about performance and scalability in a meaningful way? What does it mean to have a "maintainable" system?
The second chapter gives a brief overview of different data models and shows their suitability for different use cases, using modern challenges that companies such as Twitter have faced. This chapter is a solid foundation for understanding the differences between the relational, document, and graph data models, as well as the languages used for processing data stored in them.
The third chapter goes into a lot of detail regarding the building blocks of different types of database systems: it describes the data structures and algorithms behind the systems shown in the previous chapter, so you get to know hash indexes, SSTables (Sorted String Tables), Log-Structured Merge-trees (LSM-trees), B-trees, and other data structures. The chapter then introduces column-oriented databases and the underlying principles and structures behind them.
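To make that concrete, here is a minimal sketch (my own toy illustration, not code from the book) of the simplest storage structure the chapter starts with: an append-only log on disk with an in-memory hash index mapping each key to the byte offset of its latest record.

```python
import os
import tempfile

class LogStructuredStore:
    """Toy key-value store: an append-only log file plus an in-memory
    hash index from key to the byte offset of the most recent record."""

    def __init__(self, path):
        self.path = path
        self.index = {}                   # key -> byte offset of latest record
        open(self.path, "ab").close()     # create the log file if missing

    def put(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()    # newest write shadows older offsets
            f.write(record)

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                # one seek, one read: O(1) lookup
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

store = LogStructuredStore(os.path.join(tempfile.mkdtemp(), "kv.log"))
store.put("user:1", "alice")
store.put("user:1", "bob")   # old record stays on disk until compaction
print(store.get("user:1"))   # -> bob
```

Real systems add compaction, crash recovery, and (for SSTables/LSM-trees) sorted segments, but the shadowing-by-append idea above is the kernel.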
Following these, the book describes methods of data encoding, starting with the venerable XML and JSON and going into the details of formats such as Avro, Thrift, and Protocol Buffers, showing the trade-offs between these choices.
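The core trade-off can be demonstrated in a few lines. The sketch below (my own illustration; the hand-rolled binary format is not real Avro, Thrift, or Protobuf) encodes the same record as compact JSON and as schema-driven binary, where field names live in a schema rather than in every record:

```python
import json
import struct

record = {"userName": "Martin",
          "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

# Textual encoding: every record repeats all the field names.
json_bytes = json.dumps(record, separators=(",", ":")).encode()

def encode_str(s):
    # Length-prefixed UTF-8 string (2-byte little-endian length).
    b = s.encode()
    return struct.pack("<H", len(b)) + b

# Binary encoding driven by an out-of-band schema: just the values,
# in a fixed field order, with no field names in the payload.
binary = encode_str(record["userName"])
binary += struct.pack("<q", record["favoriteNumber"])   # 8-byte int
binary += struct.pack("<H", len(record["interests"]))   # list length
for interest in record["interests"]:
    binary += encode_str(interest)

print(len(json_bytes), len(binary))   # the binary form is noticeably smaller
```

The real formats add tags, schema evolution rules, and varint integers, but the space savings come from exactly this move: pushing the field names out of the data and into the schema.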
Following the building blocks and foundations comes "Part II", and this is where things start to get really interesting, because now the reader starts to learn about the challenging topic of distributed systems: how to use the basic building blocks in a setting where anything can go wrong in the most unexpected ways. Part II is the most complex part of the book: you learn how to replicate your data, what happens when replication lags behind, how to provide a consistent picture to the end user or the end programmer, what algorithms are used for leader election in consensus systems, and how leaderless replication works.
One of the primary purposes of using a distributed system is to have an advantage over a single, central system, and that advantage is better service: a more resilient service with an acceptable level of responsiveness. This means you need to distribute the load and your data, and there are many schemes for partitioning your data. Chapter 6 of Part II provides a lot of detail on partitioning, keys, indexes and secondary indexes, and how to handle queries when your data is partitioned using various methods.
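One of the schemes covered there, hash partitioning, is easy to sketch (a toy illustration of the idea; the key names and partition count are mine, and real systems hash to ranges rather than mod-N so that rebalancing moves less data):

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Hash partitioning: a stable hash of the key spreads load evenly,
    at the cost of losing the key ordering that range partitioning keeps."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Route some writes. Alphabetically adjacent keys can land on different
# partitions, so a range scan must now query every partition.
partitions = {i: [] for i in range(NUM_PARTITIONS)}
for key in ["apple", "apricot", "banana", "cherry", "damson"]:
    partitions[partition_for(key)].append(key)

print(partitions)
```

The chapter's discussion of secondary indexes follows directly from this: once rows are scattered by hash, an index over a non-key attribute must either be partitioned alongside the data (requiring scatter/gather reads) or partitioned by the indexed term (requiring distributed writes).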
No data systems book can be complete without touching on the topic of transactions, and this book is no exception. You learn about the fuzziness surrounding the definition of ACID, isolation levels, and serializability.
The remaining two chapters of Part II, Chapters 8 and 9, are probably the most interesting part of the book. You are now ready to learn the gory details of how to deal with network faults and all the other kinds of faults that can keep your data system from staying in a usable and consistent state; the problems with the CAP theorem; version vectors (and why they are not vector clocks); Byzantine faults; how to establish a sense of causality and ordering in a distributed system; why algorithms such as Paxos, Raft, and ZAB (used in ZooKeeper) exist; distributed transactions; and many more topics.
The rest of the book, Part III, is dedicated to batch and stream processing. The author describes the famous MapReduce batch processing model in detail and briefly touches upon modern frameworks for distributed data processing such as Apache Spark. The final chapter discusses event streams, messaging systems, and the challenges that arise when trying to process this "data in motion". You might not be in the business of building the next-generation streaming system, but you'll definitely need a handle on these topics, because you'll encounter the described issues in the practical stream processing systems that you deal with daily as a data engineer.
As I said at the opening of this review, consider this a mini-encyclopedia for the modern data engineer, and don't be surprised to see more than 100 references at the end of some chapters; if the author had tried to include most of them in the text itself, the book would easily have gone beyond 2,000 pages!
At the time of writing, the book is 90% complete; according to its official site, only one more chapter remains to be added (Chapter 12: Materialized Views and Caching). So it is safe to say that I recommend this book to anyone working with distributed big data systems, NoSQL and NewSQL databases, document stores, column-oriented data stores, or streaming and messaging systems. As for me, it'll definitely be my go-to reference on these topics for years to come.
Honestly, this one took me much more time than I expected. It's definitely one of the best technical books I've read in years, but that still doesn't mean you should run straight to your bookshop: read to the end of the review first.
I'll risk the statement that this book's content will not be 100% directly applicable to your work, BUT it will make you a better engineer in general. It's like reading books about Haskell: most likely you'll never use the language for any practical project or product, but understanding Haskell (and the principles behind its design) will improve your functional-fu.
In this case, Martin (a true expert, and one of the people behind Kafka at LinkedIn, if I remember correctly) doesn't try to rediscover EAI patterns or feed you CAP basics; instead he dives deep into the low-level technical differences between practical implementations of message brokers and relational and non-relational databases. He discusses various aspects of distribution, but he doesn't stop at theory. This book is all about practical differences, based on actual implementations in popular technologies.
No, 95% of us will not write stuff I tend to call "truly infrastructural". No, 95% of us will never get down to implementing tombstones or dynamic shard rebalancing in Cassandra. But still, even reading about how those practical problems were solved will make us better engineers and add more options (and ideas) to our palette. For some younger engineers, it will prove that there's no mysterious magic behind the technology they use; it's just good, solid, pragmatic engineering after all.
Great book. Truly recommended. Try it yourself & enjoy how it tastes :) I would give 6 freaking stars if I could.
A must-read for every programmer. This is the best overview of data storage and distributed systems—two key concepts for building almost any piece of software today—that I've seen anywhere. Martin does a wonderful job of taking a massive body of research and distilling complicated concepts and difficult trade-offs down to a level where anyone can understand it.
I learned a lot about replication, partitioning, linearizability, locking, write skew, phantoms, transactions, event logs, and more. I'm also a big fan of the final chapter, The Future of Data Systems, which covers ideas such as "unbundling the database" (i.e., using an event log as the primary data store, and handling all other aspects of the "database", such as secondary indexes, materialized views, and replication, in separate "derived" data systems), end-to-end event streams, and an important discussion on ethics in programming and data systems.
The only thing missing is a set of summary tables. I'd love to see a list of all common data systems and how they fare across many dimensions: e.g., support for locking, replication, transactions, consistency levels, and so on. This would be very handy for deciding which system to pick for my next project.
As always, I've saved a few of my favorite quotes from the book:
"Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it). Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking."
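As an aside, the schema-on-read idea in that quote is easy to see in code. This sketch is my own hypothetical illustration (the field names are invented, not from the book): two documents written at different times coexist in the same store, and the reading code carries the implicit schema, including knowledge of the legacy format.

```python
import json

# Two documents written at different times; the database never enforced a schema.
old_doc = json.loads('{"name": "Alice Smith"}')
new_doc = json.loads('{"first_name": "Bob", "last_name": "Jones"}')

def first_name(doc):
    """Schema-on-read: the structure is interpreted only when the data
    is read, so this function must understand both document formats."""
    if "first_name" in doc:
        return doc["first_name"]
    return doc["name"].split(" ")[0]   # fall back to the legacy single-name field

print(first_name(old_doc), first_name(new_doc))   # -> Alice Bob
```

Under schema-on-write, the second format could never have been stored without first migrating the whole table; here the migration logic lives in the reader instead.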
"For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred. People sometimes make a connection between this principle and the special theory of relativity in physics, which introduced the idea that information cannot travel faster than the speed of light. Consequently, two events that occur some distance apart cannot possibly affect each other if the time between the events is shorter than the time it takes light to travel the distance between them."
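That definition of concurrency ("both unaware of each other") can be checked mechanically with version vectors. The following is my own minimal sketch of the standard comparison, not code from the book:

```python
def happens_before(a, b):
    """a happened before b iff b has seen every event a has seen, and more.
    Version vectors map replica id -> count of events seen from that replica."""
    keys = set(a) | set(b)
    return all(a.get(r, 0) <= b.get(r, 0) for r in keys) and a != b

def concurrent(a, b):
    """Per the quote: two operations are concurrent when neither is
    aware of the other, i.e. neither happened before the other."""
    return a != b and not happens_before(a, b) and not happens_before(b, a)

v1 = {"replica_a": 2, "replica_b": 0}
v2 = {"replica_a": 1, "replica_b": 1}   # each saw a write the other missed
v3 = {"replica_a": 2, "replica_b": 1}   # has seen everything in v1 and v2

print(concurrent(v1, v2))        # True
print(happens_before(v1, v3))    # True
```

Note that physical timestamps appear nowhere: ordering is derived purely from which events each side had seen, exactly as the quote suggests.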
"A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network."
"The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees."
"CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not. At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between either linearizability or total availability. Thus, a better way of phrasing CAP would be either Consistent or Available when Partitioned."
"The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried. Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships”) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation process gives you a mechanism for keeping it consistent with the event log."
"As algorithmic decision-making becomes more widespread, someone who has (accurately or falsely) been labeled as risky by some algorithm may suffer a large number of those “no” decisions. Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial services, and other key aspects of society is such a large constraint of the individual’s freedom that it has been called “algorithmic prison”. In countries that respect human rights, the criminal justice system presumes innocence until proven guilty; on the other hand, automated systems can systematically and arbitrarily exclude a person from participating in society without any proof of guilt, and with little chance of appeal."
"Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify that discrimination. If we want the future to be better than the past, moral imagination is required, and that’s something only humans can provide."
Some quite valuable content diluted with less useful content. I think I’d much prefer to read this author’s focused articles or blogs than recommend that someone slog through this.
I’m still not quite sure who the intended audience of this book is, but it’s definitely not me. The intro chapter discusses the example of Twitter’s fan-out writes and how they balanced typical users with celebrities who have millions of followers. Because of that intro, I expected a series of architecture patterns and case studies from running systems at scale. What followed was nothing like that.
The book suffers greatly from being overly academic and abstract. It tries to achieve both breadth and depth. Much of the tone felt like an encyclopedia of data-related technologies. The author namedrops dozens of technologies, many of them outside mainstream use. The low-level database chapter was insufferably detailed. I would’ve stopped reading there but it was a tech book club pick at work and I felt compelled to finish it.
Does this author know a lot about data tech? Definitely. Did I learn things from this book? Sure. Do I have a stronger understanding of issues related to data? I think so. Will I approach data problems differently in the future? Possibly. Did I come away with strong endorsement of some application architecture(s) to consider when building my next system? Not really.
I feel like most of that must be available in better forms elsewhere. If not, someone please write some more books.
Like you'd expect of a technical book with such a broad scope, there are sections that most readers in the target audience will probably find either too foundational or too esoteric to justify this kind of length. Still, at its best, I shudder to think of the time I'd have wasted groping in the dark for an ad hoc understanding of concepts it explains holistically in just a few unfussy, lucid pages and a diagram or two.
Definitely a book I see myself reaching for as a reference or memory jogger for years to come.
(5.0) An excellent summary, foundation, and set of recommendations for distributed systems development; it covers a lot of the use cases for data-intensive (vs. compute-intensive) apps and services. I recommend it to anyone doing service development.
Recommendations are well-reasoned, citations are helpful and are leading me to do a lot more reading.
Thank you for finding and sharing this one, @Chet. I think this will be a book we assign as a primer for working at Goodreads going forward. At least some of the (later) chapters.
How you perceive this book depends on how much you already know.
If you already know a lot about serialization (JSON, Avro, Google Protocol Buffers, MessagePack, you name it), database data structures (WAL, B+Trees, LSM-trees, you name it), and distributed systems (consensus via Paxos or Raft; messaging semantics like at-least-once and at-most-once delivery; idempotence; partitioning), you won't gain a lot.
If you've read dozens of whitepapers and _internals_ books, you won't gain a lot.
If you run Jepsen tests on your own product, you won't gain a lot.
But if you haven't, or if you're just starting your journey into the modern world of applications, then this is a really important book to read.
At the beginning of reading this, I vacillated between a three-star and a four-star rating. The book is organised into three parts. The first part is about data storage on a single machine. Whenever it would cover material I already knew, I'd be mildly bored. Whenever it covered material that was unfamiliar to me, I found the explanations lucid and fascinating. I venture that I would have been as pleased with the topics I already knew about, had I not already known about them.
In part II, the book expands into explaining how distributed data storage works. Most of this part outlines all the problems with distributed data, and it does a good job of it. Once it reaches possible solutions to the problems (in chapter 9), it becomes vague. I've no lack of trust in the author's knowledge of the topic, but I found the descriptions unhelpful. Perhaps it's just not that easy to explain consensus algorithms in an understandable manner. The book basically lost me there.
Part III was hard to get through. The book would have been 400 pages long without Part III, and in my opinion much better. The last two hundred pages are essentially walls of text, with no code, no diagrams, no illustrations. Since the first parts of the book included diagrams and technical details, it seems clear that Part III could have been much better. As it stands, it just felt rushed.
Had the entire book been like part III, I would have given it one or two stars, but parts I and II pull up the rating. Quite uneven.
Fantastic book. It took me almost 9 months to finish, but I am glad that I did. I think this is a very important read for anyone building an application or system that uses data in any way, shape, or form. Highly recommended.
You almost had me till the very end, Martin Kleppmann, but I will not let that ruin my experience in reading this little book of yours.
Going in, I thought I would be reading something like the classic System Design prep GitHub repos, with a lot of information delivered very quickly. You should know that this is purely about the data part: Kleppmann goes in depth on databases, message brokers, and batch processing, from the perspective of how the pieces of data are affected. There is little on pure infrastructure, testing, or CI/CD beyond what strictly pertains to the data.
I liked that the structure of the book meant the chapters built upon one another. The start is a standalone database on one machine; then we move to multiple machines, spread them across different datacenters, and make the latency and throughput requirements more ambitious. You can follow along with minimal experience in the domain, and it doesn't shy away from making certain generalisations, which is ultimately the point of this whole book: most systems nowadays are about getting data from point A to point B while enriching it along the way.
I learned how to think more critically about data quality, analysis, and its general flow through the system, particularly:
- Random additional latency, which will always remain non-deterministic because you cannot account for context switches to background processes, packet loss and retransmission, garbage collection pauses, or paging;
- Physical clocks and their perils;
- How a 1-second slowdown in responses can dramatically reduce customer satisfaction (even by over 15%!);
- Ways of structuring a database under the hood and what this means for incoming updates: handling them in place vs. via a log;
- What you might still keep on disk depending on size, even though it may cause random I/O;
- Caching techniques, such as storing your most recently evicted data somewhere and loading it back into memory later;
- Ordering events in a replicated and partitioned environment;
- Data cubes and star schemas;
- When a manual failover is actually more appropriate;
- Completely leaderless techniques for conflict resolution;
- Anti-entropy processes;
- Multi-index databases and how they work under the hood, including concatenated indexes.
The end I thought was very meh, though. It had the air of "let's summarise everything we've learned", with Kleppmann dedicating a full chapter to his vision of how data-oriented systems will evolve in the future. Ultimately, it takes all the information from the previous chapters and looks at what's missing and what would improve the current state of things. And then there are around 20 pages about the importance of data auditing, its association with surveillance, and the responsibility that the software engineer has in the process. Interesting, sure, but nothing groundbreaking that you wouldn't have heard around election time, perhaps in relation to Facebook or Palantir, so I ended up skimming that chapter; the style struck me as too pompous.
Finally read it cover to cover; I'd started it multiple times before, or jumped straight to some specific topic. It's a wonderful book if you're interested in big data and related fields. The main problem with it is that it's easy to go down the rabbit hole of the individual links to papers, articles, etc. This time I didn't try to read every paper, but collected the most interesting and useful ones in a separate folder.
Was it easy to read? It may have been the hardest thing I’ve ever read. The writing style is actually nice and quite colloquial for a technical book. But there is so much information in it! It took me ages (half a year) to get through it, taking notes as I read.
What I liked about it: the sheer amount of information and the vast range of contexts it covers. Important concepts like response time percentiles, linearizability, and serializability are explained; there's a deep dive into database theory, different system architectures with pros and cons, examples and explanations of how some widely used software products are implemented, and even some fun facts, like how the NoSQL name was born and the Sushi principle ("raw data is better"). Oh, and a very nice chapter on the ethics of software development at the end!
What I disliked: reading the book felt like talking with a really smart dude who sometimes forgets to give context for the very technical stuff he mentions. I was missing a foreword for each chapter explaining why this is important for me as an application developer and why we are talking about it here. Also, since the book covers vast oceans of information, some parts of it are less useful or harder to apply in the day-to-day job than others.
But this doesn’t prevent it from being a great read for anyone working with distributed systems. That must be the reason why it is O'Reilly’s second most popular book for 2019.
In this category, this is perhaps one of the best books that exist on the subject; however, there’s nothing in this book about how to specifically design my own data-intensive applications. This is more an overview of different distributed database design ideas and the challenges of designing proper distributed database systems and applications. As an overview of those topics, this book is awesome, but it failed to deliver what the title proposes. I really felt enriched by broadening my understanding of the different challenges of distributed computing and by the numerous references to further material. The book is a good introduction to a number of related topics, but not enough information is covered to consider that, after reading its pages, I’m ready to solve any of the problems explained in its chapters or to design a proper data-intensive application. I found it rather long, theoretically shallow, and impractical for me. However, it was an interesting read and I did learn a few valuable lessons.
Actually closer to 4 stars, but it’s probably unfair because a lot of work has gone into researching and writing this book and I don’t want to bring down the average.
My issue with this book is that the title does not match the contents. The book deep-dives into the inner workings of databases, messaging systems, and other storage solutions, but it rarely gives any actual advice on “building data-intensive applications”. A more appropriate title would be “How Databases Work”.
There’s still a lot of useful information and advice in this book but it doesn’t deliver on what it promises on the cover.
I think everyone interested in the subject should definitely read it. I would divide the book in two: Architecture and Data Architecture. (Even though the second part didn't interest me much,) the author examines the subject very thoroughly. It's impossible not to be impressed.
This book changed my view of designing applications! What does "data-intensive" mean? We call an application data-intensive if data is its primary challenge: the quality of data, the complexity of data, or the speed at which it is changing.
Who should read this book? I think all developers should. If you develop applications that have some kind of server or backend for storing or processing data, and your application uses the internet, then this book is for you.
Why should you, as an application developer, care how the database handles storage and retrieval internally? Because you need to select a storage engine that is appropriate for your application from the many that are available, and in order to tune a storage engine to perform well on your kind of workload, you need a rough idea of what it is doing under the hood.
What is the scope of this book? It compares several different data models and query languages, then turns to the internals of storage engines used by systems such as Cassandra, Redis, and MongoDB, looking at how databases lay out data on disk. It also covers distributed data and distributed systems and their challenges, such as consistency, scalability, fault tolerance, and complexity.
A fantastic book that should be mandatory reading for all software developers. It covers databases and distributed systems in a detailed and accessible way. Very clear writing, good diagrams and illustrations, and no fluff. I have written a summary of all the chapters on my blog: https://henrikwarne.com/2019/07/27/bo...
This took quite a while to work through, but was definitely worth it.
Some of the material on distributed systems felt like it was presented circuitously: Kleppmann would begin with ideas that struck me as obviously flawed, only to correct them iteratively before arriving at a better (and to me more obvious) approach. This might be because of the papers I'd already read on the topic, though.
I really like how Kleppmann generally tries to focus on first principles and build up from there. He also constantly gives concrete examples about how certain companies or vendor products approach these problems, and while I might not remember many of those specifics it was useful to be able to say explicitly where a concept can be found in the wild.
This book is monumental. It explains many aspects of designing data applications in a very approachable way. It has everything, from the high-level differences between SQL and NoSQL to the low-level details of how databases work. The explanations are clear and accompanied by code samples, diagrams, and examples of data engines that work that way.
Part I of the book covers the fundamentals (e.g., how to handle data on a single machine). Part II covers distributed data: how to handle it and the issues you'll face. Part III covers derived data (batch and stream processing) and the author's opinion on the future of data systems.
If you've ever wondered how a database stores data, what the difference is between transaction isolation levels, what to do when the data doesn't fit on a single machine, what the heck a data lake is, how a star schema differs from a snowflake schema, or how MapReduce works (or a hundred other data-related questions), then this book is for you. I've learned so much from it, and I'd recommend it to anyone working in IT.
Probably the best-written technical book I have ever read. Martin Kleppmann is vastly knowledgeable about all types and classes of databases and principles of data processing, but also uncannily talented at teaching others with clarity and a pinch of subtle humour. He covers the entire map of the territory that is data processing principles and systems in great detail (and delightfully toys with the map metaphor at the beginning of each new chapter), yet never gets bogged down. The book finishes off with an insightful holistic overview and a cautionary look into the future of data systems and their growing role in human society. I would recommend this book to anyone involved with software systems on a technical level.
I have finished reading this book, once. I have to come back to this book to re-read some chapters. I need to do further studies from the references. I think it will be my companion for a long time in future. And yes, I agree with many readers of this book - this should be a required reading for programmers.
A good, solid book, but the translation falls short in places (the "cartographers" and "reducers" in the MapReduce chapter were a treat). The final part on the future of information systems is interesting; I liked the concept of the "algorithmic prison". Overall, a useful read.