
Database Internals: A deep-dive into how distributed data systems work

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it’s often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals.

Throughout the book, you’ll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You’ll discover that the most significant distinctions among many modern databases reside in subsystems that determine how storage is organized and how data is distributed.

This book examines:

Storage engines: Explore storage classification and taxonomy, and dive into B-Tree-based and immutable log structured storage engines, with differences and use-cases for each
Distributed systems: Learn step-by-step how nodes and processes connect and build complex communication patterns, from UDP to reliable consensus protocols
Database clusters: Discover how to achieve consistent models for replicated data

376 pages, Paperback

Published November 4, 2019


About the author

Alex Petrov

1 book, 45 followers

Ratings & Reviews

Community Reviews

5 stars: 146 (48%)
4 stars: 99 (33%)
3 stars: 47 (15%)
2 stars: 7 (2%)
1 star: 1 (<1%)
Displaying 1 - 30 of 36 reviews
Profile Image for Sebastian Gebski.
920 reviews, 785 followers
February 9, 2020
One of the best tech books I've read in the last 12 months.
It consists of 2 parts: DB internals & DB distribution internals.

The 1st part is pure gold - one can learn about B*-trees, LSM-trees, differences between locks and latches, memory VS disk optimizations, rebalancing, concurrency models for transactions and much, much more. I can't recall any single book that covers as much deep-level knowledge on these topics.

The 2nd part is less unique - there are other good resources on distributed systems. What I liked was a solid description of the Paxos algorithm (including variants: multi- and fast-). The chapter about anti-entropy was very solid as well (it's a pity I didn't have any comparable resource on the topic before I started working with Cassandra).

No point in extending this review further - it's a great book, just grab it (if you're keen on the topic) - you won't find anything better.
Profile Image for Bugzmanov.
180 reviews, 39 followers
December 13, 2019
I liked this one a lot.

It complements nicely "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann.
While Kleppmann's book provides a pretty solid overview of the data processing landscape, this book goes deeper into implementation details, data structures and algorithms.
It's a bit more dry and more technical, but still it's a relatively easy read.
Profile Image for Emre Sevinç.
143 reviews, 270 followers
May 29, 2020
“Database Internals: A Deep Dive Into How Distributed Data Systems Work” by Alex Petrov belongs to a very special category of O’Reilly books, alongside “Designing Data-Intensive Applications” and “Cassandra: The Definitive Guide”, in the sense that it is a serious deep dive into the most fundamental and challenging aspects of the big, distributed data systems that we rely on daily.

Today there’s an unprecedented proliferation of distributed database technologies, combined with an ever-growing multitude of cloud computing services offering them, as well as rapid advances in physical storage systems such as NVMe SSDs that force us to consider different trade-offs and algorithms. A working software developer, system engineer, solution architect, or CTO can easily be overwhelmed by so many distributed NoSQL, NewSQL, time-series, graph, document, key-value, and embedded databases, in addition to the typical, traditional, enterprise RDBMS variants. Luckily, all of these fancy distributed database technologies are built on a limited number of concepts, techniques, and algorithms, which are concisely introduced and surveyed in the “Database Internals” book.

Profile Image for Adrian.
118 reviews, 11 followers
September 18, 2020
Unfortunately I read this book after Martin Kleppmann's Designing Data-Intensive Applications, and that is probably why I rated it poorly.

The book starts really well by describing the internal subsystems found in any DBMS (connection listener layer, query parser and optimizer layer, execution layer and, of course, the storage layer).

I wanted the book to explore this subject further: how the components are designed, how they deal with concurrency, etc.

The book then took a deep dive into tree-based data structures... as in really deep.
I found myself at times wondering why I was reading this. It was way too terse.

After this part the book became interesting once again, when it started tackling distributed transactions, consensus, replication, Byzantine faults and Paxos algorithms.

I was already familiar with all of those, which in my humble opinion were tackled slightly better in the first book I mentioned.

All in all a good book, but I wouldn't classify it as complementing Martin Kleppmann's.
Profile Image for Mikhail Filatov.
181 reviews, 7 followers
April 2, 2020
The book is a strange mix - the second part is really about distributed data systems and it's ok - while "Designing Data Intensive applications" is better in this part.
The first part contains a lot of descriptions of different implementations of B*-Trees (replace * with any other symbol(s)) - most of them unreadable.
Profile Image for Bilal.
108 reviews, 5 followers
April 28, 2020
The book is divided into two parts: The first part deals with storage on hard disk and solid state storage but in the context of a singular system; while the second part deals with distributed systems. In this sense it differs from most other books on distributed storage that typically do not discuss the topics in the first part of this book.

I found the book informative, but not very effective in building a solid understanding of concepts. I felt the author jumps from idea to (related) idea too frequently in the manner of short paragraphs, and in so doing doesn't see an idea through to the end in enough detail for it to be learned properly. Perhaps the first part was better presented; the second was not.
November 5, 2020
Good and interesting content. But some chapters are scattered, with missing transitions between topics or algorithms. Other parts are developed well. Diagrams are missing where I expected a better explanation, or are present for obvious things. The writing is 3/5, but the content is 5. So it is 4/5. I'm glad I've read the book and will definitely get back to some chapters to refresh some details.
Profile Image for Łukasz Słonina.
122 reviews, 16 followers
April 7, 2020
I like this book for the content: if you would like to know more about databases and distributed systems, plus get a long list of further reading, then go for this book. What I don't like is that this material does not read like a book (e.g. DDIA); it's more like a compendium of algorithms, data structures and theories. Some of the algorithms could be better presented (more diagrams).
Profile Image for Marcin Golenia.
33 reviews, 5 followers
April 25, 2021
The book is organized into 2 parts and let me review the book in two parts.

1. Storage Engines (5/5)
I didn't expect that we would get so much into the internals. Hats off! The author's knowledge is extensive and nicely presented through the book's text and illustrations. Everything you need to know - hardware (HDD, SSD) and the relation between it and data storage algorithms, transactions, slotted pages, B-tree variants, LSM. Great stuff.

2. Distributed Systems (4/5)
There is a gap between the complexity level of the topic introductions and the actual "meat" that is described. It starts nice and easy, then boom! "Percolator transaction execution", "Calvin" and "Spanner". Hard stuff, and this part may have you rereading a page a few times before you understand it. I failed to do so in a few places, but I am fine with this - the author provides many references, so you can pick up the intermediate knowledge from there.

I know I will forget some of the advanced things that Alex tried to explain to me, but I will know where I should look for them ;) All in all, the book will help you build a nice end-to-end understanding of how a database really works in both places - your computer and a big cluster.

Profile Image for Ahmad hosseini.
265 reviews, 64 followers
February 20, 2021
In part 1, the book explains the internal structure of a database in detail and examines its parts, such as the storage engine, very well.
“The storage engine (or database engine) is a software component of a database management system responsible for storing, retrieving, and managing data in memory and on disk, designed to capture a persistent, long-term memory of each node.”
Part 2 explains the characteristics of distributed systems in general and examines some specific topics related to distributed databases.
The book also points to good sources for further reading.
Profile Image for Bartosz Sypytkowski.
39 reviews, 9 followers
August 9, 2020
This book goes nicely together with "Designing Data-Intensive Applications" by Martin Kleppmann: they both focus on the core, fundamental concepts of persistent, distributed systems, providing a wide variety of known algorithms and protocols for common problems in that area, including the rationale behind each one, which helps to build intuition about their trade-offs. It's also full of references for anyone who wants to continue a more in-depth exploration of a given topic.
Profile Image for Leonid.
159 reviews, 12 followers
April 5, 2022
I've found some similarities between this one and my favourite "wild boar book" ("Designing Data-Intensive Applications" by Martin Kleppmann), though in the first part, dedicated to DB internals, it goes much more in-depth, and there is tons of awesome stuff there. The second part, dedicated to distributed systems, is a bit less unique, but still very good, and worth reading.
So, wholeheartedly recommend to anyone working with DBs, or just distributed systems in general.
4 reviews
February 1, 2021
This book really feels like two incomplete books in one. The first half of the book focuses on database internals, file formats, caching strategies etc. The second half of the book switches gears and dives into the components (algorithms, and strategies) used by distributed systems.

The problem is that there is nothing to tie the first and second parts of the book together. You could be reading entirely different books. The second issue is that even within each part, you are presented with a lot of great information, but there is no guidance (imo) on how you may want to logically put the pieces together to build a complete system.

True to the book's title, it does a great job of exploring the internals of database systems. If you are looking for how a specific component is built (i.e. you want to learn more about Raft consensus), this is a great resource. If you are looking for a book that ties these concepts together, however, I would suggest looking elsewhere.
Profile Image for Vishwanath.
42 reviews, 5 followers
May 3, 2020
Informative, but I would have preferred more examples with practical scenarios. There is no code and this is all mostly conceptual. Some good references to papers for subsequent reading. The first part of the book deals primarily with storage and covers an in-depth discussion of B-trees and their types. The second half is focused on distributed systems and has useful sections on consensus protocols. Concepts like "2-phase commits" are explained well with figures. However, the lack of practical examples/code and the overall dry subject matter made this a laborious read. A good book for referencing theoretical concepts.
Profile Image for Lauro Caetano.
8 reviews, 4 followers
April 2, 2020
Excellent book! It goes a bit in the same direction as Designing Data-Intensive Applications when it talks about distributed systems, distributed transactions and so on.
But this book goes some steps further, explaining how the DB represents data internally and also explaining distributed systems algorithms.

Excellent read!
Profile Image for Ivan.
221 reviews, 9 followers
December 30, 2020
A detailed, though not exhaustive, description of the data structures and algorithms behind modern systems (the gaps are made up for by the large number of references and recommendations for further study).
16 reviews, 1 follower
December 31, 2021
Great, content-wise, although it seemed to lose the flow/transitions when describing concepts.
I had to continuously take notes and cross-reference them to keep myself from losing the flow.
Profile Image for Ricardo Hernández.
108 reviews, 3 followers
June 24, 2020
A book that I really appreciated for expanding my knowledge and giving me a 360-degree refresher on database essentials. The progression of the book is built in an organic way, starting from basic concepts at the low-level implementation and moving on to modern distributed challenges. This helps you build comprehension naturally, in contrast with some other technical books that jump from topic to topic without any regard for cognitive challenges.

Totally recommended to get a solid understanding on databases to help solve contemporary problems.
22 reviews
July 31, 2022
Solid coverage of concepts related to database technologies. I felt some concepts weren't super intriguing, and I didn't think they connected well. The distributed systems portion was not as well done. I did learn some core ideas, but I don't feel as well-read on them compared to the database concepts. I also think a section on which services are commonly used in the real world to handle these problems, along with their trade-offs, would be a nice addition.
Profile Image for Idir Yacine.
49 reviews, 1 follower
June 19, 2022
I believe this book is meant specifically for database admins and engineers (low-level stuff). As such, if you are trying to get a high-level understanding (practical code examples, real-life implementation examples) to get the job done, this is not the book for it. All said, the book is still pretty decent and worth the read.
8 reviews, 1 follower
January 4, 2023
I had mixed feelings about this book. The first part deals with the concepts behind DBMSs, storage engines, B-trees, disk storage, and so on, which is very informative. The second part is really about distributed data systems. It didn't feel like reading a book, but more like reading theory or lecture slides at university. Again, the book is very informative but not easy to read.
September 22, 2020
This book seems unbalanced to me.
Too complex for technical overview, too shallow for deep dive.
But it is a good starting point to learn about different areas of database and distributed system design.
Profile Image for Lu Pan.
2 reviews, 3 followers
October 19, 2020
The first part of the book is better. Part two, which focuses on distributed systems, is less of a deep dive, and I agree with the other reviewers that Data Intensive is a better book on distributed systems. But this is still a great book on single-host databases!
Profile Image for Aboullaite Mohammed.
9 reviews, 3 followers
January 19, 2021
I very much liked reading this book! It nicely complements "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann. I highly recommend it!
Profile Image for Edmund.
30 reviews
January 19, 2022
I finally managed to complete this book on my third reading. While the book has a lot of valuable information in it, I personally think it tries to do too much in too little space. This makes the writing style quite terse and difficult to get through.
Profile Image for Matt Eland.
19 reviews, 5 followers
February 13, 2022
This book nearly made me cry. Such depth and detail, but it's a good look into the problems a database must overcome and the various algorithms that enable reliable, high performance, distributed database systems.
6 reviews
April 5, 2022
Probably the densest technical book I have read that featured very little (if any) code. That being said, my admiration for (and understanding of) databases, which we take for granted to simply 'work', has increased.
199 reviews, 4 followers
January 4, 2023
* goes quite in-depth; I took what I wanted
* types of database systems - HTAP was interesting to read about
* compares row- and column-store databases
* B-Tree, LSM, Skiplist, Hybrid Gossip protocol are some things that are discussed in detail
Profile Image for SolidM.
171 reviews, 1 follower
October 20, 2020
The book is split into 2 parts: algorithms and distributed systems.
A must-read for deepening your knowledge of DBs.