Services like social networks, web analytics, and intelligent e-commerce often need to manage data at a scale too big for a traditional database. Complexity increases with scale and demand, and handling big data is not as simple as just doubling down on your RDBMS or rolling out some trendy new technology. Fortunately, scalability and simplicity are not mutually exclusive—you just need to take a different approach. Big data systems use many machines working in parallel to store and process data, which introduces fundamental challenges unfamiliar to most developers.
Big Data teaches you to build these systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.
Big Data shows you how to build the back-end for a real-time service called SuperWebAnalytics.com—our version of Google Analytics. As you read, you'll discover that many standard RDBMS practices become unwieldy with large-scale data. To handle the complexities of Big Data and distributed systems, you must drastically simplify your approach. This book introduces a general framework for thinking about big data, and then shows how to apply technologies like Hadoop, Thrift, and various NoSQL databases to build simple, robust, and efficient systems to handle it.
The first chapter is definitely worth reading. Maybe the second one too. The rest is way too focused on specific technologies. And as it happens, those technologies were created by the authors. Too much advertising, not enough of the big picture.
1. Worst title ever. If it weren't by Nathan Marz (father of Storm), I'd never have picked it up. 2. It's not just a bad title - this book is NOT about Big Data - or rather, it's about one particular "pattern" of Big Data usage - the Lambda Architecture. 3. It's great in terms of distributed processing / storage considerations - best micro-batching description / analysis ever. 4. The particular product (Kafka, Hadoop, Storm) descriptions are ... controversial. Not a deep dive, but the authors don't hold back on code samples - I don't think it's possible to understand them without external knowledge of these components. It wasn't much of a problem in my case, but it may be for others. And that makes me wonder who this book is made for.
In the end, I enjoyed the content. Without doubt, the authors know their stuff & they put a lot of effort into presenting their knowledge and their thoughts - but this book feels a bit like a mosaic of great "articles" that scream to be shared rather than a holistic, well-targeted book.
Sadly not my kind of book. It starts with several examples that use "Gender": from a gender field inside the database that only knows "Male" and "Female" to an example that tries to guess a user's gender based on their first name.
On top of that, it is filled with unnecessary and bad diagrams whose content is explained in a single sentence, yet the authors apparently thought it would be good to also draw three boxes and some arrows between them.
The source code examples are not very good either, with arrows pointing into the code to explain what it is doing. One hint: if you need five arrows to explain a seven-line function, maybe you should find a better code example that doesn't need them.
It wouldn't be an exaggeration to say that Nathan Marz, as the original developer of Storm (along with many other relevant pieces of software, such as Cascalog), is among the inventors of the whole Big Data thing. Storm has enabled complicated real-time pipelines to be built without the headaches of coordinating data transmissions and routing. It is thus a boon that he, together with James Warren, went on to write a book on this exact topic, sharing the tips and ideas that went into building Storm. As such, it is no surprise that the book is a great overview of the field and its fundamental techniques, and has already become standard reading.
The demands on a big data system, essentially an OLAP application that has to scale linearly with the amount of input, are fundamentally different from those on an OLTP system, which developers are normally used to building. These requirements are robustness (also in the face of human error), scalability, modularity, and ad hoc queries (e.g. joins). The path taken by many development teams in the face of these differing requirements is the orchestration of existing tools, coupled with simply more code. As the authors point out, this approach leads to overly complex, fragile systems, because the transactional tools were not built for reliable and robust computation, and bring too much complexity overhead with them. The alternative is to start with design principles and an accompanying family of software tools that give you these requirements from the beginning, when placed into the right design philosophy. The name given by the authors to this design philosophy is the Lambda Architecture.
The lambda architecture starts with the principle of immutable data. Data is the raw bits and bytes the system receives, and cannot be derived from anything else. It is at the beginning of the information dependency chain, so to speak. Turning this data into a useful form and storing it is the job of three different layers of processing: the batch layer, the serving layer, and the speed layer. The batch layer is responsible for running preprocessing on the original data to turn it into a more accessible form. It has to be performant, scalable, and tolerant to human error. These properties are achieved by using simple storage solutions such as the file system, recomputation algorithms on immutable data, and parallel computation. What the batch layer does not need to be is low-latency. The computations are allowed to run over longer periods of time, on the order of tens of minutes, and work on complete sets of raw data. A central theme of the book, alluded to above already, is avoiding accidental complexity by reducing each layer to the necessary minimum of functionality. In the batch layer, this translates to keeping data immutable in terms of storage, and using recomputation algorithms to create the batch views. Recomputation algorithms have three advantages compared to incremental ones: they can be faster, error correction is just recomputation, and they tend to be simpler.
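To make the recomputation-vs-incremental distinction concrete, here is a minimal sketch of my own (not code from the book); the Pageview record and its fields are hypothetical:

```java
import java.util.List;

public class PageviewCounts {

    record Pageview(String url, long timestamp) {}

    // Recomputation: rebuild the view from the complete, immutable master dataset.
    // Correcting an error simply means running this again over the raw data.
    static long recomputeCount(List<Pageview> masterDataset, String url) {
        return masterDataset.stream()
                            .filter(pv -> pv.url().equals(url))
                            .count();
    }

    // Incremental: adjust a previously stored count as each new event arrives.
    // Faster per event, but a bug here silently corrupts the stored view.
    static long incrementCount(long storedCount, Pageview newEvent, String url) {
        return newEvent.url().equals(url) ? storedCount + 1 : storedCount;
    }
}
```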
The obvious choice for batch processing is Hadoop, and it is not any different in this book. Hadoop combines scalable, distributed storage and parallelized computation with HDFS and MapReduce. The authors go into some detail on storing and processing data on Hadoop using the Pail data partitioning library and the JCascalog data processing and querying library. One of the weaknesses of the book is obvious in this chapter. Hadoop is not a breeze to install, and the code examples in the batch processing chapters are there only for the reading; they are not particularly 'hackable'. Also, the code is in Java, which might make sense considering the target audience and the fact that Hadoop and the other big data tools are written in it, but it's not the prettiest code to look at. I ended up not even skimming the Java code, since it's not my favorite way of spending time, and just read the textual explanations. The examples picked by the authors (unique views per time window with multiple IDs per user and bounce rate analysis) are fortunately not too simplistic. I can imagine that the code examples are relevant for people who use Java to implement similar things.
The batch layer processes the mass of incoming data to precompute batch views: condensed data that can be stored and easily combined to generate information of interest. Data is condensed in two senses: accumulation and correlation. Accumulation is the calculation of measures over the data, such as counts or averages, while also filtering out the parts that are irrelevant. Correlation is the combination of rows of data based on column attributes, the simplest case being concatenation, the more complex cases being kinds of joins. The authors give examples of these operations for sample algorithms, first in the standard Java way of writing MapReduce, then using an alternative library called JCascalog. JCascalog allows the description of parallel computations in a style much closer to pipe diagrams, decoupling computation from physical storage. The discussion of JCascalog is rather in-depth, but again, it would have been very useful to have a virtual machine or similar container in which the reader could easily poke around the examples, and maybe even solve a few exercises. The code examples will not make developers like me, who eschew programming models based on extensive internal state, big friends of Java, as the authors state that components like the aggregators function by "adjusting some internal state for each observed tuple" (p. 129). This appears to be a general pattern: because Java allows only limited kinds of generic programming, a lot is done using strings and internal state.
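To illustrate the two condensation steps in plain Java (no Hadoop or JCascalog, and with class and field names of my own invention), a hand-rolled sketch might look like this:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Condense {

    // Accumulation: reduce many records to one measure per key
    // (here, pageview counts per URL).
    static Map<String, Long> countByUrl(List<String> urls) {
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum);
        }
        return counts;
    }

    // Correlation: combine two datasets on a shared key
    // (an inner join of user id -> name with user id -> pageview count).
    static Map<String, String> joinOnUserId(Map<String, String> names,
                                            Map<String, Long> viewCounts) {
        Map<String, String> joined = new HashMap<>();
        for (var e : names.entrySet()) {
            Long views = viewCounts.get(e.getKey());
            if (views != null) {
                joined.put(e.getKey(), e.getValue() + " viewed " + views + " pages");
            }
        }
        return joined;
    }
}
```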
The following two chapters are dedicated to the design and implementation of a sample batch layer for website analytics. The individual features selected for this example are of varying complexity along differing dimensions. Pageview counts by URL per time period splits data on the time dimension, whereas unique visitors by URL per time period also requires keeping track of which user visited which page. The unique-visitors task is complicated by the fact that the same user can be identified with different IDs, and ID equivalence can come in after the user visits a page. The last feature, bounce-rate analysis, is again different in that it requires tracking the time difference between different visits by the same user. The implementation is explained in detail with actual code, which is a bit tedious at times, but would definitely be useful when you're working on actually implementing something.
The condensed data created by the batch layer is saved in the serving layer. The batch layer, by virtue of having access to all of the batch data, can condense it by the previously mentioned processes of aggregation and correlation so that there is not only less of it, but the data is transformed to enable efficient queries at the serving layer. These queries can require things like joins, grouping on columns, or calculating set cardinalities (made faster by approximate algorithms such as HyperLogLog). The serving layer has to be designed with the aim of presenting the condensed data in a reliable and rapid manner. Therefore, it should again be distributed to enable fault tolerance, and should allow indexes and colocation for fast retrieval of ranges of values. The first chapter on the serving layer goes into considerable depth to explain how an incremental approach that unifies read and write functionality would not be able to achieve performance similar to the batch & serving layer split. Afterwards, a sample serving layer that stores the results of the previously built batch layer is built, using ElephantDB as the storage engine. ElephantDB is a distributed key-value store explicitly built for exporting data out of Hadoop. One of its major features is that the creation of indexes is completely separate from serving them. The indexes are created from shards of data at the end of a Hadoop job, and then fetched by the ElephantDB process under suitable load conditions. It is still not the ideal serving layer database, though, because it does not offer range queries or built-in HyperLogLog sets.
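As a toy illustration of the read path of such a batch-indexed, sharded key-value store - my own sketch of the idea, not ElephantDB's actual API - consider:

```java
import java.util.List;
import java.util.Map;

public class ShardedServingLayer {

    // Each shard is a precomputed, read-only index produced by the batch layer.
    private final List<Map<String, byte[]>> shards;

    ShardedServingLayer(List<Map<String, byte[]>> shards) {
        this.shards = shards;
    }

    // The batch job decides shard placement with the same hash function,
    // so a read only has to touch a single shard.
    private int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    byte[] get(String key) {
        return shards.get(shardFor(key)).get(key);
    }
}
```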
The last component of the lambda architecture is the speed layer. This layer is responsible for real-time processing of fresh data in a limited time window. In order to achieve speed, incremental algorithms are used in this layer, but error correction is still not done by correcting results, but by letting invalid results fall out of the window of processing. The requirements for view data storage in this layer are different from those in the serving layer. Since incremental algorithms are used, batch creation of sharded indexes is not enough; random writes must also be allowed. The correctness requirements are also laxer. Since the results will be improved when the batch layer kicks in and processes the complete dataset once it falls out of the real-time view window, approximations for the sake of speed are welcome in the speed layer. This is called eventual accuracy. Due to the use of incremental algorithms and the general availability requirements on all layers, speed layer storage faces particular complexities. One of these is the CAP theorem, which concerns the consistency vs. availability trade-off in the presence of network partitioning. Since distributed storage systems are used, partitioning is a condition that definitely has to be accounted for, and in the presence of partitioning, special data structures called conflict-free replicated data types (CRDTs) have to be used to make incremental algorithms work. There are two sets of tools that can be used to deal with these complexities. The first is asynchronous updates, where the data in the store is updated not individually from each speed layer process, but queued in a bus, which can also buffer for batch updates. Another is expiring the views that are old enough to be included in the batch layer, and so may be incorrect.
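The simplest CRDT is the grow-only counter (G-counter): each replica only increments its own slot, and merging two replicas takes the per-slot maximum, so increments are never lost no matter in which order replicas sync. This is a generic illustration of the concept, not code from the book:

```java
import java.util.Arrays;

public class GCounter {
    private final long[] slots;   // one slot per replica
    private final int replicaId;  // which slot this replica owns

    GCounter(int numReplicas, int replicaId) {
        this.slots = new long[numReplicas];
        this.replicaId = replicaId;
    }

    void increment() {
        slots[replicaId]++;
    }

    long value() {
        return Arrays.stream(slots).sum();
    }

    // Merge is commutative, associative, and idempotent, so replicas converge
    // even when updates arrive in different orders after a partition heals.
    void merge(GCounter other) {
        for (int i = 0; i < slots.length; i++) {
            slots[i] = Math.max(slots[i], other.slots[i]);
        }
    }
}
```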
The sample implementation for the speed layer starts with storage for realtime views, built on the Cassandra data store. Cassandra is a column-oriented database which the authors prefer to describe as a map with sorted maps as values. The data is arranged in column groups, which are themselves key-value mappings, where the values are also sorted key-value maps. These are colocated, so efficient queries on the first level of key-values are possible. A number of different patterns for processing data in real time and feeding it into the data store are then discussed, such as single-consumer vs. multi-consumer queues, one-at-a-time vs. micro-batched processing, and the queues-and-workers model vs. the Storm model. Storm was also originally written by Marz, and uses an alternative processing model for fast stream processing. The processing pipeline is represented in Storm as a topology that consists of streams (sequences of tuples), spouts (sources of tuples), and bolts (which take streams and produce other streams). The path followed by a tuple in this topology corresponds to a directed acyclic graph (DAG), which can be thought of as an alternative to queues, in that instead of maintaining intermediate queues that track what is processed and what is not, the position of a tuple in the DAG is stored. This turns out to be a relatively cheap process, requiring only about 20 bytes per tuple. When a tuple is found to have failed, it is reprocessed starting from the spout. This way, an at-least-once guarantee similar to that provided by queues can be given by Storm.
In the illustration for speed layer stream processing, a Storm topology for calculating the uniques-over-time view, and another for bounce rate analysis, are built with the help of Kafka and Zookeeper. The first example serves to illustrate simple Storm topologies, whereas the second is for the more complicated micro-batch processing. The first example is very Java-centric, partly because Zookeeper is used, and it reads like an exploded version of a more concise language. The second example includes a more interesting discussion of one-at-a-time vs. micro-batch processing. One-at-a-time processing guarantees that a tuple will be processed, but failure tracking and replay happen at a per-tuple level. It cannot give the exactly-once semantics required for tasks such as counting. Exactly-once semantics can be achieved using micro-batch processing, in which batches of tuples are processed together, and the state is stored in terms of IDs for these batches. Each bolt stores the ID of the last batch that it processed, and when a batch errors, whether it was already processed can be determined by comparing IDs. In the demonstration section, the bounce rate analysis task is implemented using Trident, a library for building pipelines on Storm, Kafka, and Cassandra.
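The batch-ID trick can be sketched in a few lines (my own illustration, with hypothetical names, not Trident's API): state updates are stored together with the ID of the batch that produced them, so a replayed batch is detected and skipped, turning at-least-once delivery into exactly-once state updates.

```java
public class ExactlyOnceCounter {
    private long count = 0;
    private long lastAppliedBatchId = -1;

    // Called once per micro-batch, in batch-ID order.
    void applyBatch(long batchId, long tuplesInBatch) {
        if (batchId <= lastAppliedBatchId) {
            return; // this batch was already applied before the failure/replay
        }
        count += tuplesInBatch;
        lastAppliedBatchId = batchId;
    }

    long count() {
        return count;
    }
}
```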
As you can see from the length of this review, Big Data is a book with a lot of substance. Here is what this book does not tell you, however: how to analyze the data and derive insights out of it. Other than that, pretty much any topic relevant to big data systems is mentioned. If you are working on a big data system, there is no way around this book.
This book was my first exposure to an architecture for dealing with large amounts of data in a holistic way; while I'm familiar with individual concepts like MapReduce, Column Stores, CAP, etc... I've never thought about them all as part of the same ecosystem. As such, my rating is based on the accessibility and readability of the book, not on the correctness and feasibility of the content.
This is the kind of technology stack my current employer is forming a business around and I want to get started as soon as possible; I was recommended this book by someone who has built these systems before. I managed to read through it in a day and never felt daunted or lost in the text. While there are certainly parts I chose to skim over because I feel I'll be better off examining them in depth while I tackle that particular part of the infrastructure, I feel the overall gist of what this book is enabling me to build was covered in a very understandable way. Even if I don't remember much of the book's particular details, I know when I'll need to revisit them and where I need to look.
To sum it up very briefly (and hoping I'm not messing up), this book spells out a proposed general architecture for processing huge amounts of data (The Lambda Architecture) and covers the five layers it comprises: 1. The data ingestion layer 2. The batch layer (for views that take a long time to process) 3. The serving layer (for serving the information generated by the batch layer) 4. The speed layer (for quickly showing derived information that has been added since the last batch, can also be used for real-time views) 5. The querying layer, to get back specific information.
Along the way it defines data to mean raw information, vs. the information we will derive via views. At each of these layers, the authors go into the things you will have to consider (algorithm choice, anticipated gotchas, the nature of the problems being solved) and use a concrete solution to demonstrate how those problems would be implemented. While particular pieces of software are chosen, they are used to discuss the issues in real-world terms and the book does a good job of not being beholden to particular implementations.
I have never read books by Manning Press before, generally choosing to stick to O'Reilly publications and the occasional Pragmatic Bookshelf if it involves Ruby. This book impressed me greatly, and it's still in the process of being read. I will eagerly look over the rest of Manning's catalogue to see if they reach this level of quality.
An essential read to understand complete Big-Data ecosystems, the technologies to use, and where each technology fits. Though if you're looking for in-depth knowledge and discussion of one specific tool, you've come to the wrong place.
If you keep in mind that the goal is understanding the complete big-data ecosystem, you will find the book interesting and engaging. If you expect to learn the programming aspects of various technologies from this book, you will find it boring. Though there are illustration chapters alongside each theoretical chapter in this book, the illustration chapters aren't very engaging.
As written in several other reviews, this book tells the story of one opinionated approach to the problems in the Big Data domain. The author, also the creator of many tools in the same domain, explains the Lambda Architecture and how it can be used to solve problems faced in realtime data systems.
I enjoyed the book. He goes with one theoretical chapter followed by an illustration chapter where he goes into the implementation of the previous one. I read the first 4-5 chapters thoroughly, then read the theoretical ones and skimmed over the practical ones. I guess you'll benefit from it if you're looking for an overview of the concepts and tools used nowadays.
The motivation and concept of the lambda architecture is great. It is also really well explained - in the first chapter. The following chapters did not add much in my eyes and should have been condensed *a lot*. I have not finished the last chapter yet.
The advantage of making data immutable is that even when you make a mistake, you might write bad data but at least you won't destroy good data. This is a much stronger human-fault tolerance guarantee than in a traditional system based on mutation. In a production system, it's inevitable that someone will make a mistake sometime, such as by deploying incorrect code that corrupts values in a database. By building immutability and recomputation into the core of a Big Data system, the system will be innately resilient. In a relational world, you constantly update and summarize your information to reflect the current state, but this approach also limits the number of questions you can answer with your data. Ideally you want to store the rawest data. The rawer the data, the more questions you can ask of it. Storing raw data is hugely valuable because you rarely know in advance all the questions you want answered. By keeping the rawest data possible, you maximize your ability to obtain new insights, while summarizing, overwriting, or deleting information limits what your data can tell you. Unstructured data is rawer than normalized data.
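A small sketch of the contrast (my own illustration; the types and fields are invented, not from the book): an update-in-place record loses history on every write, while an append-only log of timestamped facts keeps the good data even when bad data is written.

```java
import java.util.List;

public class ImmutableFacts {

    // Mutable approach: one row per user, overwritten on every change.
    // Buggy code that writes a wrong location destroys the previous value forever.
    static class UserRow {
        String location;
        void updateLocation(String newLocation) { this.location = newLocation; }
    }

    // Immutable approach: every change is a new, timestamped fact.
    // A mistake adds a bad fact, but the earlier good facts are still there.
    record LocationFact(long userId, String location, long timestamp) {}

    static String currentLocation(List<LocationFact> facts, long userId) {
        return facts.stream()
                    .filter(f -> f.userId() == userId)
                    .max((a, b) -> Long.compare(a.timestamp(), b.timestamp()))
                    .map(LocationFact::location)
                    .orElse(null);
    }
}
```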
Data systems (data pipelines) don't just memorize and regurgitate information; they combine bits and pieces together to produce their answers. Not all bits of information are equal: some information is derived from other pieces of information. When you keep tracing back where information is derived from, you eventually end up at information that's not derived from anything. This is the rawest information you have: data - that special information from which everything else is derived.
Lambda Architecture: the batch layer stores the master copy of the dataset, which is immutable and constantly growing (except when you need to perform garbage collection to delete data units that have low value, implementing data retention policies to control the growth of the master dataset), and precomputes batch views on that master dataset. The master dataset is the source of truth. Errors at the serving and speed layers can be corrected, but corruption of the master dataset is irreparable. The fact-based model stores your raw data as atomic facts, keeps them immutable and eternally true by using timestamps, and ensures each fact is identifiable so that query processing can identify duplicates.
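A sketch of what "each fact is identifiable" buys you (my own illustration; the exact schema is invented): a nonce distinguishes two genuinely distinct but otherwise identical events, and because facts carry identity, deduplicating accidentally re-ingested data becomes a simple set operation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FactModel {

    // An atomic, timestamped fact; the nonce makes otherwise identical events distinct.
    record PageviewFact(long userId, String url, long timestamp, long nonce) {}

    // Reprocessing the same raw data twice may produce duplicate facts;
    // identity makes deduplication trivial at query time.
    static Set<PageviewFact> dedupe(List<PageviewFact> facts) {
        return new HashSet<>(facts);
    }
}
```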
The next step is to load the batch views somewhere so that they can be queried. This is where the serving layer comes in: a specialized distributed database that loads in a batch view and makes it possible to do random reads on it.
The speed layer only looks at recent data, whereas the batch layer looks at all the data at once. The speed layer does incremental computation instead of the recomputation done in the batch layer. The speed layer supports random reads and random writes, which creates complexity around online compaction (as a read/write database receives updates, parts of the disk index become unused, wasted space; periodically the database must perform compaction, a resource-intensive process, to reclaim that space) and concurrency (a read/write database can potentially receive many reads or writes for the same value at the same time, so it needs to coordinate them to prevent returning stale or inconsistent values; sharing mutable state across threads is a notoriously complex problem, and control strategies like locking are notoriously bug-prone).
The beauty of the Lambda architecture is complexity isolation: once data makes it through the batch layer into the serving layer, the corresponding results in the realtime views are no longer needed and can be discarded. Unlike the SQL query planner in the relational database world, here very little magic is happening under the hood, which is actually good, as it leads to more predictable performance.
CAP theorem: When a distributed data system is partitioned, it can be consistent or available but not both. If you choose consistency then sometimes a query will receive an error instead of an answer. If you choose availability then reads may return stale results during network partitions. Eventual consistency means system returns to consistency once the network partition ends.
The batch and serving layers are distributed systems and are subject to the CAP theorem. The only writes in the batch layer are new pieces of immutable data. If data can't be written to the incoming store in the batch layer, it can be buffered locally and retried later. As for the serving layer, reads are always stale due to the high latency of the batch layer. Both the batch and serving layers choose availability over consistency.
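The "buffer locally and retry later" behaviour can be sketched like this (my own illustration; the IncomingStore interface is hypothetical): new immutable data is never rejected, it just waits in a local buffer until the store is reachable again.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferedIngest {

    interface IncomingStore {
        void append(byte[] data) throws Exception; // may fail during a partition
    }

    private final IncomingStore store;
    private final Deque<byte[]> localBuffer = new ArrayDeque<>();

    BufferedIngest(IncomingStore store) {
        this.store = store;
    }

    // Writes are always accepted; if the store is unreachable, data stays buffered.
    void write(byte[] data) {
        localBuffer.addLast(data);
        flush();
    }

    // Retry buffered writes in order; stop at the first failure and try again later.
    void flush() {
        while (!localBuffer.isEmpty()) {
            try {
                store.append(localBuffer.peekFirst());
                localBuffer.removeFirst();
            } catch (Exception e) {
                return; // still partitioned; nothing is lost
            }
        }
    }
}
```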
Synchronous updates are typical among transactional systems that interact with users and require coordination with the user interface. Asynchronous updates are common for analytics-oriented workloads.
It looks like the main complaint of readers who did not like this book is that it is basically a promotion of the Lambda Architecture (developed by the book's authors). Even as those readers are right, they are nevertheless wrong. The book is an introduction to the world of Big Data, and while of course there is more to Big Data than the Lambda Architecture, Lambda is a very decent entry point. Admit it, no book you read is going to have a thorough overview of all existing technologies (and even if you find one trying to do that, it is unlikely to do a good job), so you'll most likely be looking at one particular kind of architecture or another anyway. So just go ahead and read about Lambda. You can always read more books later.
That out of the way, I loved how the book is structured. It is actually not that common to find an intro text which is so well organized and so meticulously covers the topics from both theoretical and practical sides. The authors did an awesome job there. I do wish the examples they chose to illustrate Big Data tasks were more exciting (counting website visits, unique visitors, words... yawn), but that's a minor flaw. The examples themselves are pretty realistic, at least. The book also assumes that readers have more than a basic understanding of Java, so if your Java toolbox is rusty or nonexistent (like mine), following the practical examples might be difficult. Even so, the foundation laid in the theoretical chapters is extremely solid and the theory itself is covered so well that you should not have much difficulty moving on to other books. I now know a whole lot more about Big Data than when I started reading this book, and, more importantly, I have a better understanding of how different pieces contribute to the Big Data ecosystem, and that counts for a lot.
Two main points I liked: having multiple datasets, each aggregated differently, and using them all together for analytics reports; and the last chapter on the exponential impact on performance.
The things I found hard to grasp:
The last chapter presents a formula for estimating how long it will take a cluster double the size to compute your dataset; I suspect the example of doubling the cluster for 54 minutes of dynamic data was wrong (I applied the formula for p = 54 min and it gave me a 91% improvement, not 82%).
In the serving layer chapter, when explaining why a scan is much faster than a seek, I would have gone deeper into the explanations (how data is stored on disk, how it is retrieved for a seek vs. a scan). I found the full picture hard to comprehend.
When discussing how to solve the equiv problem on the dataset, a sampling technique was brought up that seemed to propose a "randomisation" of which pageviews to write to the dataset. I may have misunderstood, but to me that approach would highly compromise the integrity of the data: what is the point of a millisecond query if the data it returns is inaccurate?
In the final chapter the book introduced the concept of partial recomputes for the batch layer and favored them over full recomputes on the master dataset, yet throughout the book I got the impression that the main benefit of the master dataset was its recomputability, since it allows for error correction. Maybe I mix and mingle the concepts.
All in all, I found the book hard to follow if one does not use the technologies chosen to exemplify the concepts, or if one is not already a data engineer.
2/5. It was ok. The book references several cool ideas and practices that people should be familiar with before designing data systems, such as partitioning, bucketing, data modeling using schemas, etc., but in my opinion the book is tied to the technologies that the authors wrote themselves. I would have appreciated more industry-standard tooling for the book, and maybe offloading the code examples into a separate repository and giving people examples in more than one programming language (they're all written in Java). Also, the book contains uncommon terms for established architectures, a couple of examples: - Speed Layer vs. Real-time Stream Processing - Eventual Accuracy vs. Eventual Consistency
I am not sure whether I'd recommend this book to a beginner to the Data Engineering world, like myself, as it can confuse people with its uncommon terminology and uncommon adoption of tools like Pail.
Because of CargoX, I decided to reread it. The more low-level parts are no longer very current - even the concept of the Lambda Architecture has been getting questioned - but the early chapters, where he discusses and defines the modern data-processing problem, made the reading worth it for me.
I'm not sure I would recommend it, though, because the subject hasn't died and there is more up-to-date material, technology-wise, out there.
The theoretical part is a good one. But to be honest I skipped many practical parts because the main concept was clear enough and the practical parts weren't clear at all. Torn-off snippets of code didn't help me get the main idea of the lambda architecture. Now I know how to build a scalable and distributed system. To my mind, this book is worth reading if you are facing problems with high load in your app.
At this moment this is a "classic" position in the landscape. And it did not age well.
That's a bummer, as this is a very good book, very thorough and precise. However, most of the elements (including the newest approach to the Lambda Architecture) have changed dramatically after the introduction of new tools and approaches.
Still, it is a valuable book for people looking for more knowledge in the Big Data space.
Fantastic book written by the creator of Apache Storm, who takes an architectural approach, sprinkled with code snippets, to introduce and elaborate on Lambda Architectures.
Too bad this particular pattern has become an edge case and better options like Kappa are out. Nevertheless, I keep on recommending this book to those new to big data.
Classic example of a book where you can get most of the core information by reading the first few chapters and the last chapter. I think a more suitable title would have been "Tackling Big Data with the Lambda Architecture."
This was my introduction to scalable and real-time systems. The emphasis on presenting the concepts and the illustrations go a long way in helping solidify the learning. Although I am from a .NET background, the Java libraries are extensively explored, along with Nathan's own libraries.
I read this book to sharpen my ability to think about design tradeoffs in the context of large data systems. I got what I expected, and more. Authors Nathan Marz and James Warren introduce their "lambda architecture" using a hypothetical data platform. Their software needs to deliver insights from a massive and continuously growing dataset, and it needs to deliver those insights in a timely fashion to the customer. Those goals are seemingly at odds, since more data means more compute load, and therefore more latency before the customer sees results.
The authors propose splitting the problem in two: a batch layer to consume historical data and build data views from scratch on a regular basis, and a lighter-weight speed layer to process the most recent data, minimizing the user impact of batch processing latency. The authors dive into implementation specifics, and show how achieving speed and resource efficiency at scale requires new ways of thinking. By the end, you'll understand, among other things, how serialization frameworks can help abstract away the complexity of raw data storage, and why multi-consumer queues are preferable to single-consumer queues in a scalable speed layer. The authors explain common design patterns in accessible terms, so you don't have to learn them the hard way (that is, through experience).
The pages go by quickly because of the authors’ compelling and opinionated presentation of concepts. One thread throughout the book is that raw data should be treated as immutable. Incremental, transactional architecture (think: CRUD apps) is not a viable approach to building big data systems because it is sensitive to human error and can land your system in nonsensical states during faults or network partitions. Marz and Warren suggest avoiding update operations altogether. It is preferable, they argue, to start from scratch and recompute everything from your raw data. That way all mistakes can be fixed with a code push.
Reading this book has informed how I think about data systems small and large. However, as with any technical book in a trendy field, keep an eye out for datedness. This book was published in early 2015 (and thus probably written in 2014). Since then there has been momentum toward unifying the batch and speed layers, in particular by using Apache Spark. As a result, the idea of complete separation between batch and speed layers may go out of fashion soon. That said, the real gift of this book is the authors’ well thought out and presented rationale behind the lambda architecture. If you are interested in building or understanding data platforms, it is well worth a read.
Fantastic book for big data beginners. It helps you understand the intricacies of building big data systems. The Lambda Architecture is well explained. It points out a lot of problems faced with monolithic relational databases.
The title of the book by the famous Nathan Marz is just misleading. It is not about Big Data but about Nathan's Lambda Architecture. I've read it from cover to cover, and I have mixed feelings about the book. On one hand he explains a lot of big data concepts, but the rest is about the implementation of his architecture, mostly with tools created by the author. He also focuses too much on his example, which in turn ties the book too closely to one particular idea, and he mentions optimizations that may be acceptable in his solution but not in others. Even though the misleading title encourages any reader to get acquainted with the subject, it should really be a book for experienced devs who already know the mentioned tools and want to learn Nathan's opinions and implementation, or for someone who wants to broaden her or his horizons and knowledge. Otherwise I would turn to another book, and return to this one if I needed to, for example, use Storm, Nathan's framework.
This is a book about Lambda Architecture and how it is used in the context of Big Data. I enjoyed reading about lambda architecture and other related concepts and found them useful since I was a complete beginner in this domain.
The first few chapters are definitely worth the read; the rest of the chapters I consider architecture reference material (too detailed to be remembered and, at the same time, requiring knowledge of the technologies being discussed). There were quite a few moments while reading it when my mind started to wander off...