Cassandra Data Modeling and Analysis Quotes

Cassandra Data Modeling and Analysis by C.Y. Kan
16 ratings, 3.38 average rating, 3 reviews
“A value of the timeuuid data type is a Type 1 UUID which includes the time of its generation and is sorted by timestamp. It is therefore ideal for use in applications requiring conflict-free timestamps. A valid timeuuid uses the time in 100-nanosecond intervals since 00:00:00.00 UTC, 15 October 1582 (60 bits), a clock sequence number for prevention of duplicates (14 bits), and the IEEE 802 MAC address (48 bits) to generate a unique identifier, for example, 74754ac0-e13f-11e3-a8a3-a92bc9056ee6.”
C.Y. Kan, Cassandra Data Modeling and Analysis
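The three fields named in the quote above can be read back out of the example identifier. The following is a minimal Python sketch using only the standard `uuid` module; the helper name `describe_timeuuid` is hypothetical and is not part of Cassandra or any driver.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUIDs count time from the start of the Gregorian calendar.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def describe_timeuuid(text):
    """Split a version 1 UUID into its timestamp, clock sequence, and node fields."""
    u = uuid.UUID(text)
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    # u.time is the 60-bit count of 100-nanosecond intervals;
    # ten of those intervals make one microsecond.
    generated_at = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)
    return {
        "generated_at": generated_at,
        "clock_seq": u.clock_seq,  # 14-bit clock sequence
        "node": u.node,            # 48-bit node (MAC address) field
    }

info = describe_timeuuid("74754ac0-e13f-11e3-a8a3-a92bc9056ee6")
print(info["generated_at"].isoformat())
print(hex(info["clock_seq"]), hex(info["node"]))
```

Because the timestamp occupies the most significant part of the decoded time value, ordering by that value orders the identifiers by generation time.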
“If no time zone is specified, the time zone of the Cassandra coordinator node handling the write request is used. Therefore, the best practice is to specify the time zone with the timestamp rather than relying on the time zone configured on the Cassandra nodes, to avoid any ambiguities.”
C.Y. Kan, Cassandra Data Modeling and Analysis
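One way to follow this advice on the client side is to refuse to format any timestamp that carries no time zone. The sketch below is an illustration only: the function name `to_cql_timestamp` is hypothetical, and the output layout (`yyyy-mm-dd HH:MM:SS+hhmm`) is assumed to be one of the ISO 8601 style literals that the target Cassandra version accepts.

```python
from datetime import datetime, timedelta, timezone

def to_cql_timestamp(dt):
    """Format an aware datetime with an explicit UTC offset."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # A naive datetime would leave the zone to be guessed elsewhere.
        raise ValueError("refusing to format a naive datetime: no time zone given")
    return dt.strftime("%Y-%m-%d %H:%M:%S%z")

utc_time = datetime(2014, 5, 21, 12, 0, 0, tzinfo=timezone.utc)
hk_time = utc_time.astimezone(timezone(timedelta(hours=8)))

print(to_cql_timestamp(utc_time))  # 2014-05-21 12:00:00+0000
print(to_cql_timestamp(hk_time))   # 2014-05-21 20:00:00+0800
```

Both strings name the same instant, so the stored value does not depend on which node coordinates the write.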
“Because one row can hold as many as 2 billion variable columns.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“The maximum size of a column key is 64 KB, in contrast to 2 GB for a column value.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Wide row: It is common to use wide rows for ordering, grouping and efficient filtering. Besides, you can use skinny rows. All you have to consider is the number of columns the row contains. It is worth noting that for a column family storing skinny rows, the column key is repeatedly stored in each column. Although it wastes some storage space, it is not a problem on inexpensive commodity hard disks.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“In Cassandra, however, sorting is by design because you must determine how to compare data for a column family at the time of its creation. The comparator of the column family dictates how the rows are ordered on reads. Additionally, columns are ordered by their column names, also by a comparator.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“A query is always the starting point of designing a Cassandra data model. As an analogy, a query is a question and the data model is the answer.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“In a distributed system backed by Cassandra, we should minimize unnecessary network traffic as much as possible. In other words, the lesser the number of nodes the query needs to work with, the better the performance of the data model. We must cater to the cluster topology as well as the physical storage of the data model.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“The most important difference is that a relational database models data by relationships whereas Cassandra models data by query.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“No sequence: In a relational database, sequences are usually used to generate unique values for a surrogate key. Cassandra has no sequences because it is extremely difficult to implement in a peer-to-peer distributed system. There are however workarounds, which are as follows:
- Using part of the data to generate a unique key
- Using a UUID
In most cases, the best practice is to select the second workaround.”
C.Y. Kan, Cassandra Data Modeling and Analysis
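Both workarounds can be sketched in a few lines of Python. The function names and the `symbol`/`trade_date` fields are hypothetical, chosen only for illustration; the point is that neither approach needs a central counter shared by all nodes.

```python
import uuid

def key_from_data(symbol, trade_date):
    """Workaround 1: build the key from values already present in the data."""
    return f"{symbol}:{trade_date}"

def key_from_uuid():
    """Workaround 2: generate a random (version 4) UUID on the client."""
    return uuid.uuid4()

print(key_from_data("ABC", "2014-05-21"))
print(key_from_uuid())
```

The first approach is unique only if the chosen fields are unique together, which is one reason the quote favors the UUID.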
“Cassandra has no sequences because it is extremely difficult to implement in a peer-to-peer distributed system.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Instead, it encourages and performs best when the data model is denormalized.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Different names in columns are possible in different rows. That is why Cassandra is both row oriented and column oriented. It should be remarked that there is no timestamp for rows.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Column: Column is the smallest data model element and storage unit in Cassandra. Though it also exists in a relational database, it is a different thing in Cassandra. As shown in the following figure, a column is a name-value pair with a timestamp and an optional Time-To-Live (TTL) value: [Figure not reproduced]”
C.Y. Kan, Cassandra Data Modeling and Analysis
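Since the figure is not reproduced here, the column structure described in the quote can be sketched as a small Python data class. This is a toy model, not Cassandra's storage format; the units (microseconds for the timestamp, seconds for the TTL) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Column:
    name: str
    value: bytes
    timestamp: int             # write time, microseconds since the Unix epoch (assumed unit)
    ttl: Optional[int] = None  # seconds to live; None means the column never expires

    def is_expired(self, now_micros: int) -> bool:
        """A column with a TTL expires once ttl seconds have passed since the write."""
        if self.ttl is None:
            return False
        return now_micros >= self.timestamp + self.ttl * 1_000_000

permanent = Column("email", b"user@example.invalid", timestamp=1_000_000)
session = Column("token", b"abc", timestamp=1_000_000, ttl=60)
print(permanent.is_expired(10**12), session.is_expired(10**12))
```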
“The Map data structure gives efficient key lookup, and the sorted nature provides efficient scans. RowKey is a unique key and can hold a value. The inner SortedMap data structure allows a variable number of ColumnKey values. This is the trick that Cassandra uses to be schemaless and to allow the data model to evolve organically over time. It should be noted that each column has a client-supplied timestamp associated, but it can be ignored during data modeling. Cassandra uses the timestamp internally to resolve transaction conflicts.”
C.Y. Kan, Cassandra Data Modeling and Analysis
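The map-of-sorted-maps structure described above can be imitated with a dictionary whose values keep their column keys in sorted order. The sketch below is a toy in-memory model with hypothetical names (`ColumnFamily`, `put`, `slice`); it also applies a simple "newest timestamp wins" rule as an illustration of how timestamps can settle conflicting writes.

```python
import bisect

class ColumnFamily:
    """Toy model: row key -> columns kept sorted by column key."""

    def __init__(self):
        # row key -> (sorted list of column keys, {column key: (value, timestamp)})
        self._rows = {}

    def put(self, row_key, column_key, value, timestamp):
        keys, cells = self._rows.setdefault(row_key, ([], {}))
        if column_key not in cells:
            bisect.insort(keys, column_key)  # keep column keys ordered
        elif cells[column_key][1] > timestamp:
            return  # an existing write is newer, so keep it
        cells[column_key] = (value, timestamp)

    def get(self, row_key, column_key):
        return self._rows[row_key][1][column_key][0]

    def slice(self, row_key, start, end):
        """Return (column key, value) pairs with start <= key <= end, in key order."""
        keys, cells = self._rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return [(k, cells[k][0]) for k in keys[lo:hi]]

cf = ColumnFamily()
cf.put("sensor-1", "2014-05-03", 21.5, timestamp=3)
cf.put("sensor-1", "2014-05-01", 20.0, timestamp=1)
cf.put("sensor-1", "2014-05-02", 20.7, timestamp=2)
cf.put("sensor-2", "status", "ok", timestamp=1)  # rows may hold different columns
print(cf.slice("sensor-1", "2014-05-01", "2014-05-02"))
```

The row lookup is a hash lookup, while the range scan only touches the contiguous run of column keys that falls inside the requested bounds.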
“Conversely, Cassandra is designed to work in a massive-scale, distributed environment in which ACID compliance is difficult to achieve, and replication is a must.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Features of Cassandra: In order to keep this chapter short, the following bullet list covers the great features provided by Cassandra:
- Written in Java and hence providing native Java support
- Blend of Google BigTable and Amazon Dynamo
- Flexible schemaless column-family data model
- Support for structured and unstructured data
- Decentralized, distributed peer-to-peer architecture
- Multi-data center and rack-aware data replication
- Location transparent
- Cloud enabled
- Fault-tolerant with no single point of failure
- An automatic and transparent failover
- Elastic, massively, and linearly scalable
- Online node addition or removal
- High Performance
- Built-in data compression
- Built-in caching layer
- Write-optimized
- Tunable consistency providing choices from very strong consistency to different levels of eventual consistency
- Provision of Cassandra Query Language (CQL), a SQL-like language imitating INSERT, UPDATE, DELETE, SELECT syntax of SQL
- Open source and community-driven”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Another repair mechanism is called anti-entropy which is a replica synchronization mechanism to ensure up-to-date data on all nodes and is run by the administrators manually.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“Hinted handoff aims at reducing the time to restore a failed node when rejoining the cluster. It ensures absolute write availability by sacrificing a bit of read consistency. If a replica is down at the time a write occurs, another healthy replica stores a hint. Even worse, if all the relevant replicas are down, the coordinator stores the hint locally. The hint basically contains the location of the failed replica, the affected row key, and the actual data that is being written. When a node responsible for the token range is up again, the hint will be handed off to resume the write. As such, the update cannot be read before a complete handoff, leading to inconsistent reads.”
C.Y. Kan, Cassandra Data Modeling and Analysis
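The hint lifecycle in the quote above (store a hint while a replica is down, hand it off when the replica returns) can be simulated in a few lines. This is a deliberately simplified sketch: every name is hypothetical, the hint is always kept by the coordinator, and real hints also expire, which is not modeled here.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node name, row key, value) held for down replicas

def write(coordinator, replicas, key, value):
    """Write to every live replica; keep a hint for each replica that is down."""
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            coordinator.hints.append((replica.name, key, value))

def replay_hints(holder, recovered):
    """Hand stored hints off to a replica that has come back up."""
    remaining = []
    for target, key, value in holder.hints:
        if target == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((target, key, value))
    holder.hints = remaining

a, b, c = Node("a"), Node("b"), Node("c")
b.up = False
write(a, [b, c], "user:1", "alice")
print("user:1" in b.data)  # the down replica cannot serve the update yet
b.up = True
replay_hints(a, b)
print(b.data["user:1"])
```

Between the write and the replay, a read served only by the recovered replica would miss the update, which is the inconsistency the quote describes.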
“It also checks all the remaining replicas in the background. If a replica is found to be inconsistent, the coordinator will issue an update to bring back the consistency. This mechanism is called read repair.”
C.Y. Kan, Cassandra Data Modeling and Analysis
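Read repair can be pictured as "pick the newest version, then push it to any replica that disagrees". The following sketch models each replica as a dictionary of `key -> (value, timestamp)`; the function name is hypothetical, and the repair runs inline here rather than in the background as the quote describes.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and bring stale replicas up to date."""
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda cell: cell[1])  # highest timestamp wins
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[0]

r1 = {"user:1": ("alice", 10)}
r2 = {"user:1": ("alicia", 20)}
r3 = {}
print(read_with_repair([r1, r2, r3], "user:1"))
print(r1, r3)
```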
“When a read request comes in to a node, the data to be returned is merged from all the related SSTables and any unflushed memtables. Timestamps are used to determine which one is up-to-date. The merged value is also stored in a write-through row cache to improve the future read performance.”
C.Y. Kan, Cassandra Data Modeling and Analysis
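The merge step described in this quote can be sketched as a scan over every table that might hold the key, keeping the cell with the newest timestamp. The names below are hypothetical, each table is modeled as a plain dictionary of `key -> (value, timestamp)`, and the row cache is reduced to a dictionary that is filled after the merge.

```python
def merge_read(key, memtables, sstables, row_cache):
    """Merge a key from memtables and SSTables, newest timestamp wins."""
    if key in row_cache:
        return row_cache[key]
    newest = None
    for table in list(memtables) + list(sstables):
        cell = table.get(key)
        if cell is not None and (newest is None or cell[1] > newest[1]):
            newest = cell
    if newest is None:
        return None
    row_cache[key] = newest[0]  # remember the merged value for later reads
    return newest[0]

memtable = {"user:1": ("alicia", 30)}
sstable_old = {"user:1": ("alice", 10), "user:2": ("bob", 5)}
sstable_new = {"user:1": ("ally", 20)}
cache = {}
print(merge_read("user:1", [memtable], [sstable_old, sstable_new], cache))
print(merge_read("user:2", [memtable], [sstable_old, sstable_new], cache))
```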
“On the flip side, the following figure shows the components and their sequence of executions that form a read path: [Figure: Cassandra read path]”
C.Y. Kan, Cassandra Data Modeling and Analysis
“For write operations, Cassandra supports tunable consistency by various write consistency levels. The write consistency level is the number of replicas that acknowledge a successful write. It is tunable on a spectrum of write consistency levels, as shown in the following figure: [Figure: Cassandra write consistency levels]”
C.Y. Kan, Cassandra Data Modeling and Analysis
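Because the figure is not reproduced, the idea of "number of replicas that must acknowledge" can be illustrated with a small lookup. The sketch covers only a few commonly documented levels (ONE, TWO, THREE, QUORUM, ALL) and is not taken from the book's figure; the function name `required_acks` is hypothetical. A quorum is a majority of the replicas, that is, floor(RF / 2) + 1.

```python
def required_acks(level, replication_factor):
    """Replica acknowledgements needed before a write is reported as successful."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    level = level.upper()
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = replication_factor // 2 + 1  # strict majority
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"level not covered by this sketch: {level}")
    if needed > replication_factor:
        raise ValueError("level needs more replicas than the replication factor")
    return needed

for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, 3))
```

Lower levels favor availability and latency, while higher levels favor consistency, which is the spectrum the quote refers to.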
“Bloom filter: Bloom filter is a sample subset of the primary index with very fast nondeterministic algorithms to check whether an element is a member of a set. It is used to boost the performance.”
C.Y. Kan, Cassandra Data Modeling and Analysis
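In general terms, a Bloom filter is a probabilistic set: it can answer "definitely not present" or "possibly present", never a false negative. The sketch below is a generic textbook-style implementation in Python, not Cassandra's own; the sizes and the SHA-256 based hashing are arbitrary choices for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))
```

A negative answer lets a read skip a data file entirely, which is how such a filter can boost read performance.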
“The commitlog is purged after the flush.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“The following figure depicts the components and their sequence of executions that form a write path: [Figure: Cassandra write path]”
C.Y. Kan, Cassandra Data Modeling and Analysis
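With the figure missing, the sequence can be approximated in code: append to the commitlog, update the memtable, flush the memtable to an immutable SSTable once it is full, then purge the commitlog (as the preceding quote states). The class below is a toy model with hypothetical names, and the flush threshold counts keys rather than bytes.

```python
class WritePath:
    def __init__(self, flush_threshold=3):
        self.commitlog = []   # append-only log, for durability
        self.memtable = {}    # in-memory, latest value per key
        self.sstables = []    # flushed, sorted, never modified afterwards
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commitlog.append((key, value))  # 1. log the write first
        self.memtable[key] = value           # 2. then apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                     # 3. flush when the memtable is full

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commitlog.clear()               # 4. flushed data no longer needs the log

wp = WritePath(flush_threshold=2)
wp.write("b", 1)
wp.write("a", 2)  # reaches the threshold and triggers a flush
print(wp.sstables, wp.memtable, wp.commitlog)
```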
“Cassandra uses a very efficient algorithm, called Phi Accrual Failure Detection Algorithm, to detect the failure of a node. The idea of the algorithm is that the failure detection is not represented by a Boolean value stating whether a node is up or down. Instead, the algorithm outputs a value on the continuous suspicion level between dead and alive, on how confident it is that the node has failed.”
C.Y. Kan, Cassandra Data Modeling and Analysis
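The suspicion level can be made concrete with a small calculation. The sketch below assumes heartbeat intervals follow an exponential distribution, so the probability that a heartbeat arrives later than `t` is `exp(-t / mean)` and `phi = -log10` of that probability. The function names and the threshold value of 8 are illustrative assumptions, not a statement of how any particular Cassandra version is implemented or configured.

```python
import math

def phi(time_since_last_heartbeat, mean_interval):
    """Suspicion level: -log10 of the chance the heartbeat is merely late."""
    if mean_interval <= 0:
        raise ValueError("mean_interval must be positive")
    # -log10(exp(-t / mean)) simplifies to t / (mean * ln 10).
    return time_since_last_heartbeat / (mean_interval * math.log(10))

def is_suspected(time_since_last_heartbeat, mean_interval, threshold=8.0):
    return phi(time_since_last_heartbeat, mean_interval) > threshold

for silence in (1.0, 5.0, 20.0):
    print(silence, round(phi(silence, mean_interval=1.0), 2))
```

The output grows continuously with the length of the silence, so the caller chooses how much suspicion is enough to treat the node as down.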
“The replication factor per data center is set to three here. With two data centers, there are six replicas in total.”
C.Y. Kan, Cassandra Data Modeling and Analysis
“The node location can be determined by the rack and data center with reference to the node's IP address.”
C.Y. Kan, Cassandra Data Modeling and Analysis
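One convention for deriving location from an IP address reads the second octet as the data center and the third octet as the rack. The sketch below assumes that convention for IPv4 addresses; the function name `infer_location` is hypothetical and the quote itself does not say which mapping the author has in mind.

```python
import ipaddress

def infer_location(ip):
    """Assumed convention: a.DC.RACK.NODE for an IPv4 address."""
    octets = ipaddress.IPv4Address(ip).packed  # four raw bytes
    return {"data_center": octets[1], "rack": octets[2], "node": octets[3]}

print(infer_location("10.20.30.40"))
```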
“A snitch determines which data centers and racks to go for in order to make Cassandra aware of the network topology for routing the requests efficiently.”
C.Y. Kan, Cassandra Data Modeling and Analysis
