Learning Cassandra for Administrators

By: Vijay Parthasarathy

Overview of this book

Apache Cassandra is a massively scalable open source NoSQL database. It is well suited to managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers linear scalability and performance across many commodity servers with no single point of failure.

This book starts by explaining how to derive the solution, the basic concepts, and the CAP theorem. You will learn how to install and configure a Cassandra cluster and how to tune it for performance. After reading the book, you should understand why the system works the way it does, and you should be able to recognize patterns (and use cases) as well as anti-patterns that could cause performance degradation. The book also explains how to configure Hadoop, vnodes, and multi-DC clusters, how to enable tracing and various security features, and how to query data from Cassandra.

Starting with the trade-offs, the book gradually moves on to setting up and configuring high-performance clusters. It helps administrators understand the system better by explaining the components of Cassandra’s architecture, so that they can operate the cluster more productively. It discusses use cases and problems, anti-patterns, and potential practical solutions rather than raw techniques. You will learn about the kernel and JVM tuning parameters that can be adjusted to get the most out of system resources.

Chapter 4. Administration and Large Deployments

In this chapter, we will talk about the basic administrative tasks and tools to manage data and its consistency.

Cassandra has three features that help make data consistent:

  • Hinted handoff

  • Manual repair

  • Read repair

Hinted handoff works as follows: if a write does not succeed on a node, or the node cannot complete the write in time, the coordinator stores a hint and replays it later, once the node is back online.
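The idea can be illustrated with a small toy model. This is only a sketch of the concept, not Cassandra's implementation: the `ToyCoordinator` class, its method names, and the in-memory dictionaries standing in for replicas are all hypothetical, and the model ignores timestamps, timeouts, and hint expiry.

```python
from collections import defaultdict


class ToyCoordinator:
    """Toy model of hinted handoff (hypothetical, not Cassandra's code)."""

    def __init__(self, replicas):
        # name -> dict standing in for that replica's local storage
        self.replicas = replicas
        self.up = {name: True for name in replicas}
        # name -> list of (key, value) writes the replica missed
        self.hints = defaultdict(list)

    def write(self, key, value):
        for name, store in self.replicas.items():
            if self.up[name]:
                store[key] = value
            else:
                # The replica is unreachable: keep a hint on the coordinator.
                self.hints[name].append((key, value))

    def node_back_online(self, name):
        # Mark the replica as up and replay every hint stored for it.
        self.up[name] = True
        for key, value in self.hints.pop(name, []):
            self.replicas[name][key] = value


coordinator = ToyCoordinator({"node1": {}, "node2": {}})
coordinator.up["node2"] = False          # simulate node2 going down
coordinator.write("k", "v")              # node1 gets the write, node2 gets a hint
hints_while_down = len(coordinator.hints["node2"])
coordinator.node_back_online("node2")    # the hint is replayed to node2
```

After the replay, `node2` holds the same value as `node1`, and the coordinator no longer keeps any hints for it.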

The downside of this approach shows up when a node that has been down for a long time comes back online: all the other nodes start replaying hints to make it consistent, and the resulting hint replay mutations can overwhelm the node. To avoid this, Cassandra throttles hint replay by replaying a configured number of bytes at a time and waiting for the mutations to be acknowledged; refer to hinted_handoff_throttle_in_kb to tune this number.
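The throttling described above can be sketched as a batching step. This is a simplified illustration under stated assumptions, not Cassandra's actual replay code: the function name `replay_in_batches` and its `throttle_bytes` parameter are hypothetical, and hints are modeled as plain byte strings.

```python
def replay_in_batches(hints, throttle_bytes):
    """Yield lists of hints whose total size stays within throttle_bytes.

    A single hint larger than the limit is still sent, alone in its batch.
    """
    batch, batch_size = [], 0
    for hint in hints:
        # Close the current batch if adding this hint would exceed the limit.
        if batch and batch_size + len(hint) > throttle_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(hint)
        batch_size += len(hint)
    if batch:
        yield batch


# Four hypothetical serialized hints, replayed with a 1 KB limit per batch.
hints = [b"a" * 400, b"b" * 400, b"c" * 400, b"d" * 100]
batches = list(replay_in_batches(hints, throttle_bytes=1024))
```

In a real replay loop, the sender would wait for the recovering node to acknowledge each batch before sending the next one, which is what keeps the node from being flooded.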

To...