Cassandra High Performance Cookbook

By: Edward Capriolo

Overview of this book

Apache Cassandra is a fault-tolerant, distributed data store that offers linear scalability, allowing it to serve as a storage platform for large, high-volume websites.

This book provides detailed recipes that describe how to use the features of Cassandra and improve its performance. Recipes cover topics ranging from setting up Cassandra for the first time to complex multiple data center installations. The recipe format presents the information in a concise, actionable form.

The book describes in detail how features of Cassandra can be tuned and what the possible effects of tuning can be. Recipes include how to access data stored in Cassandra and how to use third-party tools to help you out. The book also describes how to monitor a cluster and plan its capacity to ensure it is performing at a high level. Towards the end, it takes you through the use of libraries and third-party applications with Cassandra, and Cassandra integration with Hadoop.

Nodetool cleanup: Removing excess data


When a node is added to a Cassandra cluster, or an existing node is moved to a new position on the token ring, other nodes still retain copies of data they are no longer responsible for. Nodetool cleanup removes the data that does not belong on the node it is run against.

How to do it...

Pass the node's IP address and JMX port to nodetool, followed by the cleanup command.

$ <cassandra_home>/bin/nodetool -h 127.0.0.1 -p 8001 cleanup 

Tip

Keyspace and Column Family are optional arguments

If called with no arguments, cleanup is run on all keyspaces and column families. The keyspace and column family can be specified at the end of the command to limit the data cleaned up.
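The optional arguments described in this tip can be sketched as a small shell helper. This is only an illustration: the function name run_cleanup, the fallback path /opt/cassandra, and the names ks1 and cf1 are hypothetical placeholders, not names from the book.

```shell
#!/bin/sh
# Location of nodetool. The /opt/cassandra fallback is an assumption;
# set CASSANDRA_HOME (or NODETOOL) to match your installation.
NODETOOL="${NODETOOL:-${CASSANDRA_HOME:-/opt/cassandra}/bin/nodetool}"

# Usage: run_cleanup HOST JMX_PORT [KEYSPACE [COLUMN_FAMILY]]
# Anything after the port is passed through to "cleanup" unchanged,
# so omitting it cleans every keyspace and column family.
run_cleanup() {
    host="$1"
    port="$2"
    shift 2
    "$NODETOOL" -h "$host" -p "$port" cleanup "$@"
}

# Example calls (commented out so that sourcing this file runs nothing):
#   run_cleanup 127.0.0.1 8001            # everything on the node
#   run_cleanup 127.0.0.1 8001 ks1        # one keyspace (hypothetical name)
#   run_cleanup 127.0.0.1 8001 ks1 cf1    # one column family (hypothetical names)
```

Limiting the run to a single keyspace or column family reduces how much on-disk data the cleanup has to examine.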

How it works...

Cleanup is a special type of compaction that removes data that does not belong on the node. Cleanup is intensive because it has to examine large portions of the data on disk.
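To make "data that does not belong on the node" concrete, here is a toy model in shell. It is a simplified sketch, not how Cassandra implements cleanup: it assumes small integer tokens, a replication factor of 1, and no wrap-around on the ring. The functions owns and stale_keys are hypothetical helpers. In this model, a node owns every key whose token is greater than its predecessor's token and less than or equal to its own.

```shell
#!/bin/sh
# Toy model only: integer tokens, replication factor 1, no ring wrap-around.

# Usage: owns PREV_TOKEN TOKEN KEY_TOKEN
# Succeeds when KEY_TOKEN falls in the range (PREV_TOKEN, TOKEN].
owns() {
    [ "$3" -gt "$1" ] && [ "$3" -le "$2" ]
}

# Usage: stale_keys PREV_TOKEN TOKEN KEY_TOKEN...
# Prints each key token that the node holds but no longer owns.
stale_keys() {
    prev="$1"
    tok="$2"
    shift 2
    for key in "$@"; do
        owns "$prev" "$tok" "$key" || printf '%s\n' "$key"
    done
}

# The node with token 50 holds keys with tokens 10, 30 and 45.
# While its predecessor had token 0 it owned (0, 50], so all three belonged.
# After a new node joins at token 25, it owns only (25, 50].
stale_keys 25 50 10 30 45   # prints 10
```

In this model the key with token 10 now belongs to the new node, yet it stays on the old node's disk until a cleanup is run there.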

There's more...

There are two situations in which running cleanup is required. They are as follows:

Topology changes

The first and most common reason cleanup...