Apache Cassandra Essentials

By: Nitin Padalia

Overview of this book

Apache Cassandra Essentials takes you step by step from the basics of installation to advanced installation options and database design techniques. It gives you all the information you need to design a well-distributed, high-performance database. You’ll learn about the steps that a Cassandra node performs when you execute a read/write query, which is essential for properly maintaining a Cassandra cluster and debugging issues. Next, you’ll discover how to integrate a Cassandra driver into your applications and perform read/write operations. Finally, you’ll learn about the tools Cassandra provides for serviceability aspects such as logging, metrics, backup, and recovery.

Installation


Apache provides source and binary tarballs as well as Debian packages. Third-party vendors such as DataStax also provide an MSI installer, Linux RPM and Debian packages, and UNIX and Mac OS X binaries in the form of a community edition, which is a free packaged distribution of Apache Cassandra by DataStax. Here, we'll cover installation from the source tarball and from the binary tarball.

Prerequisites

The following are the prerequisites for installing Cassandra:

  • Hardware requirements: Cassandra employs various caching techniques to enable ultra-fast read operations, so more memory lets it cache more data and generally leads to better performance. A minimum of 4 GB of memory is recommended for development environments and a minimum of 8 GB for production environments. If our data set is larger, we should consider increasing the memory available to Cassandra. We'll discuss tuning Cassandra memory in later chapters.

    As with memory, more CPU cores help Cassandra perform better because it performs its tasks concurrently. For bare-metal hardware, 8-core servers are recommended. For virtualized machines, it's recommended that the CPU cycles allocated to a machine can grow on demand; for example, some vendors such as Rackspace and Amazon use CPU bursting.

    For development environments, a single-disk machine is enough; production machines should ideally have at least two disks. One disk is used for the commitlog and the other for the data files, called SSTables, so that the two kinds of I/O don't contend with each other. Cassandra uses the commitlog file to make write requests durable: every write request is first written to this file in append-only mode and to an in-memory representation of the column family called a memtable.

  • Java: Cassandra can run on Oracle/Sun JVM, OpenJDK, and IBM JVM. The current stable version of Cassandra requires Java 7 or later. If you have multiple Java versions on your machine, set your JAVA_HOME environment variable to the correct one.

  • Python: The current version of Cassandra requires Python 2.6 or above. Cassandra tools, such as cqlsh, are based on Python.

  • Firewall configurations: Since we are setting up a cluster, let's see which ports are used by Cassandra on various interfaces. If the firewall blocks these ports because we fail to configure them, then our cluster won't function properly. For example, if the internode communication port is being blocked, then nodes will not be able to join the cluster.

    Let's have a look at the following table:

    Port/Protocol | Configuration file | Configuration name    | Firewall setting                                  | Description
    --------------+--------------------+-----------------------+---------------------------------------------------+------------
    7000/tcp      | cassandra.yaml     | storage_port          | Open among nodes in the cluster                   | It acts as an internode communication port in a Cassandra cluster.
    7001/tcp      | cassandra.yaml     | ssl_storage_port      | Open among nodes in the cluster                   | It is an SSL port for encrypted communication among cluster nodes.
    9042/tcp      | cassandra.yaml     | native_transport_port | Between the Cassandra client and the cluster      | Cassandra clients, for example cqlsh, or clients using the Java driver use this port to communicate with the Cassandra server.
    9160/tcp      | cassandra.yaml     | rpc_port              | The Thrift client and the Cassandra cluster       | Thrift uses this port for client connections.
    7199/tcp      | cassandra-env.sh   | JMX_PORT              | Between the JMX console and the Cassandra cluster | It acts as a JMX console port for monitoring the Cassandra server.

  • Clock synchronization: Since Cassandra depends heavily on timestamps for data consistency, all nodes of our cluster should be time synchronized, and we should verify that they are. One method for time synchronization is to configure NTP on each node. NTP (Network Time Protocol) is a widely used protocol for clock synchronization of computers over a network.
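
As a rough illustration of the firewall prerequisite, the following sketch prints (but does not run) one iptables rule per port listed above. The use of iptables, the function names, and the three subnets (NODE_NET, CLIENT_NET, ADMIN_NET) are assumptions made for this example; adapt them to your own firewall tool and network layout.

```shell
#!/bin/sh
# Sketch only: prints iptables commands for review; it does not change any firewall.
# The subnets below are hypothetical placeholders, not values from Cassandra.
NODE_NET=${NODE_NET:-10.0.0.0/24}      # other Cassandra nodes
CLIENT_NET=${CLIENT_NET:-10.0.1.0/24}  # application and cqlsh clients
ADMIN_NET=${ADMIN_NET:-10.0.2.0/24}    # hosts running a JMX console

# allow_rule SOURCE PORT: print one rule accepting TCP traffic from SOURCE to PORT.
allow_rule() {
    echo "iptables -A INPUT -p tcp -s $1 --dport $2 -j ACCEPT"
}

print_cassandra_rules() {
    for port in 7000 7001; do      # storage_port, ssl_storage_port
        allow_rule "$NODE_NET" "$port"
    done
    for port in 9042 9160; do      # native_transport_port, rpc_port
        allow_rule "$CLIENT_NET" "$port"
    done
    allow_rule "$ADMIN_NET" 7199   # JMX_PORT
}

print_cassandra_rules
```

Printing the rules first makes it easy to review them before applying anything with root privileges.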

Compiling Cassandra from source and installing

This method of installation is less commonly used. One case where we might use it is when we're doing optimization work on Cassandra itself. We'll need JDK 1.7 and Ant 1.8, or later versions, to compile the Cassandra code. We can either clone the Cassandra Git repository or use the source tarball. Git client 1.7 or later is required for cloning the Git repository.
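
One possible way to confirm that the build tools meet these minimum versions is sketched below. The helper names (version_ge, first_version, check_tool) are made up for this example, and the comparison assumes a sort command that supports the -V (version sort) option, as GNU coreutils does.

```shell
#!/bin/sh
# Sketch only: compares dotted version strings using "sort -V" (assumed available).

# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# first_version: print the first dotted number (such as 1.7.0) found on stdin.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# check_tool NAME MINIMUM COMMAND...: run COMMAND and compare its reported version.
check_tool() {
    name=$1
    min=$2
    shift 2
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$name: not found"
        return 0
    fi
    # Some tools (java, for example) print their version on stderr.
    found=$("$@" 2>&1 | first_version)
    if version_ge "$found" "$min"; then
        echo "$name: $found (ok, need >= $min)"
    else
        echo "$name: $found (too old, need >= $min)"
    fi
}

check_tool java 1.7 java -version
check_tool ant  1.8 ant -version
check_tool git  1.7 git --version
```

A tool that is not installed is reported as "not found" rather than treated as an error, so the script can be run on a fresh machine.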

To obtain the latest source code from Git, use the following command:

$ git clone git://git-wip-us.apache.org/repos/asf/cassandra.git Cassandra

For a specific branch, use the following command:

$ git clone -b cassandra-<version> git://git-wip-us.apache.org/repos/asf/cassandra.git

Use this command for version 2.1.2:

$ git clone -b cassandra-2.1.2 git://git-wip-us.apache.org/repos/asf/cassandra.git

Then, use the ant command to build the code:

$ ant

Alternatively, if a proxy is needed to connect to the Internet, use the autoproxy flag:

$ ant -autoproxy

or

$ export ANT_OPTS="-Dhttp.proxyHost=<your-proxy-host> -Dhttp.proxyPort=<your-proxy-port>"

Installation from a precompiled binary

Download a binary tarball from the Apache website and extract it using the following command. Here, we will extract it into the /opt directory:

$ tar xzf apache-cassandra-<Version>.bin.tar.gz -C /opt

Consider the following example:

$ tar xzf apache-cassandra-2.1.2.bin.tar.gz -C /opt

Optionally, as a best practice, you can create a soft link to the installation directory (run this from /opt), which helps when you later need to change the installation location:

$ ln -s apache-cassandra-2.1.2 cassandra
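
To illustrate why the soft link helps, here is a small sketch that repoints a cassandra link from one release directory to another. It works in a scratch directory created with mktemp, so nothing under /opt is touched; the switch_release helper and the 2.1.3 directory are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch only: works in a scratch directory instead of /opt.
demo_root=$(mktemp -d)
# 2.1.3 stands in for a hypothetical newer release.
mkdir "$demo_root/apache-cassandra-2.1.2" "$demo_root/apache-cassandra-2.1.3"

# switch_release ROOT VERSION: make ROOT/cassandra point at apache-cassandra-VERSION.
# -f replaces an existing link; -n stops ln from following the old link into its target.
switch_release() {
    ( cd "$1" && ln -sfn "apache-cassandra-$2" cassandra )
}

switch_release "$demo_root" 2.1.2
readlink "$demo_root/cassandra"   # prints apache-cassandra-2.1.2

switch_release "$demo_root" 2.1.3
readlink "$demo_root/cassandra"   # prints apache-cassandra-2.1.3
```

Scripts and environment variables that refer to the cassandra link keep working after the switch, because only the link target changes.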

The Cassandra installation layout may be different based on your type of installation. If you're installing using Debian or an RPM package, then the installation creates the required directories and applies the required permissions.

In older versions of Cassandra, you might need to create the Cassandra data and log directories before running it; by default, these are /var/lib/cassandra and /var/log/cassandra. Cassandra will fail to start if the user running it doesn't have permissions for these paths. You can create the directories and set permissions as shown here:

$ sudo mkdir -p /var/log/cassandra
$ sudo chown -R `whoami` /var/log/cassandra
$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R `whoami` /var/lib/cassandra
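
Before starting Cassandra, a quick check along the following lines can confirm that both directories exist and are writable by the current user. This is only a sketch: the check_dir helper is a name invented for this example, and the paths are the defaults mentioned above.

```shell
#!/bin/sh
# Sketch only: reports the state of each directory; it does not create or modify anything.

# check_dir DIR: report whether DIR exists and is writable by the current user.
check_dir() {
    if [ ! -d "$1" ]; then
        echo "$1: missing"
    elif [ ! -w "$1" ]; then
        echo "$1: not writable by $(id -un)"
    else
        echo "$1: ok"
    fi
}

for dir in /var/lib/cassandra /var/log/cassandra; do
    check_dir "$dir"
done
```

If a directory is reported as missing or not writable, the mkdir and chown commands shown earlier should resolve it.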