Book Image

Apache Kafka Quick Start Guide

By : Raúl Estrada
Book Image

Apache Kafka Quick Start Guide

By: Raúl Estrada

Overview of this book

Apache Kafka is a great open source platform for handling your real-time data pipeline to ensure high-speed filtering and pattern matching on the ?y. In this book, you will learn how to use Apache Kafka for efficient processing of distributed applications and will get familiar with solving everyday problems in fast data and processing pipelines. This book focuses on programming rather than the configuration management of Kafka clusters or DevOps. It starts off with the installation and setting up the development environment, before quickly moving on to performing fundamental messaging operations such as validation and enrichment. Here you will learn about message composition with pure Kafka API and Kafka Streams. You will look into the transformation of messages in different formats, such asext, binary, XML, JSON, and AVRO. Next, you will learn how to expose the schemas contained in Kafka with the Schema Registry. You will then learn how to work with all relevant connectors with Kafka Connect. While working with Kafka Streams, you will perform various interesting operations on streams, such as windowing, joins, and aggregations. Finally, through KSQL, you will learn how to retrieve, insert, modify, and delete data streams, and how to manipulate watermarks and windows.
Table of Contents (10 chapters)

Running Kafka brokers

The real art behind a server is in its configuration. In this section, we will examine how to deal with the basic configuration of a Kafka broker in standalone mode. Since we are learning, at the moment, we will not review the cluster configuration.

As we can suppose, there are two types of configuration: standalone and cluster. The real power of Kafka is unlocked when running with replication in cluster mode and all topics are correctly partitioned.

The cluster mode has two main advantages: parallelism and redundancy. Parallelism is the capacity to run tasks simultaneously among the cluster members. The redundancy warrants that, when a Kafka node goes down, the cluster is safe and accessible from the other running nodes.

This section shows how to configure a cluster with several nodes on our local machine although, in practice, it is always better to have several machines with multiple nodes sharing clusters.

Go to the Confluent Platform installation directory, referenced from now on as <confluent-path>.

As mentioned in the beginning of this chapter, a broker is a server instance. A server (or broker) is actually a process running in the operating system and starts based on its configuration file.

The people of Confluent have kindly provided us with a template of a standard broker configuration. This file, which is called server.properties, is located in the Kafka installation directory in the config subdirectory:

  1. Inside <confluent-path>, make a directory with the name mark.
  1. For each Kafka broker (server) that we want to run, we need to make a copy of the configuration file template and rename it accordingly. In this example, our cluster is going to be called mark:
> cp config/server.properties <confluent-path>/mark/mark-1.properties
> cp config/server.properties <confluent-path>/mark/mark-2.properties
  1. Modify each properties file accordingly. If the file is called mark-1, the broker.id should be 1. Then, specify the port in which the server will run; the recommendation is 9093 for mark-1 and 9094 for mark-2. Note that the port property is not set in the template, so add the line. Finally, specify the location of the Kafka logs (a Kafka log is a specific archive to store all of the Kafka broker operations); in this case, we use the /tmp directory. Here, it is common to have problems with write permissions. Do not forget to give write and execute permissions to the user with whom these processes are executed over the log directory, as in the examples:
  • In mark-1.properties, set the following:
      broker.id=1
port=9093
log.dirs=/tmp/mark-1-logs
  • In mark-2.properties, set the following:
      broker.id=2
port=9094
log.dirs=/tmp/mark-2-logs
  1. Start the Kafka brokers using the kafka-server-start command with the corresponding configuration file passed as the parameter. Don't forget that Confluent Platform must be already running and the ports should not be in use by another process. Start the Kafka brokers as follows:
      > <confluent-path>/bin/kafka-server-start <confluent-
path>/mark/mark-1.properties &

And, in another command-line window, run the following command:

      > <confluent-path>/bin/kafka-server-start <confluent-
path>/mark/mark-2.properties &

Don't forget that the trailing & is to specify that you want your command line back. If you want to see the broker output, it is recommended to run each command separately in its own command-line window.

Remember that the properties file contains the server configuration and that the server.properties file located in the config directory is just a template.

Now there are two brokers, mark-1 and mark-2 , running in the same machine in the same cluster.

Remember, there are no dumb questions, as in the following examples:

Q: How does each broker know which cluster it belongs to?

A: The brokers know that they belong to the same cluster because, in the configuration, both point to the same Zookeeper cluster.

Q: How does each broker differ from the others within the same cluster?

A: Every broker is identified inside the cluster by the name specified in the broker.id property.

Q: What happens if the port number is not specified?

A: If the port property is not specified, Zookeeper will assign the same port number and will overwrite the data.

Q: What happens if the log directory is not specified?

A: If log.dir is not specified, all the brokers will write to the same default log.dir. If the brokers are planned to run in different machines, then the port and log.dir properties might not be specified (because they run in the same port and log file but in different machines).

Q: How can I check that there is not a process already running in the port where I want to start my broker?

A: As shown in the previous section, there is a useful command to see what process is running on specific port, in this case the 9093 port:

> lsof -i :9093

The output of the previous command is something like this:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

java 12529 admin 406u IPv6 0xc41a24baa4fedb11 0t0 TCP *:9093 (LISTEN)

Your turn: try to run this command before starting the Kafka brokers, and run it after starting them to see the change. Also, try to start a broker on a port in use to see how it fails.

OK, what if I want my cluster to run on several machines?

To run Kafka nodes on different machines but in the same cluster, adjust the Zookeeper connection string in the configuration file; its default value is as follows:

zookeeper.connect=localhost:2181

Remember that the machines must be able to be found by each other by DNS and that there are no network security restrictions between them.

The default value for Zookeeper connect is correct only if you are running the Kafka broker in the same machine as Zookeeper. Depending on the architecture, it will be necessary to decide if there will be a broker running on the same Zookeeper machine.

To specify that Zookeeper might run in other machines, do the following:

zookeeper.connect=localhost:2181, 192.168.0.2:2183, 192.168.0.3:2182

The previous line specifies that Zookeeper is running in the local host machine on port 2181, in the machine with IP address 192.168.0.2 on port 2183 , and in the machine with IP address, the 192.168.0.3, on port 2182. The Zookeeper default port is 2181, so normally it runs there.

Your turn: as an exercise, try to start a broker with incorrect information about the Zookeeper cluster. Also, using the lsof command, try to raise Zookeeper on a port in use.

If you have doubts about the configuration, or it is not clear what values to change, the server.properties template (as all of the Kafka project) is open sourced in the following:

https://github.com/apache/kafka/blob/trunk/config/server.properties