The first step in developing a machine-learning pipeline is to get the data into a place from which we can feed it to the training algorithm. In this case study, we will use Kafka as the source of the training data.
For this, we will write a Kafka producer that streams 80 percent of the data in the data file to the Kafka broker. The remaining 20 percent of the data will be stored in a file, which we will use to test the clustering model created by our topology.
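The 80/20 split described above can be sketched as follows. This is a minimal sketch using only the Java standard library; the class and method names and the sample records are our own illustration, not code from the case study:

```java
import java.util.ArrayList;
import java.util.List;

public class TrainTestSplit {
    // Splits records in file order: the first `ratio` fraction is
    // destined for the Kafka producer, the remainder for the test file.
    // The 0.8 ratio follows the 80/20 split described in the text.
    public static List<List<String>> split(List<String> records, double ratio) {
        int cut = (int) (records.size() * ratio);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(records.subList(0, cut)));              // training portion
        parts.add(new ArrayList<>(records.subList(cut, records.size()))); // testing portion
        return parts;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        List<List<String>> parts = split(records, 0.8);
        // With 10 records and a 0.8 ratio, 8 go to training and 2 to testing
        System.out.println(parts.get(0).size() + " " + parts.get(1).size());
    }
}
```

In the actual producer, the training portion would be published to the Kafka broker and the testing portion written to a local file.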
We will be creating a Maven project for publishing data into Kafka. The following are the steps for creating the producer:
1. Create a new Maven project with the group ID com.learningstorm and the artifact ID ml-kafka-producer.
2. Add the following dependency for Kafka in the pom.xml file:

   <!-- Apache Kafka Dependency -->
   <dependency>
     <groupId>org.apache.kafka</groupId>
     <artifactId>kafka_2.10</artifactId>
     <version>0.8.0</version>
     ...
   </dependency>
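With the dependency in place, the producer needs a configuration before it can send messages. The following sketch builds the Properties object expected by the old (0.8) Kafka producer API; the broker address localhost:9092 is an assumption for illustration, not a value taken from the case study:

```java
import java.util.Properties;

public class ProducerConfigBuilder {
    // Builds the Properties consumed by Kafka 0.8's ProducerConfig.
    // The broker address passed by the caller is an assumption.
    public static Properties buildConfig(String brokerList) {
        Properties props = new Properties();
        // Comma-separated list of brokers used to bootstrap metadata
        props.put("metadata.broker.list", brokerList);
        // Encode message payloads as strings
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // Wait for the partition leader to acknowledge each write
        props.put("request.required.acks", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig("localhost:9092");
        System.out.println(props.getProperty("metadata.broker.list"));
    }
}
```

From these properties, the producer itself would be constructed as `new Producer<String, String>(new ProducerConfig(props))` and each record sent as a `KeyedMessage`, following the Kafka 0.8 Java API.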