Rapid - Apache Mahout Clustering designs

Dataset preparation is the most important task in any machine learning activity. You will not get text or structured data in every use case, so collecting the data into the system where you apply an algorithm is a task in its own right. Data can be collected in different ways, such as:

Pulling the data from a relational database into the Hadoop cluster (using Apache Sqoop)
Continuously streaming data into Hadoop. The Hadoop ecosystem provides many ways to do this, including Flume and Storm
Other ways, such as transferring the data over FTP

In this chapter, we will take a use case in which a continuous stream of data arrives in our system, with Twitter as the source. Based on the users' tweets, we will try to cluster similar users together. In a real-world production scenario, we would use one of the available technologies in the Hadoop ecosystem to collect a live stream of tweets (we can select...
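Clustering users by their tweets suggests one preparation step: turning a stream of individual tweets into a single text document per user. The following is a minimal sketch of that step in plain Python, not the book's method. The function name `tweets_to_user_documents`, the `(user, text)` input shape, and the sample tweets are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def tweets_to_user_documents(tweets):
    """Group (user, text) pairs into one space-joined document per user.

    Hypothetical helper: the input shape is an assumption, not a Twitter
    or Mahout API.
    """
    grouped = defaultdict(list)
    for user, text in tweets:
        # Collapse runs of whitespace (including newlines) to single spaces.
        cleaned = " ".join(text.split())
        if cleaned:  # skip tweets that are empty after cleaning
            grouped[user].append(cleaned)
    # Concatenate each user's tweets in arrival order.
    return {user: " ".join(parts) for user, parts in grouped.items()}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("alice", "Trying out Hadoop today"),
        ("bob", "Football tonight"),
        ("alice", "Mahout clustering\nlooks useful"),
    ]
    for user, document in tweets_to_user_documents(sample).items():
        print(f"{user}\t{document}")
```

Each resulting document could then be written out in whatever format the chosen clustering pipeline expects. Which format that is depends on the tooling, and it is not specified here.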


Preparing the dataset