Rapid - Apache Mahout Clustering designs


Preparing the dataset


Dataset preparation is the most important task in any machine learning activity. Not every use case hands you ready-made text or structured data, so collecting the data into the system where you will apply an algorithm is a task in its own right. Data can be collected in different ways, such as:

  • Pulling the data from a relational database into the Hadoop cluster (using Apache Sqoop)

  • Continuously streaming data into Hadoop. The Hadoop ecosystem provides many ways to do this; examples include Flume, Storm, and so on

  • Other ways include getting data using FTP, and so on

In this chapter, we will pick a use case where a continuous stream of data arrives in our system. The use case comes from Twitter: based on the users' tweets, we will try to cluster similar users together. In a real-world production scenario, we would use one of the available technologies in the Hadoop ecosystem to collect a live stream of tweets (we can select...
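
To make the "cluster similar users by their tweets" idea more concrete, here is a minimal sketch of one possible preparation step: merging every tweet from the same user into a single text document, so that each user becomes one item to cluster. The class `TweetGrouper`, the `Tweet` type, and the sample data are hypothetical names introduced only for this illustration. They are not part of Mahout or of the book's code, and a real pipeline would read the tweets from the collected stream rather than from an in-memory list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only: builds one document per user.
public class TweetGrouper {

    /** Minimal stand-in for a collected tweet: who wrote it and what it says. */
    public static final class Tweet {
        final String user;
        final String text;

        public Tweet(String user, String text) {
            this.user = user;
            this.text = text;
        }
    }

    /**
     * Concatenates each user's tweets into a single lowercase document.
     * Tweets with a missing user or blank text are skipped.
     * Users appear in the order in which they are first seen.
     */
    public static Map<String, String> groupByUser(List<Tweet> tweets) {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        for (Tweet t : tweets) {
            if (t.user == null || t.text == null || t.text.trim().isEmpty()) {
                continue;
            }
            StringBuilder sb = buffers.computeIfAbsent(t.user, k -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append(' '); // separate consecutive tweets with a space
            }
            sb.append(t.text.trim().toLowerCase(Locale.ROOT));
        }

        Map<String, String> documents = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            documents.put(e.getKey(), e.getValue().toString());
        }
        return documents;
    }

    public static void main(String[] args) {
        // Made-up sample data; a real run would use collected tweets.
        List<Tweet> tweets = new ArrayList<>();
        tweets.add(new Tweet("alice", "Learning Hadoop today"));
        tweets.add(new Tweet("bob", "Great football match"));
        tweets.add(new Tweet("alice", "Trying out clustering"));

        // Print one tab-separated line per user: user id, then the merged text.
        groupByUser(tweets).forEach((user, doc) -> System.out.println(user + "\t" + doc));
    }
}
```

Merging tweets per user is only one possible design choice. It treats a user's whole timeline as a single bag of words, which is simple but discards the order and the time of individual tweets.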