Rapid - Apache Mahout Clustering designs

Working with CSV files


A common problem when using Mahout algorithms is how to handle input files in CSV, TSV, or a similar delimited format. As before, the main challenge is converting the files into the vector format. Once that is done, the rest of the process is the same as described previously. Let's look at code that takes a CSV file and writes it out in the vector format that Mahout can use:

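// Imports needed by this method
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public String getSeqFile(String inputLocation) throws Exception {
  // Location where you want to save the output
  String outputPath = "<output path>";
  FileSystem fs = FileSystem.get(getConfiguration());
  Path vecoutput = new Path(outputPath);
  // Sequence file with Text keys and VectorWritable values
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, getConfiguration(), vecoutput, Text.class, VectorWritable.class);
  VectorWritable vec = new VectorWritable();
  try {
    // The file reader takes the input location as its input.
    FileReader fr = new FileReader(inputLocation);
    BufferedReader br = new BufferedReader(fr)...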
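
The listing above breaks off at the point where the CSV lines are read. As a rough illustration of that parsing step only, here is a minimal sketch in plain Java that turns each CSV line into an array of doubles. The class name CsvRows and the methods parseLine and readRows are hypothetical and are not taken from the book or from Mahout. The sketch assumes every field is numeric, that there is no header row, and that no field contains a quoted comma.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Mahout or of the book's listing.
class CsvRows {

    // Splits one comma-separated line into numeric features.
    // Assumption: every field is numeric and no field contains a quoted comma.
    static double[] parseLine(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Reads every non-blank line from the source, producing one feature array per row.
    static List<double[]> readRows(Reader source) {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                rows.add(parseLine(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // In-memory CSV used in place of a real file for this sketch.
        List<double[]> rows = readRows(new StringReader("1.0,2.5,3\n4,5,6.5\n"));
        System.out.println(rows.size() + " rows parsed");
    }
}
```

In a Mahout job, each parsed row could then plausibly be wrapped in a vector (for example, a DenseVector built from the double array), set on the VectorWritable, and appended to the SequenceFile.Writer with a Text key. That wiring is an assumption about how the truncated listing continues, not a reproduction of it. For a TSV file, the same sketch would apply with the delimiter in split changed to a tab.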