Learning Cascading

Project scope – understanding requirements


As in most real-life situations, this project involves both structured and unstructured data. Unstructured and semi-structured data includes media articles, press releases, trade literature, blog posts, tweets, and so on. Unstructured files can reach a researcher as text, PDF, Word, HTML, and many other formats. Structured data usually comes as delimiter-separated text files (most often comma-separated, CSV, or tab-separated, TSV), with or without a header. Cascading can use these structured files as they are, but unstructured data needs preprocessing.
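To make the "with or without a header" distinction concrete, here is a minimal plain-Java sketch of splitting delimited lines into columns and rows. It is only an illustration and not Cascading's API: the class `DelimitedText`, its `parse` method, the generated `col0`, `col1`, ... names, and the sample data are all hypothetical, and the sketch assumes fields contain no quoted or escaped delimiters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Cascading): splits delimited text into named columns. */
class DelimitedText {

    /** Column names (taken from the header or generated) plus the data rows. */
    static final class Table {
        final List<String> columns;
        final List<List<String>> rows;

        Table(List<String> columns, List<List<String>> rows) {
            this.columns = columns;
            this.rows = rows;
        }
    }

    /**
     * Parses lines separated by the given delimiter, for example "," for CSV or "\t" for TSV.
     * Assumption: no quoting or escaping, so a delimiter inside a field is not supported.
     */
    static Table parse(List<String> lines, String delimiter, boolean hasHeader) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
            rows.add(Arrays.asList(line.split(Pattern.quote(delimiter), -1)));
        }

        List<String> columns;
        if (hasHeader && !rows.isEmpty()) {
            // The first non-blank line names the columns and is not a data row.
            columns = rows.remove(0);
        } else {
            // No header: generate positional names from the width of the first row.
            columns = new ArrayList<>();
            int width = rows.isEmpty() ? 0 : rows.get(0).size();
            for (int i = 0; i < width; i++) {
                columns.add("col" + i);
            }
        }
        return new Table(columns, rows);
    }

    public static void main(String[] args) {
        Table csv = parse(Arrays.asList("id,name", "1,Ada", "2,Grace"), ",", true);
        System.out.println(csv.columns + " -> " + csv.rows);

        Table tsv = parse(Arrays.asList("1\tAda", "2\tGrace"), "\t", false);
        System.out.println(tsv.columns + " -> " + tsv.rows);
    }
}
```

The only difference between the two cases is whether the first line is consumed as column names or treated as data, which is why a header flag and a delimiter are enough to describe such a file.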

The steps that we used to preprocess our unstructured data are:

  1. First, convert unstructured files of different formats into text. You can write your own utility to do this or download one from the Web. This book does not provide a conversion utility, since that is outside its scope.

  2. Then normalize the unstructured data. For this project, after we've...