
Modern Scala Projects

By: Ilango Gurusamy

Overview of this book

Scala is both a functional and an object-oriented programming language designed to express common programming patterns in a concise, readable, and type-safe way. Complete with step-by-step instructions, Modern Scala Projects will guide you in exploring Scala capabilities and learning best practices. Along the way, you'll build applications for professional contexts while understanding the core tasks and components. You'll begin with a project for predicting the class of a flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by tackling projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine. The focus will be on the application of ML techniques that classify data and make predictions, with an emphasis on automating data workflows with the Spark ML pipeline API. The book also showcases the best of Scala's functional libraries and other constructs to help you roll out your own scalable data processing frameworks. By the end of this Scala book, you'll have a firm foundation in Scala programming and have built some interesting real-world projects to add to your portfolio.

Project overview – problem formulation


The intent of this project is to develop an ML workflow, or more accurately, a pipeline. The goal is to solve a classification problem on the Iris dataset, arguably the most famous dataset in the history of data science.

If we see a flower in the wild that we know belongs to one of three Iris species, we have a classification problem on our hands. Given measurements (X) taken from the unknown flower, the task is to learn to predict the species (Y) to which the flower (and its plant) belongs.
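To make this formulation concrete, here is a minimal plain-Scala sketch (hypothetical names, not code from this chapter) that expresses the task as learning a function from measurements (X) to a species label (Y):

    // A sketch of the problem formulation: the classifier we want to learn is
    // simply a function from measurements (X) to one of three species labels (Y).
    object IrisFormulation {

      // X: the four measured features of a single flower
      case class IrisMeasurements(
        sepalLength: Double,
        sepalWidth:  Double,
        petalLength: Double,
        petalWidth:  Double
      )

      // Y: one of the three possible class labels
      sealed trait Species
      case object Setosa      extends Species
      case object Versicolour extends Species
      case object Virginica   extends Species

      // The learning task is to produce a function of this shape from labeled examples
      type Classifier = IrisMeasurements => Species
    }

The rest of the chapter is about learning such a function from labeled examples rather than writing it by hand.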


Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level. While the latter two variables may also be considered in a numerical manner by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of groups.

Analysis of categorical data generally involves the use of data tables. A two-way table presents categorical data by counting the number of observations that fall into each group for two variables, one divided into rows and the other divided into columns. 
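As an aside, Spark SQL can produce such a two-way table directly through the crosstab method on a DataFrame's stat functions. The sketch below uses made-up column names and toy data purely for illustration:

    import org.apache.spark.sql.SparkSession

    // A minimal sketch, assuming a local SparkSession; the column names
    // "ageGroup" and "educationLevel" and the sample rows are made up.
    val spark = SparkSession.builder()
      .appName("TwoWayTableSketch")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val people = Seq(
      ("18-25", "HighSchool"),
      ("18-25", "College"),
      ("26-40", "College"),
      ("26-40", "Graduate"),
      ("41-65", "HighSchool")
    ).toDF("ageGroup", "educationLevel")

    // crosstab counts the observations that fall into each (row, column) group,
    // which is exactly a two-way contingency table
    people.stat.crosstab("ageGroup", "educationLevel").show()

Each cell of the resulting table holds the count of observations that fall into that particular combination of groups.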

In a nutshell, the high-level formulation of the classification problem is given as follows:

High-level formulation of the Iris supervised learning classification problem

Note

In the Iris dataset, the fifth column of each row holds a categorical value: the species name. That value is the label (Y) associated with the row's measurements.
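For illustration only (the values below are representative of the file format rather than quoted from it), each row of the Iris CSV carries four numeric measurements followed by the species label in the fifth column:

    5.1,3.5,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolour
    6.3,3.3,6.0,2.5,Iris-virginica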

The formulation consists of the following:

  • Observed features 
  • Category labels 

Observed features are also known as predictor variables. Their values are measured for each sample and form the inputs (X). Category labels, on the other hand, denote the possible output values (Y) that the target variable can take.

The predictor variables and target labels are as follows:

  • sepal_length: the sepal length, in centimeters, used as an input
  • sepal_width: the sepal width, in centimeters, used as an input
  • petal_length: the petal length, in centimeters, used as an input
  • petal_width: the petal width, in centimeters, used as an input
  • setosa: Iris-setosa, true or false, used as a target
  • versicolour: Iris-versicolour, true or false, used as a target
  • virginica: Iris-virginica, true or false, used as a target

Four feature variables were measured from each sample: the length and the width of the sepals and petals.
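Because the chapter's focus is the Spark ML pipeline API, the following sketch shows one typical way these columns could feed such a pipeline. It assumes the data sits in a headered CSV with the four measurement columns named as above plus a single species string column; the file path, column names, and the choice of RandomForestClassifier are assumptions made for illustration, not the chapter's final implementation:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("IrisPipelineSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical path; any CSV with the assumed columns would do
    val iris = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/iris.csv")

    // Encode the categorical species column (Y) as a numeric label
    val labelIndexer = new StringIndexer()
      .setInputCol("species")
      .setOutputCol("label")

    // Assemble the four predictor variables (X) into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
      .setOutputCol("features")

    // Any Spark ML classifier can sit at the end of the pipeline
    val classifier = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, classifier))
    val model    = pipeline.fit(iris) // learns the X -> Y mapping from the labeled data

The StringIndexer stage turns the categorical species column into the numeric label (Y), and the VectorAssembler stage packs the four predictor columns into the single feature vector (X) that Spark ML estimators expect.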

The total build time for the project should be no more than a day to get everything working. For those new to data science, understanding the background theory, setting up the software, and building the pipeline could take an extra day or two.