Scala Machine Learning Projects

Overview of this book

Machine learning has had a huge impact on academia and industry by turning data into actionable information. Scala has seen a steady rise in adoption over the past few years, especially in the fields of data science and analytics. This book is for data scientists, data engineers, and deep learning enthusiasts who have a background in complex numerical computing and want to learn more about hands-on machine learning application development. If you're well versed in machine learning concepts and want to expand your knowledge by delving into the practical implementation of these concepts using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will become acquainted with popular machine learning libraries such as Spark ML, H2O, DeepLearning4j, and MXNet. By the end, you will be able to use numerical computing and functional programming to carry out complex numerical tasks and to develop, build, and deploy research or commercial projects in a production-ready environment.

Preface

Machine learning has made a huge impact on academia and industry by turning data into actionable intelligence. Scala, meanwhile, has seen a steady rise in adoption over the last few years, especially in the fields of data science and analytics. This book has been written for data scientists, data engineers, and deep learning enthusiasts who have a solid background in complex numerical computing and want to learn more about hands-on machine learning application development.

So, if you're well-versed in machine learning concepts and want to expand your knowledge by delving into practical implementations using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will be acquainted with popular machine learning libraries such as Spark ML, H2O, Zeppelin, DeepLearning4j, and MXNet.

After reading this book and working through all of the projects, you will be able to apply numerical computing, deep learning, and functional programming to carry out complex numerical tasks. You can thus develop, build, and deploy research and commercial projects in a production-ready environment.

This book isn’t meant to be read cover to cover. You can turn to any chapter that matches something you’re trying to accomplish or that simply sparks your interest. Feedback on how to improve the book is always welcome.

Happy reading!

Who this book is for

If you want to leverage the power of both Scala and open source libraries such as Spark ML, Deeplearning4j, H2O, MXNet, and Zeppelin to make sense of Big Data, then this book is for you. A strong understanding of Scala and the Scala Play Framework is recommended. Basic familiarity with ML techniques will be an added advantage.

What this book covers

Chapter 1, Analyzing Insurance Severity Claims, shows how to develop a predictive model for analyzing insurance severity claims using some widely used regression techniques. We will demonstrate how to deploy this model in a production-ready environment.

Chapter 2, Analyzing and Predicting Telecommunication Churn, uses the Orange Telecoms Churn dataset, consisting of cleaned customer activity and churn labels specifying whether customers canceled their subscription or not, to develop a real-life predictive model.

Chapter 3, High-Frequency Bitcoin Price Prediction from Historical and Live Data, shows how to develop a real-life project that collects historical and live data. We predict the Bitcoin price for the upcoming weeks, months, and so on. In addition, we demonstrate how to generate a simple signal for online trading in Bitcoin. Finally, this chapter wraps up the whole application as a web app using the Scala Play Framework.

Chapter 4, Population-Scale Clustering and Ethnicity Prediction, uses genomic variation data from the 1000 Genomes Project to apply the K-means clustering approach to scalable genomic data analysis. This is aimed at clustering genotypic variants at the population scale. Finally, we train deep neural network and random forest models to predict ethnicity.

Chapter 5, Topic Modeling in NLP – A Better Insight into Large-Scale Texts, shows how to develop a topic modeling application by utilizing the Spark-based LDA algorithm and Stanford NLP to handle large-scale raw texts.

Chapter 6, Developing Model-Based Movie Recommendation Engines, shows how to develop a scalable movie recommendation engine by inter-operating between singular value decomposition, ALS, and matrix factorization. The MovieLens dataset will be used for this end-to-end project.

Chapter 7, Options Trading using Q-Learning and the Scala Play Framework, applies the Q-learning reinforcement learning algorithm to real-life IBM stock datasets and designs a machine learning system driven by criticism and rewards. The goal is to develop a real-life options trading application. The chapter wraps up the whole application as a web app using the Scala Play Framework.

Chapter 8, Clients Subscription Assessment for Bank Telemarketing using Deep Neural Networks, is an end-to-end project that shows how to solve a real-life problem called client subscription assessment. An H2O deep neural network will be trained using a bank telemarketing dataset. Finally, the chapter evaluates the performance of this predictive model.

Chapter 9, Fraud Analytics using Autoencoders and Anomaly Detection, uses autoencoders and the anomaly detection technique for fraud analytics. The dataset used is a fraud detection dataset collected and analyzed during a research collaboration by Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles).

Chapter 10, Human Activity Recognition using Recurrent Neural Networks, includes another end-to-end project that shows how to use an RNN implementation called LSTM for human activity recognition using a smartphone sensor dataset.

Chapter 11, Image Classification using Convolutional Neural Networks, demonstrates how to develop predictive analytics applications such as image classification, using convolutional neural networks on a real image dataset from Yelp.

To get the most out of this book

This book is intended for developers, data analysts, and deep learning enthusiasts who do not have much background in complex numerical computations but want to know what deep learning is. A strong understanding of Scala and its functional programming concepts is recommended. Some basic, high-level knowledge of Spark ML, H2O, Zeppelin, DeepLearning4j, and MXNet will make the material easier to follow. Additionally, basic know-how of build tools such as Maven and SBT is assumed.

All the examples have been implemented using Scala on Ubuntu 16.04 LTS 64-bit and Windows 10 64-bit. You will also need the following (preferably the latest versions):

  • Apache Spark 2.0.0 (or higher)
  • MXNet, Zeppelin, DeepLearning4j, and H2O (see the details in each chapter and in the supplied pom.xml files)
  • Hadoop 2.7 (or higher)
  • Java (JDK and JRE) 1.7+/1.8+
  • Scala 2.11.x (or higher)
  • Eclipse Mars or Luna (latest) with Maven plugin (2.9+), Maven compiler plugin (2.3.2+), and Maven assembly plugin (2.4.1+)
  • IntelliJ IDE
  • SBT plugin and Scala Play Framework installed
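
For readers who prefer SBT over the supplied pom.xml files, the requirements listed above could be expressed roughly as follows. This is only a minimal sketch: the project name is hypothetical, and the Scala and Spark versions shown are assumptions chosen to satisfy the "2.11.x" and "2.0.0 (or higher)" minimums, not the versions pinned by the book's code bundle. The additional libraries (MXNet, DeepLearning4j, H2O) are deliberately left out, since their coordinates and versions are given per chapter.

```scala
// build.sbt: illustrative sketch only, not taken from the book's code bundle

name := "scala-ml-sandbox"      // hypothetical project name
version := "0.1.0"
scalaVersion := "2.11.12"       // assumption: any 2.11.x release meets the stated minimum

// Assumption: a Spark 2.x release published for Scala 2.11
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion   // Spark ML pipelines and algorithms
)
```

If the versions in a chapter's pom.xml differ from these, prefer the ones in the pom.xml.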

You need a computer with at least a Core i3 processor; a Core i5 is recommended, and a Core i7 gives the best results. Multicore processing will provide faster data processing and better scalability. At least 8 GB RAM is recommended for standalone mode; use at least 32 GB RAM for a single VM and more for a cluster. You should have enough storage for running heavy jobs (depending on the dataset size you will be handling); preferably, at least 50 GB of free disk storage (for standalone and for SQL Warehouse).

Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, CentOS, and many more). For Ubuntu, for example, a complete installation of 14.04 (LTS) 64-bit or later is recommended, either natively or under VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).
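
Before starting the projects, it can help to confirm that the Scala and Java versions on your machine meet the minimums listed above. The following is a small sketch of such a check; the object name EnvCheck and its helper methods are hypothetical and are not part of the book's code. It only checks the lower bounds (Scala 2.11 and Java 1.7) and says nothing about whether a much newer JDK is compatible with a given Spark release.

```scala
object EnvCheck {
  // Matches the leading "major" or "major.minor" part of a version string,
  // for example "2.11.12", "1.8.0_151", or "11.0.2".
  private val VersionPattern = """^(\d+)(?:\.(\d+))?.*""".r

  /** Extracts (major, minor) from a version string; minor defaults to 0 when absent. */
  def majorMinor(version: String): Option[(Int, Int)] = version match {
    case VersionPattern(major, minor) =>
      // The optional minor group is null when the string has no ".minor" part
      Some((major.toInt, Option(minor).map(_.toInt).getOrElse(0)))
    case _ => None
  }

  /** True if the version is at least minMajor.minMinor; unparseable strings yield false. */
  def atLeast(version: String, minMajor: Int, minMinor: Int): Boolean =
    majorMinor(version).exists { case (major, minor) =>
      major > minMajor || (major == minMajor && minor >= minMinor)
    }

  def main(args: Array[String]): Unit = {
    val scalaVersion = scala.util.Properties.versionNumberString
    val javaVersion  = System.getProperty("java.version")

    val scalaStatus = if (atLeast(scalaVersion, 2, 11)) "OK" else "older than 2.11"
    val javaStatus  = if (atLeast(javaVersion, 1, 7)) "OK" else "older than 1.7"

    println(s"Scala $scalaVersion: $scalaStatus")
    println(s"Java $javaVersion: $javaStatus")
  }
}
```

Running the object prints one line per tool with the detected version and whether it meets the minimum.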

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packtpub.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-Machine-Learning-Projects. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/ScalaMachineLearningProjects_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(numFolds)

Scala functional code blocks look as follows:

def variantId(genotype: Genotype): String = {
      val name = genotype.getVariant.getContigName
      val start = genotype.getVariant.getStart
      val end = genotype.getVariant.getEnd
      s"$name:$start:$end"
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

var paramGrid = new ParamGridBuilder()
      .addGrid(dTree.impurity, "gini" :: "entropy" :: Nil)
      .addGrid(dTree.maxBins, 3 :: 5 :: 9 :: 15 :: 23 :: 31 :: Nil)
      .addGrid(dTree.maxDepth, 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil)
      .build()

Any command-line input or output is written as follows:

$ sudo mkdir Bitcoin
$ cd Bitcoin

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.