Scala Machine Learning Projects

Overview of this book

Machine learning has had a huge impact on academia and industry by turning data into actionable information. Scala has seen a steady rise in adoption over the past few years, especially in the fields of data science and analytics. This book is for data scientists, data engineers, and deep learning enthusiasts who have a background in complex numerical computing and want to learn more about hands-on machine learning application development. If you're well versed in machine learning concepts and want to expand your knowledge by delving into the practical implementation of these concepts using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will become acquainted with popular machine learning libraries such as Spark ML, H2O, DeepLearning4j, and MXNet. By the end, you will be able to use numerical computing and functional programming to carry out complex numerical tasks and to develop, build, and deploy research or commercial projects in a production-ready environment.

Topic modeling and text clustering


In topic modeling (TM), a topic is defined by a cluster of words, where each word in the cluster has a probability of occurrence for the given topic; different topics have their own clusters of words with corresponding probabilities. Different topics may share some words, and a document can have more than one topic associated with it. In short, we have a collection of text data, that is, a set of text files. The challenging part is finding useful patterns in the data using Latent Dirichlet Allocation (LDA).
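To make the idea of a topic as a weighted cluster of words more concrete, here is a minimal sketch in plain Scala (no Spark required). The topic names, words, probabilities, and the helper wordProbability are all made up for illustration and do not come from the book's datasets or code. The sketch shows two topics that share some words, a document associated with both topics, and how the probability of seeing a word in that document follows from mixing the per-topic word probabilities.

```scala
// Illustrative data only: topic names, words, and probabilities are invented.
object TopicMixtureSketch {
  // A topic is a probability distribution over words.
  type Topic = Map[String, Double]

  // Two hypothetical topics; note that they share the words "game" and "market".
  val topics: Map[String, Topic] = Map(
    "sports"  -> Map("game" -> 0.5, "team" -> 0.4, "market" -> 0.1),
    "finance" -> Map("market" -> 0.6, "stock" -> 0.3, "game" -> 0.1)
  )

  // A document can be associated with more than one topic; the weights sum to 1.
  val docTopicWeights: Map[String, Double] = Map("sports" -> 0.7, "finance" -> 0.3)

  // P(word | document) = sum over topics of P(topic | document) * P(word | topic).
  // Words that a topic does not contain contribute probability 0.
  def wordProbability(word: String,
                      weights: Map[String, Double],
                      topics: Map[String, Topic]): Double =
    weights.toSeq.map { case (name, weight) =>
      weight * topics.get(name).flatMap(_.get(word)).getOrElse(0.0)
    }.sum

  def main(args: Array[String]): Unit = {
    // Print the mixed probability of each word for the example document.
    Seq("game", "team", "market", "stock").foreach { w =>
      println(f"$w%-7s ${wordProbability(w, docTopicWeights, topics)}%.2f")
    }
  }
}
```

In this sketch the topics and document weights are given by hand. In a real TM task they are unknown, and an algorithm such as LDA estimates them from the text files.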

A popular TM approach is based on LDA: each document is considered a mixture of topics, and each word in a document is considered randomly drawn from the document's topics. The topics are considered hidden and must be uncovered by analyzing joint distributions to compute the conditional distribution of the hidden variables (topics), given the observed variables (the words in documents). The TM technique is widely used in the task of mining text from a large collection...