Mastering Java Machine Learning

By : Uday Kamath, Krishna Choppella

Overview of this book

Java is one of the main languages used by practicing data scientists; much of the Hadoop ecosystem is Java-based, and it is certainly the language that most production systems in data science are written in. If you know Java, Mastering Java Machine Learning is your next step on the path to becoming an advanced practitioner in data science. This book aims to introduce you to an array of advanced techniques in machine learning, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, deep learning, and big data batch and stream machine learning. Accompanying each chapter are illustrative examples and real-world case studies that show how to apply the newly learned techniques using sound methodologies and the best Java-based tools available today. On completing this book, you will have an understanding of the tools and techniques for building powerful machine learning models to solve data science problems in just about any domain.

Mastering Java Machine Learning

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Free Chapter

Machine Learning Review

Machine learning – history and definition

What is not machine learning?

Machine learning – concepts and terminology

Machine learning – types and subtypes

Datasets used in machine learning

Machine learning applications

Practical issues in machine learning

Machine learning – roles and process

Machine learning – tools and datasets

Summary

Practical Approach to Real-World Supervised Learning

Formal description and notation

Data transformation and preprocessing

Feature relevance analysis and dimensionality reduction

Model building

Model assessment, evaluation, and comparisons

Case Study – Horse Colic Classification

Summary

References

Unsupervised Machine Learning Techniques

Issues in common with supervised learning

Issues specific to unsupervised learning

Feature analysis and dimensionality reduction

Clustering

Outlier or anomaly detection

Real-world case study

Summary

References

Semi-Supervised and Active Learning

Semi-supervised learning

Active learning

Case study in active learning

Summary

References

Real-Time Stream Machine Learning

Assumptions and mathematical notations

Basic stream processing and computational techniques

Concept drift and drift detection

Incremental supervised learning

Incremental unsupervised learning using clustering

Unsupervised learning using outlier detection

Case study in stream learning

Summary

References

Probabilistic Graph Modeling

Probability revisited

Graph concepts

Bayesian networks

Markov networks and conditional random fields

Summary

Deep Learning

Multi-layer feed-forward neural network

Limitations of neural networks

Deep learning

Case study

Summary

References

Text Mining and Natural Language Processing

NLP, subfields, and tasks

Issues with mining unstructured data

Text processing components and transformations

Topics in text mining

Tools and usage

Summary

References

Big Data Machine Learning – The Final Frontier

What are the characteristics of Big Data?

Big Data Machine Learning

Batch Big Data Machine Learning

Case study

Linear Algebra

Vector

Matrix

Probability

Axioms of probability

Bayes' theorem

Index

Practical issues in machine learning

It is necessary to appreciate the constraints and potentially sub-optimal conditions one may face when dealing with problems that require machine learning. The nature of these issues, the impact of their presence, and the methods for dealing with them are addressed throughout the coming chapters. Here, we present a brief introduction to the practical issues that confront us:

Data quality and noise: Missing values, duplicate values, incorrect values due to human or instrument recording error, and incorrect formatting are some of the important issues to be considered while building machine learning models. Not addressing data quality can result in incorrect or incomplete models. In the next chapter, we will highlight some of these issues and some strategies to overcome them through data cleansing.
Imbalanced datasets: In many real-world datasets, there is an imbalance among labels in the training data. This imbalance affects the choice of learning, the process of selecting algorithms, and model evaluation and verification. If the right techniques are not employed, the models can suffer large biases, and the learning is not effective. The next few chapters detail various techniques that can be employed in these situations, such as cost-sensitive learning, ensemble learning, and outlier detection, several of which use meta-learning processes.
Data volume, velocity, and scalability: Often, a large volume of data exists in raw form or as real-time streaming data at high speed. Learning from the entire dataset becomes infeasible due to constraints inherent in the algorithms, hardware limitations, or a combination of the two. To reduce the dataset to a size that fits the available resources, the data must be sampled. Sampling can be done in many ways, and each form of sampling introduces a bias. Models must therefore be validated against sample bias using techniques such as stratified sampling, varying sample sizes, and increasing the size of experiments on different sets. Big data machine learning can also overcome the volume and sampling biases.
Overfitting: One of the core problems in predictive modeling is that the model fits the given training data too well and does not generalize enough. This results in poor performance when the model is applied to unseen data. Various techniques described in later chapters address this issue.
Curse of dimensionality: When dealing with high-dimensional data, that is, datasets with a large number of features, scalability of machine learning algorithms becomes a serious concern. One of the issues with adding more features to the data is that it introduces sparsity, that is, there are now fewer data points on average per unit volume of feature space unless an increase in the number of features is accompanied by an exponential increase in the number of training examples. This can hamper performance in many methods, such as distance-based algorithms. Adding more features can also deteriorate the predictive power of learners, as illustrated in the following figure. In such cases, a more suitable algorithm is needed, or the dimensionality of the data must be reduced.
Figure: Curse of dimensionality illustrated in classification learning, where adding more features deteriorates classifier performance.
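
Stratified sampling, mentioned above as one way to guard against sample bias, can be sketched in a few lines of plain Java. This is a minimal illustration and not the book's implementation: the class name StratifiedSampler, the method sample, and the rule of keeping at least one row per class are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper for illustration only; not taken from the book.
public class StratifiedSampler {

    /**
     * Returns row indices of a sample that keeps roughly the same
     * label proportions as the full dataset.
     */
    public static List<Integer> sample(List<String> labels, double fraction, long seed) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        // Group row indices by class label (one stratum per label).
        Map<String, List<Integer>> strata = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            strata.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }
        Random rng = new Random(seed);
        List<Integer> sampled = new ArrayList<>();
        for (List<Integer> stratum : strata.values()) {
            // Shuffle within the stratum, then take the same fraction from each.
            Collections.shuffle(stratum, rng);
            // Assumption: keep at least one row per class so rare labels survive.
            int keep = Math.max(1, (int) Math.round(stratum.size() * fraction));
            sampled.addAll(stratum.subList(0, keep));
        }
        return sampled;
    }

    public static void main(String[] args) {
        // Imbalanced toy labels: 90 negative, 10 positive.
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 90; i++) labels.add("negative");
        for (int i = 0; i < 10; i++) labels.add("positive");

        List<Integer> idx = sample(labels, 0.2, 42L);
        long positives = idx.stream().filter(i -> labels.get(i).equals("positive")).count();
        System.out.println("sample size = " + idx.size() + ", positives = " + positives);
    }
}
```

Sampling each class separately means a 20% sample of the toy data keeps the 9:1 label ratio, whereas a plain random sample of the same size could by chance contain very few positive rows.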
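
The effect of high dimensionality on distance-based algorithms can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not code from the book: it draws points uniformly at random from the unit hypercube (an assumed data distribution) and reports the ratio of the nearest to the farthest Euclidean distance from a random query point. The class name DistanceConcentration and its method are hypothetical.

```java
import java.util.Random;

// Hypothetical demo class for illustration only; not taken from the book.
public class DistanceConcentration {

    /**
     * Ratio of the nearest to the farthest Euclidean distance from a random
     * query point to n uniformly random points in the d-dimensional unit cube.
     */
    public static double nearestToFarthestRatio(int n, int d, long seed) {
        Random rng = new Random(seed);
        double[] query = new double[d];
        for (int j = 0; j < d; j++) {
            query[j] = rng.nextDouble();
        }

        double min = Double.POSITIVE_INFINITY;
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            // Generate each point on the fly and accumulate its squared distance.
            double sumSq = 0.0;
            for (int j = 0; j < d; j++) {
                double diff = rng.nextDouble() - query[j];
                sumSq += diff * diff;
            }
            double dist = Math.sqrt(sumSq);
            min = Math.min(min, dist);
            max = Math.max(max, dist);
        }
        return min / max;
    }

    public static void main(String[] args) {
        // Same number of points, increasing number of features.
        for (int d : new int[] {2, 10, 100, 1000}) {
            System.out.printf("d=%4d  nearest/farthest = %.3f%n",
                    d, nearestToFarthestRatio(1000, d, 1L));
        }
    }
}
```

With the number of points held fixed, the ratio moves toward 1 as the number of features grows: the nearest and farthest neighbors become almost equally distant, which is one reason nearest-neighbor style methods lose discriminating power in sparse, high-dimensional spaces.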
