Mastering Java Machine Learning

By: Uday Kamath, Krishna Choppella

Overview of this book

Java is one of the main languages used by practicing data scientists; much of the Hadoop ecosystem is Java-based, and it is certainly the language that most production systems in Data Science are written in. If you know Java, Mastering Java Machine Learning is your next step on the path to becoming an advanced practitioner in Data Science. This book aims to introduce you to an array of advanced techniques in machine learning, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, deep learning, and big data batch and stream machine learning. Accompanying each chapter are illustrative examples and real-world case studies that show how to apply the newly learned techniques using sound methodologies and the best Java-based tools available today. On completing this book, you will have an understanding of the tools and techniques for building powerful machine learning models to solve data science problems in just about any domain.

Machine learning – roles and process

Applying machine learning to a large problem requires collaboration among several roles, each following a set of systematic processes designed for rigor, efficiency, and robustness. The following roles and processes ensure that the goals of the endeavor are clearly defined at the outset and that the correct methodologies are employed in data analysis, data sampling, model selection, deployment, and performance evaluation. Together they form a comprehensive framework for conducting analytics consistently and repeatably.

Roles

Participants play specific parts in each step. These responsibilities are captured in the following four roles:

Business domain expert: A subject matter expert with knowledge of the problem domain
Data engineer: Collects, transforms, and cleans the data
Project manager: Oversees the smooth running of the process
Data scientist or machine learning expert: Applies descriptive or predictive analytic techniques

Process

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a well-known high-level process model that defines the analytics process. In this section, we add some of our own extensions to CRISP-DM that make it more comprehensive and better suited to analytics using machine learning. The entire iterative process is shown in the following schematic figure. We will discuss each step of the process in detail in this section.

Identifying the business problem: Understanding the objectives and end goals of the project or process is the first step. This is normally carried out by a business domain expert in conjunction with the project manager and machine learning expert. Questions about data availability, formats, specification, and collection, as well as ROI, business value, and deliverables, are discussed in this phase. An important objective of this phase is to state the goals clearly and, where possible, in quantifiable terms, such as a dollar amount saved, a pre-defined number of anomalies or clusters to find, or a maximum number of false positives to predict.
Machine learning mapping: The next step is mapping the business problem to one or more machine learning types discussed in the preceding section. This step is generally carried out by the machine learning expert. In it, we determine whether we should use just one form of learning (for example, supervised, unsupervised, semi-supervised) or if a hybrid of forms is more suitable for the project.
Data collection: Obtaining the raw data in the agreed format and specification for processing comes next. This step is normally carried out by data engineers and may require some basic ETL steps.
Data quality analysis: In this step, we analyze the data for missing values, duplicates, and so on, and conduct basic statistical analysis on the categorical and continuous types to evaluate the quality of the data. Data engineers and data scientists can perform these tasks together.
Data sampling and transformation: In this step, we determine whether the data needs to be divided into samples and draw samples of various sizes for training, validation, or testing. This consists of employing different sampling techniques, such as oversampling and random sampling of the training datasets, so that the algorithms learn effectively, especially when the labels are highly imbalanced. The data scientist is involved in this task.
Feature analysis and selection: This is an iterative process, combined with modeling in many tasks, that analyzes the features for either their discriminating value or their effectiveness. It can involve finding new features, transforming existing features, handling the data quality issues mentioned earlier, selecting a subset of features, and so on, ahead of the modeling process. The data scientist is normally assigned this task.
Machine learning modeling: This is an iterative process that works through different algorithms chosen according to data characteristics and learning types. It involves steps such as generating hypotheses, selecting algorithms, tuning parameters, and evaluating the results to find models that meet the criteria. The data scientist carries out this task.
Model evaluation: While this step is related to all the preceding steps to some degree, it is more closely linked to the business understanding phase and the machine learning mapping phase. The evaluation criteria must map in some way to the business problem or the goal. Each problem or project has its own goal, whether that be improving true positives, reducing false positives, finding anomalous clusters or behaviors, or analyzing data for different clusters. The techniques used to measure these targets, implicitly or explicitly, depend on the learning type. Data scientists and business domain experts normally take part in this step.
Model selection and deployment: Based on the evaluation criteria, one or more models, either independent or combined as an ensemble, are selected. The deployment of models normally needs to address several issues: runtime scalability measures, execution specifications of the environment, and audit information, to name a few. Audit information that captures the key parameters of the learning is an essential part of the process. It ensures that model performance can be tracked and compared to check for the deterioration and aging of the models. Saving key information, such as training data volumes, dates, data quality analysis, and so on, is independent of learning types. Supervised learning might involve saving the confusion matrix, true positive ratios, false positive ratios, area under the ROC curve, precision, recall, error rates, and so on. Unsupervised learning might involve clustering or outlier evaluation results, cluster statistics, and so on. This is the domain of the data scientist, as well as the project manager.
Model performance monitoring: This task involves periodically tracking the model performance in terms of the criteria it was evaluated against, such as the true positive rate, false positive rate, performance speed, memory allocation, and so on. It is imperative to measure how these metrics deviate between successive evaluations of the trained model's performance. The deviations, together with the tolerance allowed for them, indicate when to repeat the process or retune the models as time progresses. The data scientist is responsible for this stage.
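
The supervised-learning audit metrics named in the model selection and deployment step can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch of that arithmetic in plain Java. The class name ConfusionMatrixAudit and its methods are illustrative assumptions, not code from the book or from any library, and the labels in main are made-up sample data.

```java
/** Illustrative sketch: binary confusion-matrix counts and metrics derived from them. */
final class ConfusionMatrixAudit {
    final long tp, fp, tn, fn;

    ConfusionMatrixAudit(long tp, long fp, long tn, long fn) {
        if (tp < 0 || fp < 0 || tn < 0 || fn < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    /** Counts outcomes from parallel arrays of actual and predicted labels (true = positive). */
    static ConfusionMatrixAudit fromLabels(boolean[] actual, boolean[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays must have the same length");
        }
        long tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] && predicted[i]) tp++;
            else if (actual[i]) fn++;
            else if (predicted[i]) fp++;
            else tn++;
        }
        return new ConfusionMatrixAudit(tp, fp, tn, fn);
    }

    /** Returns NaN when the denominator is zero, so an undefined metric is not reported as 0. */
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? Double.NaN : (double) numerator / denominator;
    }

    /** True positive rate, also called recall: TP / (TP + FN). */
    double truePositiveRate() { return ratio(tp, tp + fn); }

    /** False positive rate: FP / (FP + TN). */
    double falsePositiveRate() { return ratio(fp, fp + tn); }

    /** Precision: TP / (TP + FP). */
    double precision() { return ratio(tp, tp + fp); }

    /** Error rate: (FP + FN) / all instances. */
    double errorRate() { return ratio(fp + fn, tp + fp + tn + fn); }

    public static void main(String[] args) {
        // Made-up labels, used only to exercise the methods.
        boolean[] actual    = {true, true,  true, false, false, false, false, true};
        boolean[] predicted = {true, false, true, false, true,  false, false, true};
        ConfusionMatrixAudit m = fromLabels(actual, predicted);
        System.out.println("TP=" + m.tp + " FP=" + m.fp + " TN=" + m.tn + " FN=" + m.fn);
        System.out.println("TPR/recall=" + m.truePositiveRate()
                + " FPR=" + m.falsePositiveRate()
                + " precision=" + m.precision()
                + " errorRate=" + m.errorRate());
    }
}
```

Storing the raw counts alongside the derived ratios keeps the audit record recomputable: any later metric based on the same four counts can be added without rerunning the evaluation.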

As the preceding diagram shows, the entire process is iterative. After a model or set of models has been deployed, business and environmental factors may change in ways that affect the performance of the solution, requiring a re-evaluation of business goals and success criteria. This takes us back through the cycle again.
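
One simple way to decide when to go back through the cycle is the tolerance check described under model performance monitoring: compare each metric between two successive evaluations and flag those that moved by more than an allowed amount. The sketch below assumes a single absolute tolerance shared by all metrics, which is a simplification; the class name ModelDriftCheck, the method name, the metric names, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: flag metrics that deviate beyond a tolerance between two evaluations. */
final class ModelDriftCheck {

    /**
     * Returns the names of metrics whose absolute change from the previous
     * evaluation to the latest one exceeds the tolerance. Metrics missing
     * from the latest evaluation are skipped in this sketch.
     */
    static List<String> metricsOutOfTolerance(Map<String, Double> previous,
                                              Map<String, Double> latest,
                                              double tolerance) {
        List<String> drifted = new ArrayList<>();
        for (Map.Entry<String, Double> entry : previous.entrySet()) {
            Double now = latest.get(entry.getKey());
            if (now == null) {
                continue; // not measured this time
            }
            if (Math.abs(now - entry.getValue()) > tolerance) {
                drifted.add(entry.getKey());
            }
        }
        return drifted;
    }

    public static void main(String[] args) {
        // Made-up metric values for two successive evaluations of one deployed model.
        Map<String, Double> previous = new LinkedHashMap<>();
        previous.put("truePositiveRate", 0.90);
        previous.put("falsePositiveRate", 0.05);

        Map<String, Double> latest = new LinkedHashMap<>();
        latest.put("truePositiveRate", 0.80);
        latest.put("falsePositiveRate", 0.06);

        List<String> drifted = metricsOutOfTolerance(previous, latest, 0.05);
        System.out.println(drifted.isEmpty()
                ? "Within tolerance"
                : "Retune or repeat the process; drifted metrics: " + drifted);
    }
}
```

In practice, each metric would likely need its own tolerance, and a flagged metric is a prompt to revisit the business goals and success criteria rather than an automatic trigger for retraining.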
