Book Image

Apache Spark 2.x Machine Learning Cookbook

By : Mohammed Guller, Siamak Amirghodsi, Shuen Mei, Meenakshi Rajendran, Broderick Hall
Book Image

Apache Spark 2.x Machine Learning Cookbook

By: Mohammed Guller, Siamak Amirghodsi, Shuen Mei, Meenakshi Rajendran, Broderick Hall

Overview of this book

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks. This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of code examples that will be covered in various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by uncovering the various Spark APIs and the implementation of ML algorithms with developing classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we’ll focus on building high-end applications and explain various unsupervised methodologies and challenges to tackle when implementing with big data ML systems.
Table of Contents (20 chapters)
Title Page
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Identifying data sources for practical machine learning


Getting data for machine learning projects a challenge in the past. However, now there is a rich set of public data sources specifically suitable for machine learning.

Getting ready

In addition to the university and government sources, there are many other open sources of data that can be used to learn and code your own examples and projects. We will list the data sources and show you how to best obtain and download data for each chapter.

How to do it...

The following is a list of open source data worth exploring if you would like to develop applications in this field:

  • UCI machine learning repository: This is an extensive library with search functionality. At the time of writing, there were more than 350 datasets. You can click on the https://archive.ics.uci.edu/ml/index.html link to see all the datasets or look for a specific set using a simple search (Ctrl + F).
  • Kaggle datasets: You need to create an account, but you can download any sets for learning as well as for competing in machine learning competitions. The https://www.kaggle.com/competitions link provides details for exploring and learning more about Kaggle, and the inner workings of machine learning competitions.
  • MLdata.org: A public site open to all with a repository of datasets for machine learning enthusiasts.
  • Google Trends: You can find statistics on search volume (as a proportion of total search) for any given term since 2004 on http://www.google.com/trends/explore.
  • The CIA World Factbook: The https://www.cia.gov/library/publications/the-world-factbook/ link provides information on the history, population, economy, government, infrastructure, and military of 267 countries.

See also

Other sources for learning data:

There are some datasets (for example, text analytics in Spanish, and gene and IMF data) that might be of some interest to you: