Machine learning – tools and datasets
A reliable way to master the techniques needed to complete a machine learning project of any size or complexity is to become familiar with the available tools and frameworks by running experiments on widely used datasets, as the chapters to follow demonstrate. The following list is a short survey of the most popular Java frameworks. Later chapters include experiments that you will carry out using these tools:
RapidMiner: A leading analytics platform, RapidMiner has multiple offerings, including Studio, a visual design framework for processes, Server, a product to facilitate a collaborative environment by enabling sharing of data sources, processes, and practices, and Radoop, a system with translations to enable deployment and execution on the Hadoop ecosystem. RapidMiner Cloud provides a cloud-based repository and on-demand computing power.
License: GPL (Community Edition) and Commercial (Enterprise Edition)
Website: https://rapidminer.com/
Weka: This is a comprehensive open source Java toolset for data mining and building machine learning applications with its own collection of publicly available datasets.
License: GPL
Website: http://www.cs.waikato.ac.nz/ml/weka/
KNIME: The KNIME Analytics Platform (we are urged to pronounce it with a silent k, as "naime") is written in Java and offers an integrated toolset, a rich set of algorithms, and a visual workflow for doing analytics without the need for standard programming languages such as Java, Python, and R. However, one can write scripts in Java and other languages to implement functionality not available natively in KNIME.
License: GNU GPL v3
Website: https://www.knime.org/
Mallet: This is a Java library for NLP. It offers document classification, sequence tagging, topic modeling, and other text-based applications of machine learning, as well as an API for task pipelines.
License: Common Public License version 1.0 (CPL-1)
Website: http://mallet.cs.umass.edu/
ELKI: This is research-oriented Java software focused primarily on data mining with unsupervised algorithms. It achieves high performance and scalability by using index structures that speed up access to multi-dimensional data.
License: AGPLv3
Website: http://elki.dbs.ifi.lmu.de/
JCLAL: The Java Class Library for Active Learning is an open source framework for developing active learning methods. Active learning is one of the areas that deal with learning predictive models from a mix of labeled and unlabeled data (semi-supervised learning is another).
License: GNU General Public License version 3.0 (GPLv3)
Website: https://sourceforge.net/projects/jclal/
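To make the idea of active learning concrete, here is a minimal sketch of uncertainty sampling, one common query strategy: the learner asks for the label of the unlabeled instance it is least sure about. This is plain Java and does not use the JCLAL API; the class name, the method name, and the probability values are hypothetical and chosen only for illustration.

```java
/** Uncertainty sampling for a binary classifier (illustrative only; not the JCLAL API). */
class UncertaintySampling {

    /**
     * Returns the index of the unlabeled instance whose predicted positive-class
     * probability is closest to 0.5, i.e. the one the model is least sure about.
     */
    static int mostUncertain(double[] positiveProbabilities) {
        if (positiveProbabilities.length == 0) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < positiveProbabilities.length; i++) {
            double distance = Math.abs(positiveProbabilities[i] - 0.5);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical model outputs for a pool of four unlabeled instances.
        double[] pool = {0.95, 0.10, 0.55, 0.80};
        System.out.println("Query the label of instance " + mostUncertain(pool));
    }
}
```

In a full active learning loop, the selected instance would be labeled by an oracle, added to the training set, and the model retrained before the next query.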
KEEL: This is open source software written in Java for designing experiments. It is primarily suited to implementing evolutionary learning and soft computing techniques for data mining problems.
License: GPLv3
Website: http://www.keel.es/
DeepLearning4J: This is a distributed deep learning library for Java and Scala. DeepLearning4J is integrated with Spark and Hadoop. Anomaly detection and recommender systems are use cases that lend themselves well to the models generated via deep learning techniques.
License: Apache License 2.0
Website: http://deeplearning4j.org/
Spark-MLlib: Included in the Apache Spark distribution, MLlib is Spark's machine learning library, written mainly in Scala and Java. Since the introduction of DataFrames in Spark, the spark.ml package, which is written on top of DataFrames, is recommended over the original spark.mllib package. MLlib includes support for all stages of the analytics process, including statistical methods, classification and regression algorithms, clustering, dimensionality reduction, feature extraction, model evaluation, and PMML support, among others. Another aspect of MLlib is its support for pipelines or workflows. MLlib is accessible from R, Scala, and Python, in addition to Java.
License: Apache License v2.0
Website: http://spark.apache.org/mllib/
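The notion of a pipeline mentioned in the MLlib entry can be sketched as a chain of stages, each consuming the output of the previous one. The following is a toy illustration in plain Java using function composition; it is not the spark.ml API, and the class name, the stage names, and the input string are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** A toy pipeline built by chaining stages (illustrative only; not the spark.ml API). */
class ToyPipeline {

    // Stage 1: normalize case.
    static final Function<String, String> LOWERCASE = String::toLowerCase;

    // Stage 2: split the text on runs of whitespace.
    static final Function<String, List<String>> TOKENIZE =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // Stage 3: turn tokens into a term-frequency map (a simple feature extractor).
    static final Function<List<String>, Map<String, Integer>> TERM_COUNTS = tokens -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    };

    // The pipeline is the composition of its stages, applied in order.
    static final Function<String, Map<String, Integer>> PIPELINE =
            LOWERCASE.andThen(TOKENIZE).andThen(TERM_COUNTS);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("Spark spark MLlib"));
    }
}
```

The benefit of this design is that the same sequence of transformations is applied identically to training and test data, because the stages travel together as one object.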
H2O: H2O is a Java-based library with API support in R and Python, in addition to Java. H2O can also run on Spark as its own application called Sparkling Water. H2O Flow is a web-based interactive environment with executable cells and rich media in a single notebook-like document.
License: Apache License v2.0
Website: http://www.h2o.ai/
MOA/SAMOA: Aimed at machine learning from data streams with a pluggable interface for stream processing platforms, SAMOA, at the time of writing, is an Apache Incubator project.
License: Apache License v2.0
Website: https://samoa.incubator.apache.org/
Neo4j: Neo4j is an open source NoSQL graph database implemented in Java and Scala. As we will see in later chapters, graph analytics has a variety of use cases, including matchmaking, routing, social networks, network management, and so on. Neo4j supports fully ACID-compliant transactions.
License: Community Edition—GPLv3 and Enterprise Edition—multiple options, including Commercial and Educational (https://neo4j.com/licensing/)
Website: https://neo4j.com/
GraphX: Included in the Apache Spark distribution, GraphX is the graph library accompanying Spark. The API has extensive support for viewing and manipulating graph structures, as well as some graph algorithms, such as PageRank, Connected Components, and Triangle Counting.
License: Apache License v2.0
Website: http://spark.apache.org/graphx/
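To give a feel for one of the graph algorithms named in the GraphX entry, the following minimal sketch counts triangles in a small undirected graph held in memory. It is plain Java and does not use the GraphX API; the class name, the method names, and the example edges are hypothetical. It returns a single global count and makes no attempt at the distributed execution a Spark library would provide.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Counts triangles in a small undirected graph (illustrative only; not the GraphX API). */
class TriangleCount {

    private final Map<Integer, Set<Integer>> adjacency = new HashMap<>();

    /** Adds an undirected edge; self-loops are ignored because they cannot form triangles. */
    void addEdge(int a, int b) {
        if (a == b) {
            return;
        }
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Counts each triangle exactly once by only accepting vertex triples with u < v < w. */
    int countTriangles() {
        int triangles = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : adjacency.entrySet()) {
            int u = entry.getKey();
            Set<Integer> neighborsOfU = entry.getValue();
            for (int v : neighborsOfU) {
                if (v <= u) {
                    continue;
                }
                for (int w : adjacency.get(v)) {
                    // u-v and v-w are edges; the triangle closes if u-w is also an edge.
                    if (w > v && neighborsOfU.contains(w)) {
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }

    public static void main(String[] args) {
        TriangleCount graph = new TriangleCount();
        graph.addEdge(1, 2);
        graph.addEdge(2, 3);
        graph.addEdge(1, 3); // closes the triangle 1-2-3
        graph.addEdge(3, 4); // a dangling edge that belongs to no triangle
        System.out.println("Triangles: " + graph.countTriangles());
    }
}
```

Triangle counts are often used as a building block for measures of how tightly connected a neighborhood is, for example in social network analysis.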
OpenMarkov: OpenMarkov is a tool for editing and evaluating probabilistic graphical models (PGMs). It includes a GUI for interactive learning.
License: EUPLv1.1 (https://joinup.ec.europa.eu/community/eupl/og_page/eupl)
Website: http://www.openmarkov.org/
Smile: Smile is a machine learning platform for the JVM with an extensive library of algorithms. Its capabilities include NLP, manifold learning, association rules, genetic algorithms, and a versatile set of tools for visualization.
License: Apache License 2.0
Website: http://haifengl.github.io/smile/
Datasets
A number of publicly available datasets have aided research and learning in data science immensely. Several of those listed in the following section are well known and have been used by scores of researchers to benchmark their methods over the years. New datasets are constantly being made available to serve different communities of modelers and users. The majority are real-world datasets from different domains. The exercises in this volume will use several datasets from this list.
UC Irvine (UCI) database: Maintained by the Center for Machine Learning and Intelligent Systems at UC Irvine, the UCI database is a catalog of some 350 datasets of varying sizes, from a dozen to more than forty million records and up to three million attributes, with a mix of multivariate text, time-series, and other data types. (https://archive.ics.uci.edu/ml/index.html)
Tunedit: (http://tunedit.org/) This offers Tunedit Challenges and tools to conduct repeatable data mining experiments. It also offers a platform for hosting data competitions.
Mldata.org: (http://mldata.org/) Supported by the PASCAL 2 organization that brings together researchers and students across Europe and the world, mldata.org is primarily a repository of user-contributed datasets that encourages data and solution sharing amongst groups of researchers to help with the goal of creating reproducible solutions.
KDD Challenge Datasets: (http://www.kdnuggets.com/datasets/index.html) KDnuggets aggregates multiple dataset repositories across a wide variety of domains.
Kaggle: Billed as the Home of Data Science, Kaggle is a leading platform for data science competitions and also a repository of datasets from past competitions and user-submitted datasets.
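Many datasets from repositories such as those listed above are distributed as plain comma-separated text, with one record per line. As a first step in working with such data, the following minimal sketch parses a single record into numeric features and a class label. It assumes that the last field is the label and that all other fields are numeric, which will not hold for every dataset; the class name and the sample record are hypothetical and do not come from any real dataset.

```java
/** Parses one comma-separated record into numeric features and a class label. */
class CsvRecord {

    final double[] features;
    final String label;

    private CsvRecord(double[] features, String label) {
        this.features = features;
        this.label = label;
    }

    /** Assumes the last field is the class label and every other field is numeric. */
    static CsvRecord parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least one feature and a label");
        }
        double[] features = new double[fields.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(fields[i].trim());
        }
        return new CsvRecord(features, fields[fields.length - 1].trim());
    }

    public static void main(String[] args) {
        // A made-up record: four numeric attributes followed by a class label.
        CsvRecord record = parse("4.9, 3.0, 1.5, 0.2, class-a");
        System.out.println(record.features.length + " features, label = " + record.label);
    }
}
```

Real files often need more care, such as handling header rows, missing values, and quoted fields, which is one reason the tools surveyed earlier ship their own data loaders.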