Practical Data Analysis - Second Edition

By: Hector Cuesta, Dr. Sampath Kumar

Overview of this book

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you'll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data, such as text, images, social network graphs, documents, and time series, and show you how to implement large-scale data processing with MongoDB and Apache Spark.

Tools and toys for this book


The main goal of this book is to provide self-contained projects that are ready to deploy. To do this, we will use and implement tools such as Python, D3, and MongoDB as we go through the book. These tools will help you program and deploy the projects. You can also download all the code from the author's GitHub repository:

https://github.com/hmcuesta

You can find a detailed installation and setup process for all the tools in the Appendix, Setting Up the Infrastructure.

Why Python?

Python is a "scripting language": an interpreted language with built-in memory management and good facilities for calling and cooperating with other programs. There are two popular versions, 2.7 and 3.x. This book focuses on 3.x because it is under active development and has already seen over two years of stable releases.

Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has a powerful standard library and a wealth of third-party packages for numerical computation and machine learning, such as NumPy, SciPy, pandas, scikit-learn, and mlpy.

Python is approachable for beginners yet powerful enough for experts, and it scales from small scripts to large projects. It is also object-oriented and easily extensible.

Python is widely used by organizations such as Google, Yahoo Maps, NASA, Red Hat, Raspberry Pi, IBM, and many more. A longer list is available at:

http://wiki.python.org/moin/OrganizationsUsingPython

Python has excellent documentation and examples:

http://docs.python.org/3/

The latest Python software is available for free, even for commercial products, and can be downloaded from here:

http://python.org/

Why mlpy?

mlpy (Machine Learning Python) is a module built on top of NumPy, SciPy, and the GNU Scientific Library (GSL). It is open source and supports Python 3.x. mlpy provides a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are as follows:

  • Regression: Support Vector Machines (SVM)

  • Classification: SVM, k-nearest-neighbor (k-NN), classification tree

  • Clustering: k-means, multidimensional scaling

  • Dimensionality Reduction: Principal Component Analysis (PCA)

  • Misc: Dynamic Time Warping (DTW) distance
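To give a feel for the last item in the list above, here is a minimal sketch of the Dynamic Time Warping distance written in plain Python. It is not mlpy's API or implementation: the function name dtw_distance and the sample series are hypothetical, and the local cost is assumed to be the absolute difference between two values.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW (illustrative sketch, not mlpy).

    The local cost between two points is assumed to be their
    absolute difference.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated cost of aligning
    # the first i points of a with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical series: the second one lingers on its first value.
series_a = [1, 2, 3, 4]
series_b = [1, 1, 2, 3, 4]
print(dtw_distance(series_a, series_b))    # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))  # 2.0
```

The first pair has distance zero because DTW is allowed to stretch one series in time; a point-by-point comparison of the same series would not even be defined, since their lengths differ.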

We can download the latest version of mlpy from here: http://mlpy.sourceforge.net/

Reference: D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, C. Furlanello. mlpy: Machine Learning Python, 2012: http://arxiv.org/abs/1202.6548.

Why D3.js?

D3.js (Data-Driven Documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the Document Object Model (DOM), and it runs in a browser without a plugin. With D3.js you can manipulate any element of the DOM, so it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, many examples, and an active community.

We can download the latest version of D3.js from:

https://d3js.org/

Why MongoDB?

NoSQL is a term that covers different types of data storage technology used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes data as collections of documents, which lets you store view models almost exactly as you model them in the application. You can also perform complex searches and elementary data mining with MapReduce.
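The MapReduce idea mentioned above can be sketched in plain Python without a database. This is only an illustration of the pattern, not MongoDB's API: the documents, the field names, and the helper functions map_tags, reduce_counts, and map_reduce are all hypothetical.

```python
from collections import defaultdict

# Hypothetical documents, shaped like records in a document collection.
documents = [
    {"user": "ana", "tags": ["python", "d3"]},
    {"user": "luis", "tags": ["python", "mongodb"]},
    {"user": "mei", "tags": ["mongodb"]},
]


def map_tags(doc):
    """Map step: emit one (tag, 1) pair for every tag in a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


tag_counts = map_reduce(documents, map_tags, reduce_counts)
print(tag_counts)  # {'python': 2, 'd3': 1, 'mongodb': 2}
```

Keeping the map and reduce steps as separate functions mirrors how the pattern is usually described: the map step only looks at one document at a time, and the reduce step only sees the values collected for a single key.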

MongoDB is highly scalable, robust, and works perfectly with JavaScript-based web applications because you can store your data in a JSON document and implement a flexible schema, which makes it perfect for unstructured data.
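As a rough illustration of what a flexible schema means, the sketch below keeps two differently shaped documents side by side and round-trips them through JSON using only Python's standard json module. The documents and their field names are hypothetical, and no MongoDB driver is involved.

```python
import json

# Two hypothetical documents in the same collection. They do not
# share the same fields, and one of them nests a list of sub-documents.
blog_post = {
    "type": "post",
    "title": "Hello, data",
    "tags": ["intro", "analysis"],
    "comments": [{"author": "ana", "text": "Nice start"}],
}
user_profile = {
    "type": "profile",
    "name": "luis",
    "followers": 42,
}
collection = [blog_post, user_profile]

# Serialize to JSON text and read it back without declaring a schema.
encoded = json.dumps(collection, indent=2)
decoded = json.loads(encoded)

for doc in decoded:
    # Each document carries its own set of fields.
    print(doc["type"], sorted(doc.keys()))
```

Because each document describes its own structure, adding a new field to one record does not require changing the others, which is the property that makes this model convenient for unstructured data.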

MongoDB is used by well-known corporations like Foursquare, Craigslist, Firebase, SAP, and Forbes; we can see a detailed list of users at:

https://www.mongodb.com/industries

MongoDB has a large, active community and well-written documentation:

http://docs.mongodb.org/manual/

MongoDB is easy to learn and it's free. We can download MongoDB from here:

http://www.mongodb.org/downloads