Practical Data Analysis

By: Hector Cuesta

Overview of this book

Plenty of small businesses face large amounts of data but lack the internal skills to support quantitative analysis. Understanding how to harness the power of data analysis using the latest open source technology can help them provide better customer service, visualize customer needs, or gain fresh insights into the performance of previous products. Practical Data Analysis is a book ideal for home and small business users who want to slice and dice the data they have on hand with minimum hassle.

Practical Data Analysis is a hands-on guide to understanding the nature of your data and turning it into insight. It will introduce you to the use of machine learning techniques, social network analytics, and econometrics to help your clients get insights from the pool of data they have at hand. Data preparation and processing for several kinds of data, such as text, images, graphs, documents, and time series, will also be covered.

Practical Data Analysis presents a detailed exploration of the current work in data analysis through self-contained projects. First, you will explore the basics of data preparation and transformation through OpenRefine. Then you will get started with exploratory data analysis using the D3.js visualization framework. You will also be introduced to machine learning techniques such as classification, regression, and clustering through practical projects such as spam classification, predicting gold prices, and finding clusters in your Facebook friends' network. You will learn how to solve problems in text classification, simulation, time series forecasting, social media, and MapReduce through detailed projects. Finally, you will work with large amounts of Twitter data using MapReduce to perform a sentiment analysis implemented in Python and MongoDB. Practical Data Analysis contains a combination of carefully selected algorithms and data scrubbing that enables you to turn your data into insight.

Installing and running MongoDB

According to the official website, MongoDB (from humongous) is an open source document database, and the leading NoSQL database. Written in C++, MongoDB features:

  • Document-oriented storage: JSON-style documents with dynamic schemas that offer simplicity and power

  • Full index support: Index on any attribute, just like you're used to

  • Replication and high availability: Mirror across LANs and WANs for scale and peace of mind

  • Auto-sharding: Scale horizontally without compromising functionality

  • Querying: Rich document-based queries

  • Fast in-place updates: Atomic modifiers for contention-free performance

  • Map/Reduce: Flexible aggregation and data processing

  • GridFS: Store files of any size without complicating your stack

  • Commercial support: Enterprise class support, training, and consulting available

Installing and running MongoDB on Ubuntu

The easiest way to install MongoDB is through the Ubuntu Software Center, as shown in the following screenshot:

Finally, just open a terminal and execute mongo, as shown in the following screenshot:

$ mongo

To check whether everything is installed correctly, just execute the Mongo shell as shown in the following screenshot. Insert a record in the test collection and retrieve that record:

> db.test.insert( { a: 1 } )
> db.test.insert( { a: 100 } )
> db.test.find()

Installing and running MongoDB on Windows

Download the latest production release of MongoDB from the official website.

There are two builds of MongoDB for Windows:

  • MongoDB for Windows 64-bit runs on any 64-bit version of Windows newer than Windows XP, including Windows Server 2008 R2 and Windows 7 64-bit.

  • MongoDB for Windows 32-bit runs on any 32-bit version of Windows newer than Windows XP. 32-bit versions of MongoDB are only suitable for testing and development systems, because they are limited to less than 2 GB of storage capacity.

Unzip the downloaded archive into a folder such as c:\mongodb\.

MongoDB requires a data folder to store its files. The default location is c:\data\db, and the folder must exist before the server starts.


Then to start MongoDB, we need to execute mongod.exe from the command prompt (c:\mongodb\bin\mongod.exe) as shown in the following screenshot:


You can replace the default c:\data\db with an alternate path by using the --dbpath option of mongod.exe, as in the following example:

C:\mongodb\bin\mongod.exe --dbpath c:\mongodb\data\
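
The same command line can be assembled from Python, which is convenient when a script has to prepare the data folder first. The following is only a minimal sketch: the helper name build_mongod_command is hypothetical, and the only mongod option it uses is the --dbpath option described above.

```python
import os


def build_mongod_command(mongod_path, dbpath):
    """Create the data folder if needed and return the mongod argument list.

    build_mongod_command is a hypothetical helper, not part of MongoDB.
    """
    # mongod does not create the data folder itself, so make sure it exists.
    os.makedirs(dbpath, exist_ok=True)
    # Return a list so it can be handed to subprocess without shell quoting.
    return [mongod_path, "--dbpath", dbpath]
```

With the paths used in this section, build_mongod_command(r"C:\mongodb\bin\mongod.exe", r"c:\mongodb\data") returns a list that could be passed to subprocess.Popen to start the server.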

You can get the full list of command-line options by running mongod with the --help option:

C:\mongodb\bin\mongod.exe --help

Finally, just execute mongo.exe and the Mongo browser shell is ready to use, as shown in the following screenshot:



MongoDB runs on the localhost interface and port 27017 by default. To change the port, use the --port option of the mongod command.
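
Before opening the shell, it can help to confirm that something is actually listening on that host and port. The sketch below uses only the Python standard library; the function name is_port_open is an assumption made for illustration, and a successful connection only shows that some server accepts TCP connections there, not that it is MongoDB.

```python
import socket


def is_port_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling is_port_open() checks the default address; pass a different port if mongod was started with --port.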

To check whether everything is installed correctly, just run the Mongo shell as shown in the following screenshot. Insert a record in the test collection and retrieve that record:

> db.test.insert( { a: 1 } )
> db.test.insert( { a: 100 } )
> db.test.find()

Connecting Python with MongoDB

The most popular module for working with MongoDB from Python is pymongo. It can be installed on Linux using pip, as shown in the following command:

$ pip install pymongo


You may have multiple versions of Python installed. In that case, you may want to create a Python 3 virtualenv and install packages after activating it.

Installing python-virtualenv:

$ sudo apt-get install python-virtualenv

Setting up the virtualenv:

$ virtualenv -p /usr/bin/python3 py3env
$ source py3env/bin/activate

Installing packages for Python 3:

$ pip install "package-name"
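
To confirm that the virtualenv is really active before installing packages, you can ask the interpreter itself. This is a small sketch with a hypothetical function name, in_virtualenv. It relies on sys.real_prefix, which older virtualenv releases set, and on sys.prefix differing from sys.base_prefix, which is the case for venv and newer virtualenv releases.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print(sys.executable)   # path of the interpreter currently in use
print(in_virtualenv())
```

Run it with the python command that is on your path after activation; the first printed line shows which interpreter pip will install packages for.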

On Windows, we can install pymongo using easy_install by opening a command prompt and executing the following command:

C:\> easy_install pymongo

To check whether everything is installed correctly, just run the Python shell as shown in the following code. Insert a record in the rows collection (referenced here by the test_rows variable) and retrieve that record:

>>> from pymongo import MongoClient
>>> con = MongoClient()  # connects to localhost:27017 by default
>>> db = con.test
>>> test_row = {'a': '200'}
>>> test_rows = db.rows
>>> inserted = test_rows.insert_one(test_row)  # insert() was removed in PyMongo 4
>>> result = test_rows.find()
>>> for x in result: print(x)
...
{'_id': ObjectId('5150c46b042a1824a78468b5'), 'a': '200'}
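
The insert-then-retrieve check can also be wrapped in a small function for use in scripts. The sketch below assumes PyMongo 3.0 or later, where insert_one returns a result object with an inserted_id attribute and find_one accepts a filter document. The helper name insert_and_fetch is hypothetical, and the function works with any object that offers those two methods.

```python
def insert_and_fetch(collection, document):
    """Insert a document and return the stored copy, looked up by its _id.

    `collection` is expected to behave like a pymongo Collection, that is,
    to provide insert_one() and find_one(). This helper is illustrative only.
    """
    # insert_one returns a result object carrying the generated _id.
    result = collection.insert_one(document)
    # Query by that _id so only the document just inserted is returned.
    return collection.find_one({"_id": result.inserted_id})
```

With a running server, insert_and_fetch(MongoClient().test.rows, {'a': '200'}) would return the stored document, including the _id field that MongoDB adds.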