Getting Started with Python Data Analysis

Overview of this book

Data analysis is the process of applying logical and analytical reasoning to study each component of data. Python is a multi-domain, high-level programming language. It is often used as a scripting language because of its forgiving syntax and its interoperability with a wide variety of ecosystems. Python has powerful third-party libraries and toolkits, such as Pylearn2 and Hebel, which offer a fast, reliable, cross-platform environment for data analysis. This book gets you started with Python data analysis and shows you what its advantages are. It starts by introducing the principles of data analysis and the supporting libraries, along with NumPy basics for statistics and data processing. Next, it provides an overview of the Pandas package and uses its powerful features to solve data processing problems. Moving on, the book takes you through a brief overview of the matplotlib API and some common plotting functions for DataFrame, such as plot. It then teaches you to manipulate date and time data, and to load and store data in a file or database using Python packages. The book also shows, through examples, how to apply powerful Python packages to turn raw data into clean, useful data. Finally, it gives you a brief overview of machine learning algorithms, that is, applying data analysis results to make decisions or build helpful products, such as recommendations and predictions, using scikit-learn.

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Introducing Data Analysis and Libraries

Data analysis and processing

An overview of the libraries in data analysis

Python libraries in data analysis

Summary

NumPy Arrays and Vectorized Computation

NumPy arrays

Array functions

Data processing using arrays

Linear algebra with NumPy

NumPy random numbers

Summary

Data Analysis with Pandas

An overview of the Pandas package

The Pandas data structure

The essential basic functionality

Indexing and selecting data

Computational tools

Working with missing data

Advanced uses of Pandas for data analysis

Summary

Data Visualization

The matplotlib API primer

Exploring plot types

Legends and annotations

Plotting functions with Pandas

Additional Python data visualization tools

Summary

Time Series

Time series primer

Working with date and time objects

Resampling time series

Downsampling time series data

Upsampling time series data

Time zone handling

Timedeltas

Time series plotting

Summary

Interacting with Databases

Interacting with data in text format

Interacting with data in binary format

Interacting with data in MongoDB

Interacting with data in Redis

Summary

Data Analysis Application Examples

Data munging

Data aggregation

Grouping data

Summary

Machine Learning Models with scikit-learn

An overview of machine learning models

The scikit-learn modules for different models

Data representation in scikit-learn

Supervised learning – classification and regression

Unsupervised learning – clustering and dimensionality reduction

Measuring prediction performance

Summary

Index

Preface

The world generates data at an increasing pace. Consumers, sensors, or scientific experiments emit data points every day. In finance, business, administration and the natural or social sciences, working with data can make up a significant part of the job. Being able to efficiently work with small or large datasets has become a valuable skill.

There are a variety of applications to work with data, from spreadsheet applications, which are widely deployed and used, to more specialized statistical packages for experienced users, which often support domain-specific extensions for experts.

Python started as a general-purpose language. It has been used in industry for a long time, but it has been popular among researchers as well. Around ten years ago, in 2006, the first version of NumPy was released, which made Python a first-class language for numerical computing and laid the foundation for the thriving development that led to what we today call the PyData ecosystem: a growing set of high-performance libraries to be used in the sciences, finance, business, or anywhere else you want to work efficiently with datasets.

In contrast to more specialized applications and environments, Python is not only about data analysis. The list of industrial-strength libraries for many general computing tasks is long, which makes working with data in Python even more compelling. Whether your data lives inside SQL or NoSQL databases or is out there on the Web and must be crawled or scraped first, the Python community has already developed packages for many of those tasks.

And the outlook seems bright. Working with bigger datasets is getting simpler, and sharing research findings and creating interactive programming notebooks has never been easier. It is the perfect moment to learn about data analysis in Python. This book lets you get started with a few core libraries of the PyData ecosystem: NumPy, Pandas, and matplotlib. In addition, two NoSQL databases are introduced. The final chapter takes a quick tour through one of the most popular machine learning libraries in Python.

We hope you find Python a valuable tool for your everyday data work and that we can give you enough material to get productive in the data analysis space quickly.

What this book covers

Chapter 1, Introducing Data Analysis and Libraries, describes the typical steps involved in a data analysis task. In addition, a couple of existing data analysis software packages are described.

Chapter 2, NumPy Arrays and Vectorized Computation, dives right into the core of the PyData ecosystem by introducing the NumPy package for high-performance computing. The basic data structure is a typed multidimensional array which supports various functions, among them typical linear algebra tasks. The data structure and functions are explained along with examples.

Chapter 3, Data Analysis with Pandas, introduces a prominent and popular data analysis library for Python called Pandas. It is built on NumPy, but makes a lot of real-world tasks simpler. Pandas comes with its own core data structures, which are explained in detail.

Chapter 4, Data Visualization, focuses on another important aspect of data analysis: the understanding of data through graphical representations. The matplotlib library is introduced in this chapter. It is one of the most popular 2D plotting libraries for Python and it is well integrated with Pandas as well.

Chapter 5, Time Series, shows how to work with time-oriented data in Pandas. Date and time handling can quickly become a difficult, error-prone task when implemented from scratch. We show how Pandas can be of great help there, by looking in detail at some of the functions for date parsing and date sequence generation.

Chapter 6, Interacting with Databases, deals with some typical scenarios. Your data does not live in a vacuum, and it might not always be available as CSV files either. MongoDB is a NoSQL database and Redis is a data structure server, although many people think of it as a key-value store first. Both storage systems are introduced to help you interact with data from real-world systems.

Chapter 7, Data Analysis Application Examples, applies many of the things covered in the previous chapters to deepen your understanding of typical data analysis workflows. How do you clean, inspect, reshape, merge, or group data? These are the concerns of this chapter. The library of choice will again be Pandas.

Chapter 8, Machine Learning Models with scikit-learn, introduces a popular machine learning package for Python. While it supports dozens of models, we only look at four: two supervised and two unsupervised. Although this is not stated explicitly, the chapter brings together many of the tools from earlier chapters. Pandas is often used to prepare data for machine learning, and matplotlib is used to create plots that aid understanding.
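To give a first impression of what the chapter summaries above describe, here is a minimal sketch of a typed NumPy array, one linear algebra task, and the Pandas functions for date parsing and date sequence generation. The sample values and variable names are made up for illustration and are not taken from the book's chapters.

```python
import numpy as np
import pandas as pd

# A 2x2 array with an explicit element type (illustrative values).
a = np.array([[2.0, 0.0], [0.0, 4.0]], dtype=np.float64)
b = np.array([2.0, 8.0])

# Element-wise arithmetic applies to every entry without a Python loop.
doubled = a * 2

# A typical linear algebra task: find x such that a.dot(x) equals b.
x = np.linalg.solve(a, b)

# Date parsing: turn a date string into a Pandas Timestamp.
ts = pd.to_datetime("2015-08-01")

# Date sequence generation: three consecutive calendar days.
days = pd.date_range(start="2015-08-01", periods=3, freq="D")
```

Printing a.dtype, x, ts, or days in an interactive shell is a quick way to inspect each result.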

What you need for this book

There are not too many requirements to get started. You will need a Python programming environment installed on your system. Under Linux and Mac OS X, Python is usually installed by default. Installation on Windows is supported by an excellent installer provided and maintained by the community.

This book uses a recent Python 2, but many examples will work with Python 3 as well.

The versions of the libraries used in this book are the following: NumPy 1.9.2, Pandas 0.16.2, matplotlib 1.4.3, tables 3.2.2, pymongo 3.0.3, redis 2.10.3, and scikit-learn 0.16.1. As these packages are all hosted on PyPI, the Python package index, they can be easily installed with pip. To install NumPy, you would write:

$ pip install numpy

If you are not using them already, we suggest you take a look at virtual environments for managing isolated Python environments on your computer. For Python 2, there are two packages of interest: virtualenv and virtualenvwrapper. Since Python 3.3, the standard library has included the venv module (https://docs.python.org/3/library/venv.html), originally exposed through the pyvenv script, which serves the same purpose.
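As a rough illustration of the standard library route, the following sketch creates an isolated environment from Python itself. It assumes Python 3.3 or later; the function name create_environment and the directory name used below are hypothetical and chosen only for this example.

```python
import os
import venv


def create_environment(env_dir):
    """Create an isolated environment in env_dir; return the path of its config file."""
    # with_pip=False keeps this sketch quick; pass with_pip=True to bootstrap pip as well.
    venv.create(env_dir, with_pip=False)
    # Every environment created by venv contains a pyvenv.cfg file at its root.
    return os.path.join(env_dir, "pyvenv.cfg")
```

Calling create_environment("analysis-env") would create the directory analysis-env in the current working directory. The same module can also be run from the command line as python3 -m venv analysis-env.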

Most libraries will have an attribute for the version, so if you already have a library installed, you can quickly check its version:

>>> import redis
>>> redis.__version__
'2.10.3'

This works well for most libraries. A few, such as pymongo, use a different attribute (pymongo uses just version, without the underscores).
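If you check several libraries at once, the two attribute names mentioned above can be tried in turn. The helper below is a small sketch; its name, library_version, is an assumption made for this example and not part of any library.

```python
def library_version(module):
    """Return the version string of an imported module, or None if none is found."""
    # Try the common attribute first, then the plain name that pymongo uses.
    for attribute in ("__version__", "version"):
        value = getattr(module, attribute, None)
        if isinstance(value, str):
            return value
    return None
```

With the libraries from this book installed, you would import redis and pymongo and pass each module to library_version to get its installed version string.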

While all the examples can be run interactively in a Python shell, we recommend using IPython. IPython started as a more versatile Python shell, but has since evolved into a powerful tool for exploration and sharing. We used IPython 4.0.0 with Python 2.7.10. IPython is a great way to work interactively with Python, be it in the terminal or in the browser.

Who this book is for

We assume you have been exposed to programming and Python, and that you want to broaden your horizons and get to know some key libraries in the data analysis field. We think that people with different backgrounds can benefit from this book. If you work in business, finance, or research and development at a lab or university, or if your work contains any data processing or analysis steps and you want to know what Python has to offer, then this book can be of help. If you want to get started with basic data processing tasks or time series, you will find a lot of hands-on knowledge in the examples of this book. The strength of this book is its breadth. While we cannot dive very deep into any single package (although we will use Pandas extensively), we hope to convey a bigger picture: how the different parts of the Python data ecosystem work, and can work together, to form one of the most innovative and engaging programming environments.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

A block of code is set as follows:

>>> import numpy as np
>>> np.random.randn()

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

>>> import pandas as pd

Any command-line input or output is written as follows:

$ cat "data analysis" | wc -l

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "clicking the Next button moves you to the next screen".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
