Python Data Science Essentials

Python Data Science Essentials - Second Edition

By: Luca Massaron, Alberto Boschetti

Overview of this book

Fully expanded and upgraded, the second edition of Python Data Science Essentials takes you through all you need to know to succeed in data science using Python. Get modern insight into the core of Python data, including the latest versions of Jupyter notebooks, NumPy, pandas and scikit-learn. Look beyond the fundamentals with beautiful data visualizations with Seaborn and ggplot, web development with Bottle, and even the new frontiers of deep learning with Theano and TensorFlow. Dive into building your essential Python 3.5 data science toolbox, using a single-source approach that will allow you to work with Python 2.7 as well. Get to grips fast with data munging and preprocessing, and all the techniques you need to load, analyse, and process your data. Finally, get a complete overview of principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.

Your learning list

Here are the basic Python data structures and constructs that you need to learn to be proficient as a data scientist. Leaving aside the real basics (numbers, arithmetic, strings, Booleans, variable assignments, and comparisons), the list is indeed short. We will deal with it briefly, touching only upon the structures that recur in data science projects. Remember that the topics are quite challenging, but they are necessary if you want to write effective code:

Lists
Dictionaries
Classes, objects, and Object-Oriented Programming (OOP)
Exceptions
Iterators and generators
Conditionals
Comprehensions
Functions

Take it as a refresher or a learning list depending on your actual knowledge of the Python language. However, examine all the proposed examples because you will come across them again during the course of the book.

Lists

Lists are collections of elements. Elements can be integers, floats, strings, or generically, objects. Moreover, you can mix different types together. Besides, lists are more flexible than arrays since arrays allow only a single datatype.

To create a list, you can either use the square brackets or the list() constructor, as follows:

a_list = [1, 2.3, 'a', True]
an_empty_list = list()

The following are some handy methods that you can remember while working with lists:

To access the ith element, use the [] notation:

Note

Remember that lists are indexed from 0 (zero); that is, the first element is in position 0.

         a_list[1] 
         # prints 2.3 
         a_list[1] = 2.5 
         # a_list is now [1, 2.5, 'a', True]

You can slice lists by pointing out a starting and ending point (the ending point is not included in the resulting slice), as follows:

        a_list[1:3] # prints [2.5, 'a']

You can slice with a step by using the colon-separated start:end:skip notation, which takes one element every skip positions, as follows:

        a_list[::2] 
        # returns every second element, starting from the first: [1, 'a'] 
        a_list[::-1] 
        # returns the reverse of the list: [True, 'a', 2.5, 1]

To append an element at the end of the list, you can use append():

        a_list.append(5) 
        # a_list is now [1, 2.5, 'a', True, 5]

To get the length of the list, use the len() function, as follows:

        len(a_list) 
        # prints 5

To delete an element, use the del statement followed by the element that you wish to remove:

        del a_list[0] 
        # a_list is now [2.5, 'a', True, 5]

To concatenate two lists, use + (or += to extend a list in place), as follows:

        a_list += [1, 'b'] 
        # a_list is now [2.5, 'a', True, 5, 1, 'b']

You can unpack lists by assigning lists to a list (or simply a sequence) of variables instead of a single variable:

        a,b,c,d,e,f = [2.5, 'a', True, 5, 1, 'b'] 
        # a now is 2.5, b is 'a' and so on

Remember that lists are mutable data structures; you can always append, remove, and modify elements. Immutable lists are called tuples and are denoted with round parentheses, ( and ), instead of the square brackets as in the list, [ and ]:

tuple(a_list) 
# prints (2.5, 'a', True, 5, 1, 'b')
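The difference between mutable lists and immutable tuples can be checked directly. The following is a minimal sketch; the names a_tuple and tuple_was_modified are chosen only for this illustration:

```python
a_list = [2.5, 'a', True, 5, 1, 'b']
a_tuple = tuple(a_list)

# Lists accept item assignment...
a_list[0] = 'changed'

# ...while tuples reject it by raising a TypeError.
try:
    a_tuple[0] = 'changed'
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

print(a_list[0])           # the list was updated in place
print(a_tuple[0])          # the tuple still holds its original first element
print(tuple_was_modified)
```

Note that converting the list to a tuple creates a new object, so later changes to the list do not affect the tuple.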

Dictionaries

Dictionaries are tables that can look up values very quickly because each key is associated with a value. It is really like using the index of a book to jump immediately to the content you need. Keys and values can belong to different data types. The only requirement for keys is that they should be hashable (that's a fairly complex concept; simply keep the keys as simple as possible and, therefore, don't try to use a dictionary or a list as a key).

To create a dictionary, you can use curly brackets, as follows:

b_dict = {1: 1, '2': '2', 3.0: 3.0}

The following are some handy methods that you can remember while working with dictionaries:

To access the value indexed by the k key, use the [] notation, as follows:

        b_dict['2'] 
        # prints '2'  
        b_dict['2'] = '2.0' 
        # b_dict is now {1: 1, '2': '2.0', 3.0: 3.0}

To insert or replace a value for a key, use the [] notation again:

        b_dict['a'] = 'a' 
        # b_dict is now {1: 1, '2': '2.0', 3.0: 3.0, 'a': 'a'}

To get the number of elements in the dictionary, use the len() function, as follows:

       len(b_dict) 
       # prints 4

To delete an element, use the del statement followed by the element that you wish to remove:

        del b_dict[3.0] 
        # b_dict is now {1: 1, '2': '2.0', 'a': 'a'}

Remember that dictionaries, like lists, are mutable data structures. Also remember that if you try to access an element whose key doesn't exist, a KeyError exception will be raised:

b_dict['a_key'] 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
KeyError: 'a_key'

The obvious solution to this is to always check first whether an element is in the dictionary:

if 'a_key' in b_dict: 
    b_dict['a_key'] 
else: 
    print("'a_key' is not present in the dictionary")

Otherwise, you can use the .get method. If the key is in the dictionary, it returns its value; otherwise, it returns None:

b_dict.get('a_key')
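The get method also accepts an optional second argument, which is returned in place of None when the key is missing. A short sketch, with variable names chosen only for illustration:

```python
b_dict = {1: 1, '2': '2.0', 'a': 'a'}

# Without a default, a missing key gives None.
missing = b_dict.get('a_key')

# With a default, a missing key gives the supplied fallback value.
fallback = b_dict.get('a_key', 'not found')

# An existing key returns its stored value; the default is ignored.
present = b_dict.get('a', 'not found')

print(missing, fallback, present)
```

Unlike defaultdict, get never adds the missing key to the dictionary.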

Finally, you can use a data structure from the collections module, called defaultdict. It never raises a KeyError because it is instantiated with a function that takes no arguments and provides the default value for any nonexistent key you request:

from collections import defaultdict 
c_dict = defaultdict(lambda: 'empty') 
c_dict['a_key'] 
# requesting a nonexistent key will always return the string 'empty'

The default function to be used by defaultdict can be defined using a def or lambda command, as described in the following section.
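As a sketch of the def variant, the default can come from a named function; a built-in type such as int also works, because calling int() with no arguments returns 0. The names default_label and word_counts are hypothetical and used only for this example:

```python
from collections import defaultdict

def default_label():
    # Takes no arguments and returns the value used for missing keys.
    return 'empty'

c_dict = defaultdict(default_label)
value = c_dict['a_key']  # the missing key is created with the default value

# A common use: counting occurrences without checking for the key first.
word_counts = defaultdict(int)
for word in ['alpha', 'bravo', 'alpha']:
    word_counts[word] += 1

print(value)
print(dict(word_counts))
```

Be aware that simply reading a missing key from a defaultdict stores that key in the dictionary.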

Defining functions

Functions are ensembles of instructions that usually receive specific inputs from you and provide a set of specific outputs related to these inputs.

You can define them as one-liners, as follows:

def half(x): return x / 2.0

You can also define them as a set of many instructions in the following way:

import math 
def sigmoid(x): 
    try: 
        return 1.0 / (1 + math.exp(-x)) 
    except OverflowError: 
        # math.exp overflows for large negative x; return the limit value
        if x < 0: 
            return 0.0 
        else: 
            return 1.0

Finally, you can define an anonymous function on the fly by using a lambda function. Think of lambdas as simple functions that you can define inline anywhere in the code, without using the "verbose" constructor for functions (the one starting with def). Just write lambda followed by its input parameters; then a colon signals the beginning of the expression to be evaluated by the lambda function, which has to be on the same line. (There is no return command; the value of the expression is what the lambda function returns.)

You can use a lambda function as a parameter in another function, as seen previously for defaultdict, or you can use it in order to express a function in one line. This is the case in our example, where we define a function returning a lambda function incorporating the parameters of the first one:

def sum_a_const(c):  
   return lambda x: x+c 
 
sum_2 = sum_a_const(2) 
sum_3 = sum_a_const(3) 
print (sum_2(2)) 
print (sum_3(2)) 
# prints 4 and 5
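Passing a lambda as a parameter to another function is also common with built-ins such as sorted, whose key argument expects a one-argument function. The following sketch uses made-up data (records) purely for illustration:

```python
records = [('charlie', 3), ('alpha', 1), ('bravo', 2)]

# Sort the pairs by their second element, using a lambda as the key function.
by_number = sorted(records, key=lambda pair: pair[1])

# Keep only the names, in the sorted order.
names = [name for name, _ in by_number]

print(by_number)
print(names)
```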

To invoke a function, write the function name, followed by its parameters within parentheses:

half(10) 
# prints 5.0 
sigmoid(0) 
# prints 0.5

By using functions, you bundle repetitive procedures by formalizing their inputs and outputs, without letting their calculations interfere in any way with the execution of the main program. In fact, unless you declare a variable as global, all the variables you use inside your function are discarded when it ends, and your main program receives only what has been returned by the return command.

Note

By the way, please be aware that if you pass a list to a function (this applies to lists and other mutable objects, but not to simple variables such as numbers or strings), any change the function makes to the list will be visible outside the function, even if the list is not returned, unless you copy it. In order to make a duplicate of a list, you can use the copy or deepcopy functions (to be imported from the copy module) or simply the operator [:] applied to your list.

Why does this happen? Because lists are data structures that are passed by reference, that is, by their address rather than as a copy of the entire object. So, when you pass a list to a function, you are just passing an address in your computer's memory, and the function will operate on that address by modifying your actual list:

a_list = [1, 2, 3, 4, 5] 
def modifier(L): 
    L[0] = 0  # changes the caller's list
def unmodifier(L): 
    M = L[:]  # work on a copy of the list
    M[0] = 1 
unmodifier(a_list) 
print(a_list) # you still have the original list, [1, 2, 3, 4, 5] 
modifier(a_list) 
print(a_list) # your list has been modified: [0, 2, 3, 4, 5]
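The [:] operator and copy.copy make a shallow copy: the outer list is new, but any nested lists are still shared with the original. When a list contains other lists, copy.deepcopy is the safer choice. A minimal sketch, with variable names chosen only for this example:

```python
import copy

nested = [[1, 2], [3, 4]]

shallow = nested[:]           # new outer list, same inner lists
deep = copy.deepcopy(nested)  # new outer list and new inner lists

nested[0][0] = 99             # change an inner list of the original
nested.append([5, 6])         # change the outer list of the original

print(shallow)  # sees the inner change, but not the appended element
print(deep)     # sees neither change
```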

Classes, objects, and OOP

Classes are collections of methods and attributes. Briefly, attributes are variables of the object (for example, each instance of the Employee class has its own name, age, salary, and benefits; all of them are attributes). Methods are simply functions that read or modify attributes (for example, to set the employee's name, to set his/her age, or to read this information from a database or a CSV file). To create a class, use the class keyword. In the following example, we will create a class for an incrementer. The purpose of this object is to keep track of the value of an integer and, when requested, increase it by 1:

class Incrementer(object): 
    def __init__(self): 
        print("Hello world, I'm the constructor") 
        self._i = 0

Everything within the def indentation is a class method. In this case, the method named __init__ sets the _i internal variable to zero (it looks exactly like a function described in the previous section). Look carefully at the method's definition. Its argument is self (this is the object itself), and every internal variable access is made through self. Moreover, __init__ is not just a method; it's the constructor (it's called when the object is created). In fact, when we build an Incrementer object, this method is automatically called, as follows:

i = Incrementer() 
# prints "Hello world, I'm the constructor"

Now, let's create the increment() method, which increments the _i internal counter and returns its new value. Within the class definition, include the method:

    def increment(self): 
        self._i += 1 
        return self._i

Then, run the following code:

i = Incrementer() 
print (i.increment()) 
print (i.increment()) 
print (i.increment())

The preceding code results in the following output:

Hello world, I'm the constructor 
1 
2 
3

Finally, let's see how to create methods that accept parameters. We will now create the set_counter method, which sets the _i internal variable.

Within the class definition, add the following code:

    def set_counter(self, counter): 
        self._i = counter

Then, run the following code:

i = Incrementer() 
i.set_counter(10) 
print (i.increment()) 
print (i._i)

The preceding code gives this output:

Hello world, I'm the constructor 
11 
11

Note

Note the last line of the preceding code, where you access the internal variable. Remember that in Python, all the internal attributes of the objects are public by default, and they can be read, written, and changed externally.
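Since the class was built up piece by piece, it may help to see it in one place. The following sketch simply assembles the three methods shown in this section; the variable names counter, first, and second are chosen only for this illustration:

```python
class Incrementer(object):
    def __init__(self):
        # The constructor: called automatically when the object is created.
        print("Hello world, I'm the constructor")
        self._i = 0

    def increment(self):
        # Increase the internal counter by 1 and return its new value.
        self._i += 1
        return self._i

    def set_counter(self, counter):
        # Overwrite the internal counter with the given value.
        self._i = counter


counter = Incrementer()
counter.set_counter(10)
first = counter.increment()
second = counter.increment()
print(first, second)
```

The leading underscore in _i is only a naming convention signalling "internal use"; as noted above, Python does not prevent external access.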

Exceptions

Exceptions and errors are strongly correlated, but they are different things. An exception, for example, can be gracefully handled. Here are some examples of exceptions:

0/0 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
ZeroDivisionError: division by zero 
 
len(1, 2) 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
TypeError: len() takes exactly one argument (2 given) 
 
pi * 2 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
NameError: name 'pi' is not defined

In this example, three different exceptions have been raised (see the last line of each block). To handle exceptions, you can use a try/except block in the following way:

try: 
    a = 10/0 
except ZeroDivisionError: 
    a = 0

You can use more than one except clause to handle more than one exception. Optionally, you can add a final catch-all except clause to handle every other exception. In this case, the structure is as follows:

try: 
    <code which can raise more than one exception> 
except KeyError: 
    print("There is a KeyError error in the code") 
except (TypeError, ZeroDivisionError): 
    print("There is a TypeError or a ZeroDivisionError error in the code") 
except: 
    print("There is another error in the code")

Finally, it is important to mention the last clause, finally, which is executed in all circumstances. It's very handy for clean-up operations (closing files, de-allocating resources, and so on), that is, the things that should be done regardless of whether an error has occurred or not. In this case, the code assumes the following shape:

try: 
    <code that can raise exceptions> 
except: 
    <optionally, more handlers for different exceptions> 
finally: 
    <clean-up code>
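To make the template concrete, here is a small runnable sketch. The function safe_divide and its log list are hypothetical, introduced only to show which clauses run in each case:

```python
def safe_divide(numerator, denominator):
    log = []
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        log.append('division by zero')
        result = None
    except TypeError:
        log.append('bad operand type')
        result = None
    finally:
        # Runs whether or not an exception was raised.
        log.append('done')
    return result, log


print(safe_divide(10, 2))
print(safe_divide(1, 0))
print(safe_divide('a', 2))
```

In every call, the log ends with 'done', which shows that the finally clause is executed regardless of the outcome.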

Iterators and generators

Looping through a list or a dictionary is very simple. Note that with dictionaries, the iteration is key-based, which is demonstrated in the following example:

for entry in ['alpha', 'bravo', 'charlie', 'delta']: 
  print (entry) 
# prints the content of the list, one entry per line 
 
a_dict = {1: 'alpha', 2: 'bravo', 3: 'charlie', 4: 'delta'} 
for key in a_dict: 
  print (key, a_dict[key]) 
 
# Prints: 
# 1 alpha 
# 2 bravo 
# 3 charlie 
# 4 delta

On the other hand, if you need to iterate through a sequence and generate objects on the fly, you can use a generator. A great advantage of doing this is that you don't have to create and store the complete sequence at the beginning. Instead, each object is built only when the generator is asked for its next value. As a simple example, let's create a generator for a number sequence without storing the complete list in advance:

def incrementer(): 
    i = 0 
    while i < 5: 
        yield i 
        i += 1 
 
for i in incrementer(): 
    print(i) 
 
# Prints: 
# 0 
# 1 
# 2 
# 3 
# 4
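The on-demand behavior can be observed by pulling values one at a time with the built-in next function. This sketch uses a variant of the generator above that takes a limit parameter (an addition made only for this illustration):

```python
def incrementer(limit):
    i = 0
    while i < limit:
        yield i  # execution pauses here until the next value is requested
        i += 1


gen = incrementer(3)
first = next(gen)        # runs the generator up to the first yield
second = next(gen)       # resumes it up to the second yield
remaining = list(gen)    # consumes whatever is left

# Once exhausted, next() returns the supplied default instead of raising StopIteration.
after_end = next(gen, 'exhausted')

print(first, second, remaining, after_end)
```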

Conditionals

Conditionals are often used in data science since they let you branch the program. The most frequently used one is the if statement. It works more or less the same as in other programming languages. Here's an example of it:

def is_positive(val): 
    if val < 0: 
        print("It is negative") 
    elif val > 0: 
        print("It is positive") 
    else: 
        print("It is exactly zero!") 
 
 
is_positive(-1) 
is_positive(1.5) 
is_positive(0) 
 
 
# Prints: 
# It is negative 
# It is positive 
# It is exactly zero!

The first condition is checked with if. If there are any other conditions, they are defined with elif (this stands for else-if). Finally, the default behavior is handled by else.

Note

Note that elif and else are optional.

Comprehensions for lists and dictionaries

With comprehensions, lists and dictionaries are built as one-liners with the use of an iterator and, when necessary, a conditional:

a_list = [1, 2, 3, 4, 5] 
another_list = ['a', 'b', 'c', 'd', 'e'] 
a_power_list = [value**2 for value in a_list] 
# the resulting list is [1, 4, 9, 16, 25] 
filter_even_numbers = [value**2 for value in a_list if value % 2 == 0] 
# the resulting list is [4, 16] 
a_dictionary = {key: value for value, key in zip(a_list, another_list)} 
# zip takes multiple lists as input and iterates through them in parallel, 
# pairing the elements that share the same index (first with first, and so on); 
# it stops at the end of the shortest list 
# the resulting dictionary is {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Comprehensions are a fast way to filter and transform the data present in any iterable.
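Filtering and transforming can be combined in a dictionary comprehension as well, and the same syntax without brackets (a generator expression) feeds values to a function without building an intermediate list. A brief sketch, with names chosen only for illustration:

```python
a_list = [1, 2, 3, 4, 5]
another_list = ['a', 'b', 'c', 'd', 'e']

# Dictionary comprehension with a condition: keep only the odd values, squared.
odd_squares = {key: value**2
               for key, value in zip(another_list, a_list)
               if value % 2 == 1}

# Generator expression: the squares are produced one at a time for sum().
total = sum(value**2 for value in a_list)

print(odd_squares)
print(total)
```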
