Mastering Python for Data Science

By: Samir Madhavan

Data cleansing


Data in its raw form generally requires some cleaning before it can be analyzed or used to build a dashboard. There are many reasons why data might have issues. For example, the point-of-sale system at a retail shop might malfunction and record entries with missing values. We'll learn how to handle such data in the following sections.

Checking the missing data

Generally, most data will have some missing values. There could be various reasons for this: the source system might not have collected the values, or the values may never have existed. Once the data is loaded, it is essential to check for missing elements. Depending on the requirements, the missing data can be handled by removing the affected rows or by replacing the missing values with alternative values.

In the Student Weight data, to check whether the Location 1 column has missing values, the following command can be utilized:

>>> d['Location 1'].isnull()
0       False
1       False
2       False
3       False
4       False
5       False
6       False

The isnull() method outputs True for each row where the value is missing and False otherwise (the complementary notnull() method reverses this). These Boolean values can be aggregated to find the number of missing values:

>>> d['Location 1'].isnull().value_counts()
False    3246
True       24
dtype: int64

The preceding command shows that the Location 1 column has 24 missing values. These can be handled by either removing the rows that contain them or replacing them with alternative values. To remove the rows where Location 1 is missing, execute the following command:

>>> d = d.dropna(subset=['Location 1'])

To remove all rows that contain any missing value, use the following command:

>>> d = d.dropna(how='any')
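Beyond how='any', dropna() accepts a few other useful parameters. The following sketch uses a small made-up frame (not the student-weight data) to show how='all', which drops a row only when every column is missing, and subset, which considers only the named columns:

```python
import pandas as pd

# A small frame with scattered missing values (hypothetical data,
# standing in for the student-weight dataset).
d = pd.DataFrame({'Location 1': ['a', None, 'c', None],
                  'COUNTY': ['ALBANY', 'ALBANY', None, None]})

# Drop rows where *any* column is missing.
any_dropped = d.dropna(how='any')

# Drop rows only when *every* column is missing.
all_dropped = d.dropna(how='all')

# Drop rows based only on the Location 1 column.
subset_dropped = d.dropna(subset=['Location 1'])
```

Here any_dropped keeps only the one fully populated row, all_dropped discards just the row where both columns are null, and subset_dropped keeps every row that has a Location 1 value.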

Filling the missing data

Let's define a DataFrame to work with:

>>> df = pd.DataFrame(np.random.randn(5, 3), index=['a0', 'a10', 'a20', 'a30', 'a40'],
                  columns=['X', 'Y', 'Z'])
>>> df
            X         Y         Z
a0  -0.854269  0.117540  1.515373
a10 -0.483923 -0.379934  0.484155
a20 -0.038317  0.196770 -0.564176
a30  0.752686  1.329661 -0.056649
a40 -1.383379  0.632615  1.274481

We'll now reindex with some extra row labels, which will create null values in a new DataFrame, df2 (note that the random values shown differ from the earlier output because the DataFrame was re-created with new random numbers):

>>> df2 = df.reindex(['a0', 'a1', 'a10', 'a11', 'a20', 'a21', 'a30', 'a31', 'a40', 'a41'])
>>> df2

            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1        NaN       NaN       NaN
a10  1.413044  0.615997  0.947334
a11       NaN       NaN       NaN
a20  1.583516  1.388921  0.458771
a21       NaN       NaN       NaN
a30  0.479579  1.427625  1.407924
a31       NaN       NaN       NaN
a40  0.455510 -0.880937  1.375555
a41       NaN       NaN       NaN

If you want to replace the null values in the df2 DataFrame with zeros, execute the following command:

>>> df2.fillna(0)

            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1   0.000000  0.000000  0.000000
a10  1.413044  0.615997  0.947334
a11  0.000000  0.000000  0.000000
a20  1.583516  1.388921  0.458771
a21  0.000000  0.000000  0.000000
a30  0.479579  1.427625  1.407924
a31  0.000000  0.000000  0.000000
a40  0.455510 -0.880937  1.375555
a41  0.000000  0.000000  0.000000

If you want to fill the null values with forward propagation, where each null value is filled with the last valid value above it in the column, the following command can be used:

>>> df2.fillna(method='pad') #filling with forward propagation

            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1  -1.193371  0.912654 -0.780461
a10  1.413044  0.615997  0.947334
a11  1.413044  0.615997  0.947334
a20  1.583516  1.388921  0.458771
a21  1.583516  1.388921  0.458771
a30  0.479579  1.427625  1.407924
a31  0.479579  1.427625  1.407924
a40  0.455510 -0.880937  1.375555
a41  0.455510 -0.880937  1.375555
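The complementary operation is a backward fill, which pulls each null value from the next valid value below it. A minimal sketch on a made-up Series, using the ffill()/bfill() shorthands (equivalent to fillna(method='pad') and fillna(method='bfill')):

```python
import pandas as pd
import numpy as np

# A tiny series with gaps (made-up values).
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # same as fillna(method='pad'): gaps take the value above
backward = s.bfill()  # gaps take the next valid value below
```

The forward fill yields 1.0, 1.0, 1.0, 4.0, while the backward fill yields 1.0, 4.0, 4.0, 4.0.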

If you want to fill the null values of the column with the column mean, then the following command can be utilized:

>>> df2.fillna(df2.mean())

            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1   0.547655  0.692852  0.681825
a10  1.413044  0.615997  0.947334
a11  0.547655  0.692852  0.681825
a20  1.583516  1.388921  0.458771
a21  0.547655  0.692852  0.681825
a30  0.479579  1.427625  1.407924
a31  0.547655  0.692852  0.681825
a40  0.455510 -0.880937  1.375555
a41  0.547655  0.692852  0.681825
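fillna() can also take a dict that maps column names to fill values, so each column gets its own replacement. A small sketch with made-up values:

```python
import pandas as pd
import numpy as np

# A small frame with holes (made-up values).
df = pd.DataFrame({'X': [1.0, np.nan, 3.0],
                   'Y': [np.nan, 2.0, np.nan]})

# A dict applies a different fill value to each column:
# zeros for X, the column mean for Y.
filled = df.fillna({'X': 0.0, 'Y': df['Y'].mean()})
```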

String operations

Sometimes, you might want to modify a string column in your data. The following techniques demonstrate some common string operations:

  • Substring: Let's start by choosing the first five rows of the AREA NAME column in the data as our sample data to modify:

    >>> df = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
    >>> df['AREA NAME'][0:5]
    
    0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
    1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
    2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
    3                        COHOES CITY SCHOOL DISTRICT
    4                        COHOES CITY SCHOOL DISTRICT
    Name: AREA NAME, dtype: object
    

    In order to extract the first word from the AREA NAME column, we'll use the extract() method as shown in the following command:

    >>> df['AREA NAME'][0:5].str.extract(r'(\w+)')
    
    0    RAVENA
    1    RAVENA
    2    RAVENA
    3    COHOES
    4    COHOES
    Name: AREA NAME, dtype: object
    

    In the preceding command, the str attribute of the Series is utilized. It provides an extract() method that accepts a regular expression, which is very powerful. It is also possible to extract the second word of AREA NAME into a separate column:

    >>> df['AREA NAME'][0:5].str.extract(r'(\w+)\s(\w+)')
            0         1
    0  RAVENA  COEYMANS
    1  RAVENA  COEYMANS
    2  RAVENA  COEYMANS
    3  COHOES      CITY
    4  COHOES      CITY
    

    To extract data into different columns, each portion needs its own capture group, enclosed in separate parentheses in the regular expression.

  • Filtering: If we want to filter the rows where GRADE LEVEL is ELEMENTARY, the following command can be used:

    >>> df[df['GRADE LEVEL'] == 'ELEMENTARY']
    
  • Uppercase: To convert AREA NAME to uppercase, we'll use the following command:

    >>> df['AREA NAME'][0:5].str.upper()
    0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
    1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
    2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
    3                        COHOES CITY SCHOOL DISTRICT
    4                        COHOES CITY SCHOOL DISTRICT
    Name: AREA NAME, dtype: object
    

    Since the strings are already in uppercase, there won't be any visible difference.

  • Lowercase: To convert AREA NAME to lowercase, we'll use the following command:

    >>> df['AREA NAME'][0:5].str.lower()
    0    ravena coeymans selkirk central school district
    1    ravena coeymans selkirk central school district
    2    ravena coeymans selkirk central school district
    3                        cohoes city school district
    4                        cohoes city school district
    Name: AREA NAME, dtype: object
    
  • Length: To find the length of each element of the AREA NAME column, we'll use the following command:

    >>> df['AREA NAME'][0:5].str.len()
    0    47
    1    47
    2    47
    3    27
    4    27
    Name: AREA NAME, dtype: int64
    
  • Split: To split AREA NAME on whitespace, we'll use the following command:

    >>> df['AREA NAME'][0:5].str.split(' ')
    
    0    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
    1    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
    2    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
    3                     [COHOES, CITY, SCHOOL, DISTRICT]
    4                     [COHOES, CITY, SCHOOL, DISTRICT]
    Name: AREA NAME, dtype: object
    
  • Replace: If we want to replace area names ending with DISTRICT with DIST, the following command can be used:

    >>> df['AREA NAME'][0:5].str.replace('DISTRICT$', 'DIST')
    
    0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
    1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
    2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
    3                        COHOES CITY SCHOOL DIST
    4                        COHOES CITY SCHOOL DIST
    Name: AREA NAME, dtype: object
    

    The first argument of the replace method is the regular expression that identifies the portion of the string to replace; the second argument is the replacement value.
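Two more str tricks worth knowing, sketched here on a made-up sample: extract() with named groups, which labels the resulting columns directly, and replace() with case=False, which matches the pattern regardless of letter case:

```python
import pandas as pd

# A small sample standing in for the AREA NAME column (made-up rows).
s = pd.Series(['RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT',
               'cohoes city school district'])

# Named groups become the column labels of the extracted DataFrame.
parts = s.str.extract(r'(?P<first>\w+)\s(?P<second>\w+)')

# case=False makes the pattern match regardless of letter case;
# the replacement string itself is inserted literally.
shortened = s.str.replace('DISTRICT$', 'DIST', case=False, regex=True)
```

After this, parts has columns named first and second, and shortened ends both rows with DIST even though the second row was lowercase.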

Merging data

To combine datasets, the concat function of pandas can be utilized. Let's take the first five rows of the AREA NAME and COUNTY columns:

>>> d[['AREA NAME', 'COUNTY']][0:5]

                                 AREA NAME            COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
3                      COHOES CITY SCHOOL DISTRICT    ALBANY
4                      COHOES CITY SCHOOL DISTRICT    ALBANY

We can divide the data as follows:

>>> p1 = d[['AREA NAME', 'COUNTY']][0:2]
>>> p2 = d[['AREA NAME', 'COUNTY']][2:5]

The first two rows of the data are in p1 and the last three rows are in p2. These pieces can be combined using the concat() function:

>>> pd.concat([p1,p2])

                                 AREA NAME            COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
3                      COHOES CITY SCHOOL DISTRICT    ALBANY
4                      COHOES CITY SCHOOL DISTRICT    ALBANY

The combined pieces can be identified by assigning keys:

>>> concatenated = pd.concat([p1,p2], keys = ['p1','p2'])
>>> concatenated
                                            AREA NAME  COUNTY
p1 0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
   1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
p2 2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
   3                      COHOES CITY SCHOOL DISTRICT  ALBANY
   4                      COHOES CITY SCHOOL DISTRICT  ALBANY

Using the keys, the pieces can be extracted back from the concatenated data:

>>> concatenated.loc['p1']

                                        AREA NAME     COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT    ALBANY
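concat() stacks pieces on top of (or beside) each other; for combining datasets on a shared key column, pandas also provides merge(), which performs SQL-style joins. A minimal sketch with made-up frames (the POPULATION figure is invented for illustration):

```python
import pandas as pd

# Two small frames sharing a COUNTY key column (made-up data).
schools = pd.DataFrame({'AREA NAME': ['COHOES CITY', 'RAVENA'],
                        'COUNTY': ['ALBANY', 'ALBANY']})
population = pd.DataFrame({'COUNTY': ['ALBANY'],
                           'POPULATION': [300000]})

# merge joins rows wherever the COUNTY values match, like a SQL join.
joined = pd.merge(schools, population, on='COUNTY')
```

Both school rows match the single ALBANY row, so joined has two rows, each carrying the POPULATION value alongside the original columns.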