Data Science with Python

By: Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Overview of this book

Data Science with Python begins by introducing you to data science and teaches you to install the packages you need to create a data science coding environment. You will learn three major techniques in machine learning: unsupervised learning, supervised learning, and reinforcement learning. You will also explore basic classification and regression techniques, such as support vector machines, decision trees, and logistic regression. As you make your way through the book, you will understand the basic functions, data structures, and syntax of the Python language that are used to handle large datasets with ease. You will learn about NumPy and pandas libraries for matrix calculations and data manipulation, discover how to use Matplotlib to create highly customizable visualizations, and apply the boosting algorithm XGBoost to make predictions. In the concluding chapters, you will explore convolutional neural networks (CNNs), deep learning algorithms used to predict what is in an image. You will also understand how to feed human sentences to a neural network, make the model process contextual information, and create human language processing systems to predict the outcome. By the end of this book, you will be able to understand and implement any new data science algorithm and have the confidence to experiment with tools or libraries other than those covered in the book.
Table of Contents (10 chapters)

Chapter 2: Data Visualization

Activity 2: Line Plot

Solution:

  1. Create a list of six strings, one for each month from January through June, and save it as x using:

    x = ['January','February','March','April','May','June']

  2. Create a list of six values for 'Items Sold' that starts at 1000 and increases by 200, so that the final value is 2000, and save it as y as follows:

    y = [1000, 1200, 1400, 1600, 1800, 2000]

  3. Import Matplotlib's pyplot module as plt, then plot y ('Items Sold') by x ('Month') with a dotted blue line and star markers using the following:

    import matplotlib.pyplot as plt

    plt.plot(x, y, '*:b')

  4. Set the x-axis to 'Month' using the following code:

    plt.xlabel('Month')

  5. Set the y-axis to 'Items Sold' as follows:

    plt.ylabel('Items Sold')

  6. To set the title to read 'Items Sold has been Increasing Linearly', refer to the following code:

    plt.title('Items Sold has been Increasing Linearly')

    Check out the following screenshot for the resultant output:

Figure 2.33: Line plot of items sold by month
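
The steps in Activity 2 can be combined into a single script. The following is a minimal sketch rather than the book's own listing: it builds y with range() instead of typing the values out, and it selects the non-interactive Agg backend, which is an assumption made here so that the script runs without a display (drop that line when working in a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt

x = ['January', 'February', 'March', 'April', 'May', 'June']
# Six values starting at 1000 and increasing by 200 up to 2000
y = list(range(1000, 2001, 200))

# '*:b' is a format string: '*' star markers, ':' dotted line, 'b' blue
line, = plt.plot(x, y, '*:b')
plt.xlabel('Month')
plt.ylabel('Items Sold')
plt.title('Items Sold has been Increasing Linearly')
```

In an interactive session, calling plt.show() after these lines displays the figure.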

Activity 3: Bar Plot

Solution:

  1. Create a list of five strings for x containing the names of NBA franchises with the most titles using the following code:

    x = ['Boston Celtics','Los Angeles Lakers', 'Chicago Bulls', 'Golden State Warriors', 'San Antonio Spurs']

  2. Create a list of five values for y containing values for 'Titles Won' that correspond with the strings in x using the following code:

    y = [17, 16, 6, 6, 5]

  3. Place x and y into a data frame with the column names 'Team' and 'Titles', respectively, as follows:

    import pandas as pd

    df = pd.DataFrame({'Team': x,
                       'Titles': y})

  4. To sort the data frame in descending order by 'Titles' and save it as df_sorted, refer to the following code:

    df_sorted = df.sort_values(by='Titles', ascending=False)

    Note

    If we sort with ascending=True, the plot will have larger values to the right. Since we want the larger values on the left, we will be using ascending=False.

  5. Make a programmatic title and save it as title. First, find the team with the most titles and save it as the team_with_most_titles object. Because df_sorted keeps the index labels of df, select the first row by position with .iloc[0] using the following code:

    team_with_most_titles = df_sorted['Team'].iloc[0]

  6. Then, retrieve the number of titles for the team with the most titles using the following code:

    most_titles = df_sorted['Titles'].iloc[0]

  7. Lastly, create a string that reads 'The Boston Celtics have the most titles with 17' using the following code:

    title = 'The {} have the most titles with {}'.format(team_with_most_titles, most_titles)

  8. Use a bar graph to plot the number of titles by team using the following code:

    import matplotlib.pyplot as plt

    plt.bar(df_sorted['Team'], df_sorted['Titles'], color='red')

  9. Set the x-axis label to 'Team' using the following:

    plt.xlabel('Team')

  10. Set the y-axis label to 'Number of Championships' using the following:

    plt.ylabel('Number of Championships')

  11. To prevent the x tick labels from overlapping, rotate them 45 degrees using the following code:

    plt.xticks(rotation=45)

  12. Set the title of the plot to the programmatic title object we created as follows:

    plt.title(title)

  13. Save the plot to our current working directory as 'Titles_by_Team.png' using the following code:

    plt.savefig('Titles_by_Team')

  14. Display the plot using plt.show(). To understand this better, check out the following output screenshot:
    Figure 2.34: The bar plot of the number of titles held by an NBA team

    Note

    When we print the plot to the console using plt.show(), it appears as intended; however, when we open the file we created titled 'Titles_by_Team.png', we see that it crops the x tick labels.

    The following figure displays the bar plot with the cropped x tick labels.

    Figure 2.35: 'Titles_by_Team.png' with x tick labels cropped
  15. To fix the cropping issue, add bbox_inches='tight' as an argument inside of plt.savefig() as follows:

    plt.savefig('Titles_by_Team', bbox_inches='tight')

  16. Now, when we open the saved 'Titles_by_Team.png' file from our working directory, we see that the x tick labels are not cropped.

    Check out the following output for the final result:

Figure 2.36: 'Titles_by_Team.png' without cropped x tick labels
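
The Activity 3 steps can be sketched end to end as follows. This is an illustrative sketch under a few assumptions that are not in the original: the teams are deliberately listed out of order to show why the first row of the sorted data frame is best selected by position with .iloc[0], the Agg backend is selected so the script runs without a display, and the image is written to a temporary directory (out_path is a name introduced here) rather than the working directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same teams and title counts as the activity, deliberately out of order
df = pd.DataFrame({'Team': ['Chicago Bulls', 'San Antonio Spurs',
                            'Boston Celtics', 'Los Angeles Lakers',
                            'Golden State Warriors'],
                   'Titles': [6, 5, 17, 16, 6]})
df_sorted = df.sort_values(by='Titles', ascending=False)

# sort_values keeps the original index labels, so df_sorted['Team'][0]
# would return the row labeled 0 (Chicago Bulls); .iloc[0] returns the
# first row by position, which is the team with the most titles.
team_with_most_titles = df_sorted['Team'].iloc[0]
most_titles = df_sorted['Titles'].iloc[0]
title = 'The {} have the most titles with {}'.format(team_with_most_titles,
                                                     most_titles)

plt.figure()  # start from a fresh figure
plt.bar(df_sorted['Team'], df_sorted['Titles'], color='red')
plt.xlabel('Team')
plt.ylabel('Number of Championships')
plt.xticks(rotation=45)
plt.title(title)

# bbox_inches='tight' expands the saved area to include the rotated labels
out_path = os.path.join(tempfile.mkdtemp(), 'Titles_by_Team.png')
plt.savefig(out_path, bbox_inches='tight')
```

With the data in the order used in the activity, the row labeled 0 is already the Boston Celtics, so label-based and position-based selection return the same team; the position-based form keeps working when the input order changes.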

Activity 4: Multiple Plot Types Using Subplots

Solution:

  1. Import the 'Items_Sold_by_Week.csv' file and save it as the Items_by_Week data frame object using the following code:

    import pandas as pd

    Items_by_Week = pd.read_csv('Items_Sold_by_Week.csv')

  2. Import the 'Weight_by_Height.csv' file and save it as the Weight_by_Height data frame object as follows:

    Weight_by_Height = pd.read_csv('Weight_by_Height.csv')

  3. Import NumPy, then generate an array of 100 normally distributed numbers to use as data for the histogram and box-and-whisker plots and save it as y using the following code:

    import numpy as np

    y = np.random.normal(loc=0, scale=0.1, size=100)

  4. To generate a figure with six non-overlapping subplots organized in three rows and two columns, refer to the following code:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(nrows=3, ncols=2)

    plt.tight_layout()

  5. Set the respective axes' titles to match those in Figure 2.32 using the following code:

    axes[0,0].set_title('Line')

    axes[0,1].set_title('Bar')

    axes[1,0].set_title('Horizontal Bar')

    axes[1,1].set_title('Histogram')

    axes[2,0].set_title('Scatter')

    axes[2,1].set_title('Box-and-Whisker')

    Figure 2.37: Titled, non-overlapping empty subplots
  6. On the 'Line', 'Bar', and 'Horizontal Bar' axes, plot 'Items_Sold' by 'Week' from 'Items_by_Week' using:

    axes[0,0].plot(Items_by_Week['Week'], Items_by_Week['Items_Sold'])

    axes[0,1].bar(Items_by_Week['Week'], Items_by_Week['Items_Sold'])

    axes[1,0].barh(Items_by_Week['Week'], Items_by_Week['Items_Sold'])

    See the resultant output in the following figure:

    Figure 2.38: Line, bar, and horizontal bar plots added
  7. On the 'Histogram' and 'Box-and-Whisker' axes, plot the array of 100 normally distributed numbers using the following code:

    axes[1,1].hist(y, bins=20)

    axes[2,1].boxplot(y)

    The resultant output is displayed here:

    Figure 2.39: The histogram and box-and-whisker added
  8. Plot 'Weight' by 'Height' on the 'Scatter' axes from the 'Weight_by_Height' data frame using the following code:

    axes[2,0].scatter(Weight_by_Height['Height'], Weight_by_Height['Weight'])

    See the figure here for the resultant output:

    Figure 2.40: Scatterplot added
  9. Label the x- and y-axes of each subplot using axes[row, column].set_xlabel('X-Axis Label') and axes[row, column].set_ylabel('Y-Axis Label'), respectively.

    See the figure here for the resultant output:

    Figure 2.41: X and y axes have been labeled
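  10. Increase the size of the figure with the figsize argument in the subplots function. Because figsize is passed when the figure is created, this call replaces the one in step 4, and the titling, plotting, and labeling code must be run again on the new axes:

    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(8,8))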

  11. Save the figure to the current working directory as 'Six_Subplots' using the following code:

    fig.savefig('Six_Subplots')

    The following figure displays the 'Six_Subplots.png' file:

Figure 2.42: The Six_Subplots.png file
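
The Activity 4 steps can be sketched as one script in which figsize is set when the figure is created and tight_layout() is called after all labels are in place. This is a sketch under stated assumptions: the two small data frames are made-up stand-ins for 'Items_Sold_by_Week.csv' and 'Weight_by_Height.csv' (only the column names come from the activity), the axis label text and the labels dictionary are choices made here, the random generator is seeded for repeatability, and the image is written to a temporary directory via a hypothetical out_path.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-ins for the two CSV files; values are illustrative only
Items_by_Week = pd.DataFrame({'Week': [1, 2, 3, 4, 5],
                              'Items_Sold': [120, 150, 130, 170, 160]})
Weight_by_Height = pd.DataFrame({'Height': [60, 64, 68, 72, 76],
                                 'Weight': [115, 135, 155, 180, 205]})

# 100 normally distributed numbers (seeded so the run is repeatable)
rng = np.random.default_rng(0)
y = rng.normal(loc=0, scale=0.1, size=100)

# Set the figure size at creation time
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(8, 8))

axes[0, 0].plot(Items_by_Week['Week'], Items_by_Week['Items_Sold'])
axes[0, 1].bar(Items_by_Week['Week'], Items_by_Week['Items_Sold'])
axes[1, 0].barh(Items_by_Week['Week'], Items_by_Week['Items_Sold'])
axes[1, 1].hist(y, bins=20)
axes[2, 0].scatter(Weight_by_Height['Height'], Weight_by_Height['Weight'])
axes[2, 1].boxplot(y)

# (row, column) -> (title, x-axis label, y-axis label)
labels = {
    (0, 0): ('Line', 'Week', 'Items Sold'),
    (0, 1): ('Bar', 'Week', 'Items Sold'),
    (1, 0): ('Horizontal Bar', 'Items Sold', 'Week'),
    (1, 1): ('Histogram', 'Value', 'Frequency'),
    (2, 0): ('Scatter', 'Height', 'Weight'),
    (2, 1): ('Box-and-Whisker', 'Sample', 'Value'),
}
for (row, col), (title, xlabel, ylabel) in labels.items():
    axes[row, col].set_title(title)
    axes[row, col].set_xlabel(xlabel)
    axes[row, col].set_ylabel(ylabel)

# Adjust spacing once titles and labels exist, then save
fig.tight_layout()
out_path = os.path.join(tempfile.mkdtemp(), 'Six_Subplots.png')
fig.savefig(out_path)
```

Calling tight_layout() last lets Matplotlib account for the titles and axis labels when it spaces the subplots.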