Applied Data Science with Python and Jupyter

By : Alex Galea

Overview of this book

Getting started with data science doesn't have to be an uphill battle. Applied Data Science with Python and Jupyter is a step-by-step guide ideal for beginners who know a little Python and are looking for a quick, fast-paced introduction to these concepts. In this book, you'll learn every aspect of the standard data workflow process, including collecting, cleaning, investigating, visualizing, and modeling data. You'll start with the basics of Jupyter, which will be the backbone of the book. After familiarizing yourself with its standard features, you'll look at an example of it in practice with your first analysis. In the next lesson, you'll dive right into predictive analytics, where multiple classification algorithms are implemented. Finally, the book ends by looking at data collection techniques. You'll see how web data can be acquired with scraping techniques and via APIs, and then briefly explore interactive visualizations.
Table of Contents (6 chapters)

Chapter 3: Web Scraping and Interactive Visualizations


Activity 3: Web Scraping with Jupyter Notebooks

  1. For this page, the data can be scraped using the following code snippet:

    data = []
    for i, row in enumerate(soup.find_all('tr')):
        row_data = row.find_all('td')
        try:
            # Pick the three cells of interest from this row
            d1, d2, d3 = row_data[1], row_data[5], row_data[6]
            d1 = d1.find('a').text
            d2 = float(d2.text)
            d3 = d3.find_all('span')[1].text.replace('+', '')
            data.append([d1, d2, d3])
        except:
            # Rows that do not match the expected layout are skipped
            print('Ignoring row {}'.format(i))
  2. In the lesson-3-workbook.ipynb Jupyter Notebook, scroll to Activity A: Web scraping with Python.

  3. Set the url variable and load an IFrame of our page in the notebook by running the following code:

    url = 'http://www.worldometers.info/world-population/population-by-country/'
    IFrame(url, height=300, width=800)

    The page should load in the notebook. Scrolling down, we can see the Countries in the world by population heading and the table of values beneath it. We'll scrape the first three columns from this table to get the countries, populations, and yearly population changes.

  4. Close the IFrame by selecting the cell and clicking Current Outputs | Clear from the Cell menu in the Jupyter Notebook.

  5. Request the page and load it as a BeautifulSoup object by running the following code:

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    We feed the page content to the BeautifulSoup constructor. Recall that previously, we used page.text here instead. The difference is that page.content returns the raw binary response content, whereas page.text returns a string that Requests has decoded using the encoding it infers from the response (which is not always UTF-8). It's usually best practice to pass the bytes object and let BeautifulSoup detect the encoding and decode it, rather than relying on the Requests decoding in page.text.
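
To see why the decoding step matters, here is a minimal sketch using only the standard library. The sample bytes stand in for page.content and are made up for illustration; the second decode mimics what could happen if the wrong encoding were guessed when producing a string such as page.text.

```python
# Hypothetical response body: UTF-8 bytes with a non-ASCII character,
# standing in for page.content.
raw = "<h1>Côte d'Ivoire</h1>".encode('utf-8')

# Decoding with the correct encoding recovers the original text.
good = raw.decode('utf-8')

# Decoding with a wrongly guessed encoding silently garbles it.
bad = raw.decode('latin-1')

print(good)
print(bad)
```

Passing the raw bytes to BeautifulSoup lets it inspect the document itself (for example, a declared charset) before committing to an encoding.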

  6. Print the H1 for the page by running the following code:

    soup.find_all('h1')
    >> [<h1>Countries in the world by population (2017)</h1>]

    We'll scrape the table by searching for <th>, <tr>, and <td> elements, as in the previous exercise.

  7. Get and print the table headings by running the following code:

    table_headers = soup.find_all('th')
    table_headers
    >> [<th>#</th>,
    <th>Country (or dependency)</th>,
    <th>Population<br/> (2017)</th>,
    <th>Yearly<br/> Change</th>,
    <th>Net<br/> Change</th>,
    <th>Density<br/> (P/Km²)</th>,
    <th>Land Area<br/> (Km²)</th>,
    <th>Migrants<br/> (net)</th>,
    <th>Fert.<br/> Rate</th>,
    <th>Med.<br/> Age</th>,
    <th>Urban<br/> Pop %</th>,
    <th>World<br/> Share</th>]
  8. We are only interested in the first three columns. Select these and parse the text with the following code:

    table_headers = table_headers[1:4]
    table_headers = [t.text.replace('\n', '') for t in table_headers]

    After selecting the subset of table headers we want, we parse the text content from each and remove any newline characters.

    Now, we'll get the data. Following the same prescription as the previous exercise, we'll test how to parse the data for a sample row.

  9. Get the data for a sample row by running the following code:

    row_number = 2
    row_data = soup.find_all('tr')[row_number]\
        .find_all('td')
  10. How many columns of data do we have? Print the length of row_data by running print(len(row_data)).

  11. Print the first elements by running print(row_data[:4]):

    >> [<td>2</td>,
    <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/india-population/">India</a></td>,
    <td style="font-weight: bold;">1,339,180,127</td>,
    <td>1.13 %</td>]

    It's pretty obvious that we want to select list indices 1, 2, and 3. The first data value can be ignored, as it's simply the index.

  12. Select the data elements we're interested in parsing by running the following code:

    d1, d2, d3 = row_data[1:4]
  13. Looking at the row_data output, we can find out how to correctly parse the data. We'll want to select the content of the <a> element in the first data element, and then simply get the text from the others. Test these assumptions by running the following code:

    print(d1.find('a').text)
    print(d2.text)
    print(d3.text)
    >> India
    >> 1,339,180,127
    >> 1.13 %

    Excellent! This looks to be working well. Now, we're ready to scrape the entire table.

  14. Scrape and parse the table data by running the following code:

    data = []
    for i, row in enumerate(soup.find_all('tr')):
        try:
            d1, d2, d3 = row.find_all('td')[1:4]
            d1 = d1.find('a').text
            d2 = d2.text
            d3 = d3.text
            data.append([d1, d2, d3])
        except:
            print('Error parsing row {}'.format(i))
    >> Error parsing row 0

    This is quite similar to before, where we try to parse the text and skip the row if there's some error.
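
The "Error parsing row 0" message comes from the header row, which holds <th> cells and therefore has no <td> cells to unpack. Below is a minimal sketch of that mechanism, using plain lists in place of the BeautifulSoup results. The sample rows are illustrative, and the sketch catches ValueError specifically rather than every exception.

```python
# Each inner list stands in for the result of row.find_all('td').
rows = [
    [],  # header row: only <th> cells, so no <td> cells are found
    ['1', 'China', '1,409,517,397', '0.43 %', '6,035,100'],
    ['2', 'India', '1,339,180,127', '1.13 %', '14,973,720'],
]

data = []
skipped = []
for i, cells in enumerate(rows):
    try:
        # Slicing an empty list gives [], which cannot be unpacked
        # into three names, so a ValueError is raised.
        d1, d2, d3 = cells[1:4]
        data.append([d1, d2, d3])
    except ValueError:
        skipped.append(i)

print(skipped)
print(data)
```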

  15. Print the head of the scraped data by running print(data[:10]):

    >> [['China', '1,409,517,397', '0.43 %'],
    ['India', '1,339,180,127', '1.13 %'],
    ['U.S.', '324,459,463', '0.71 %'],
    ['Indonesia', '263,991,379', '1.10 %'],
    ['Brazil', '209,288,278', '0.79 %'],
    ['Pakistan', '197,015,955', '1.97 %'],
    ['Nigeria', '190,886,311', '2.63 %'],
    ['Bangladesh', '164,669,751', '1.05 %'],
    ['Russia', '143,989,754', '0.02 %'],
    ['Mexico', '129,163,276', '1.27 %']]

    It looks like we have managed to scrape the data! Notice how similar the process was for this table compared to the Wikipedia one, even though this web page is completely different. Of course, it will not always be the case that data is contained within a table, but regardless, we can usually use find_all as the primary method for parsing.
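
Note that the scraped values are still strings. The sketch below shows one way they might be converted to numbers before analysis. The helper names parse_population and parse_percent are hypothetical and are not part of the workbook.

```python
def parse_population(text):
    """Convert a string such as '1,409,517,397' to an int."""
    return int(text.replace(',', ''))


def parse_percent(text):
    """Convert a string such as '0.43 %' to a float (in percent)."""
    return float(text.replace('%', '').strip())


# Two rows copied from the scraped output above.
rows = [
    ['China', '1,409,517,397', '0.43 %'],
    ['India', '1,339,180,127', '1.13 %'],
]

cleaned = [
    [country, parse_population(pop), parse_percent(change)]
    for country, pop, change in rows
]
print(cleaned)
```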

  16. Finally, save the data to a CSV file for later use. Do this by running the following code:

    f_path = '../data/countries/populations.csv'
    with open(f_path, 'w') as f:
        f.write('{};{};{}\n'.format(*table_headers))
        for d in data:
            f.write('{};{};{}\n'.format(*d))
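
A semicolon is used as the separator because the population values themselves contain commas. As an alternative sketch, the standard library csv module can write and read the same layout. The temporary directory and the shortened sample data here are assumptions made so the example runs on its own.

```python
import csv
import os
import tempfile

# Illustrative stand-ins for the scraped headers and rows.
table_headers = ['Country (or dependency)', 'Population (2017)', 'Yearly Change']
data = [
    ['China', '1,409,517,397', '0.43 %'],
    ['India', '1,339,180,127', '1.13 %'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    f_path = os.path.join(tmp_dir, 'populations.csv')

    # Write the header and rows using ';' as the delimiter.
    with open(f_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter=';')
        writer.writerow(table_headers)
        writer.writerows(data)

    # Read the file back to confirm the round trip.
    with open(f_path, newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f, delimiter=';'))

print(rows[0])
```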

Activity 4: Exploring Data with Interactive Visualizations

  1. In the lesson-3-workbook.ipynb file, scroll to the Activity B: Interactive visualizations with Bokeh section.

  2. Load the previously scraped, merged, and cleaned web page data by running the following code:

    df = pd.read_csv('../data/countries/merged.csv')
    df['Date of last change'] = pd.to_datetime(df['Date of last change'])
  3. Recall what the data looks like by displaying the DataFrame:

    Figure 3.18: Output of the data within DataFrame

    Whereas in the previous exercise we were interested in learning how Bokeh worked, now we are interested in what this data looks like. In order to explore this dataset, we are going to use interactive visualizations.

  4. Draw a scatter plot of the population as a function of the interest rate by running the following code:

    source = ColumnDataSource(data=dict(
        x=df['Interest rate'],
        y=df['Population'],
        desc=df['Country'],
    ))
    hover = HoverTool(tooltips=[
        ('Country', '@desc'),
        ('Interest Rate (%)', '@x'),
        ('Population', '@y')
    ])
    tools = [hover, PanTool(), BoxZoomTool(), WheelZoomTool(), ResetTool()]
    p = figure(tools=tools,
               x_axis_label='Interest Rate (%)',
               y_axis_label='Population')
    p.circle('x', 'y', size=10, alpha=0.5, source=source)
    show(p)

    Figure 3.19: Scatter plot of population and interest rate

    This is quite similar to the final examples we looked at when introducing Bokeh in the previous exercise. We set up a customized data source with the x and y coordinates for each point, along with the country name. This country name is passed to the Hover Tool, so that it's visible when hovering the mouse over the dot. We pass this tool to the figure, along with a set of other useful tools.

  5. In the data, we see some clear outliers with high populations. Hover over these to see what they are:

    Figure 3.20: Labels obtained by hovering over data points

    We see they belong to India and China. These countries have fairly average interest rates. Let's focus on the rest of the points by using the Box Zoom tool to modify the view window size.

  6. Select the Box Zoom tool and alter the viewing window to better see the majority of the data:

    Figure 3.21: The Box Zoom tool

    Figure 3.22: Scatter plot with majority of the data points within the box

    Explore the points and see how the interest rates compare for various countries. Which countries have the highest interest rates?

    Figure 3.23: Hovering over data points to view detailed data

  7. Some of the lower population countries appear to have negative interest rates. Select the Wheel Zoom tool and use it to zoom in on this region. Use the Pan tool to re-center the plot, if needed, so that the negative interest rate samples are in view. Hover over some of these and see what countries they correspond to:

    Figure 3.24: Screenshot of the Wheel Zoom tool

    Figure 3.25: Data points for countries with negative interest rates

    Let's re-plot this, adding a color based on the date of last interest rate change. This will be useful to search for relations between the date of last change and the interest rate or population size.

  8. Add a Year of last change column to the DataFrame by running the following code:

    def get_year(x):
        year = x.strftime('%Y')
        if year in ['2018', '2017', '2016']:
            return year
        else:
            return 'Other'
    df['Year of last change'] = df['Date of last change'].apply(get_year)
  9. Create a map to group the last change date into color categories by running the following code:

    year_to_color = {
        '2018': 'black',
        '2017': 'blue',
        '2016': 'orange',
        'Other': 'red'
    }

    Once mapped to the Year of last change column, this will assign values to colors based on the available categories: 2018, 2017, 2016, and Other. The colors here are standard strings, but they could alternatively be represented by hexadecimal codes.
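
As a rough sketch of the hexadecimal alternative, the mapping below uses the standard CSS hex equivalents of the same four named colors and applies it to a few made-up dates. The sample dates and the name year_to_hex are assumptions for illustration only.

```python
from datetime import datetime


def get_year(x):
    """Bucket a date into '2018', '2017', '2016', or 'Other'."""
    year = x.strftime('%Y')
    if year in ['2018', '2017', '2016']:
        return year
    return 'Other'


# Hex equivalents of black, blue, orange, and red.
year_to_hex = {
    '2018': '#000000',
    '2017': '#0000ff',
    '2016': '#ffa500',
    'Other': '#ff0000',
}

# Made-up sample dates standing in for the 'Date of last change' column.
dates = [datetime(2018, 3, 1), datetime(2016, 11, 15), datetime(2012, 6, 30)]

labels = [get_year(d) for d in dates]
colors = [year_to_hex[label] for label in labels]
print(list(zip(labels, colors)))
```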

  10. Create the colored visualization by running the following code:

    source = ColumnDataSource(data=dict( x=df['Interest rate'],
    ...
    ...
    fill_color='colors', line_color='black', legend='label')
    show(p)

    Note

    For the complete code, refer to the following: https://bit.ly/2Si3K04

    Figure 3.26: Visualization obtained after assigning values to colors

    There are some technical details that are important here. First of all, we add the colors and labels for each point to the ColumnDataSource. These are then referenced when plotting the circles by setting the fill_color and legend arguments.

  11. Looking for patterns, zoom in on the lower population countries:

    Figure 3.27: A zoomed in view of the lower population countries

    We can see that the dark dots are more prevalent toward the right-hand side of the plot. This indicates that countries with higher interest rates are more likely to have changed their rates recently.

    The one data column we have not yet looked at is the year-over-year change in population. Let's visualize this compared to the interest rate and see if there is any trend. We'll also enhance the plot by setting the circle size based on the country population.

  12. Plot the interest rate as a function of the year-over-year population change by running the following code:

    source = ColumnDataSource(data=dict( x=df['Yearly Change'],
    ...
    ...
    p.circle('x', 'y', size=10, alpha=0.5, source=source, radius='radii')
    show(p)

    Figure 3.28: Plotting interest rate as a function of YoY population change

    Here, we use the square root of the population for the radii, making sure to also scale down the result to a good size for the visualization.
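
The scaling described here can be sketched as follows. Because a circle's area grows with the square of its radius, taking the square root makes the drawn area proportional to the population. The 1e5 divisor is an assumption borrowed from the weights used for the line of best fit in this activity, since the plotting code itself is elided.

```python
import math

# Populations for China and Mexico, taken from the scraped output.
populations = [1409517397, 129163276]

# Assumed scale factor; the value used in the elided plot code may differ.
scale = 1e5

radii = [math.sqrt(p) / scale for p in populations]
print(radii)

# The ratio of circle areas matches the ratio of populations.
area_ratio = (radii[0] / radii[1]) ** 2
print(area_ratio, populations[0] / populations[1])
```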

    We see a strong correlation between the year-over-year population change and the interest rate. This correlation is especially strong when we take the population sizes into account, by looking primarily at the bigger circles. Let's add a line of best fit to the plot to illustrate this correlation.

    We'll use scikit-learn to create the line of best fit, using the country populations (as visualized in the preceding plot) as weights.

  13. Determine the line of best fit for the previously plotted relationship by running the following code:

    from sklearn.linear_model import LinearRegression
    X = df['Yearly Change'].values.reshape(-1, 1)
    y = df['Interest rate'].values
    weights = np.sqrt(df['Population'])/1e5
    lm = LinearRegression()
    lm.fit(X, y, sample_weight=weights)
    lm_x = np.linspace(X.flatten().min(), X.flatten().max(), 50)
    lm_y = lm.predict(lm_x.reshape(-1, 1))

    The scikit-learn code should be familiar from earlier in this book. As promised, we are using the transformed populations, as seen in the previous plot, as the weights. The line of best fit is then calculated by predicting the linear model values for a range of x values.
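
To make the role of the weights more concrete, here is a small pure-Python sketch of weighted least squares for a single feature, which minimizes the weighted sum of squared residuals. It is not scikit-learn's implementation; the function name weighted_line_fit and the sample points are hypothetical.

```python
def weighted_line_fit(xs, ys, ws):
    """Return (slope, intercept) minimizing sum(w * (y - a - b*x)**2)."""
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Three points lie on y = 2x + 1; the last point is an outlier.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 20.0]

equal = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1.0])
downweighted = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1e-6])

print(equal)
print(downweighted)
```

With equal weights, the outlier pulls the slope well above 2. Once its weight is made tiny, the fit stays close to the line through the other three points, which is the same effect the population-based weights have on small countries in the plot.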

    To plot the line, we can reuse the preceding code, adding an extra call to the line method of the Bokeh figure. We'll also have to set a new data source for this line.

  14. Re-plot the preceding figure, adding a line of best fit, by running the following code:

    source = ColumnDataSource(data=dict( x=df['Yearly Change'], y=df['Interest rate'],
    ...
    ...
    p.line('x', 'y', line_width=2, line_color='red', source=lm_source)
    show(p)

    Figure 3.29: Adding a best fit line to the plot of YoY population change and interest rates

    For the line source, lm_source, we include N/A as the country name and population, as these are not applicable values for the line of best fit. As can be seen by hovering over the line, they indeed appear in the tooltip.

    The interactive nature of this visualization gives us a unique opportunity to explore outliers in this dataset, for example, the tiny dot in the lower-right corner.

  15. Explore the plot by using the zoom tools and hovering over interesting samples. Note the following:

    Ukraine has an unusually high interest rate, given the low year-over-year population change:

    Figure 3.30: Using the Zoom tool to explore the data for Ukraine

    The small country of Bahrain has an unusually low interest rate, given the high year-over-year population change:

    Figure 3.31: Using the Zoom tool to explore the data for Bahrain