Hands-On Web Scraping with Python - Second Edition

By : Anish Chapagain

Overview of this book

Web scraping is a powerful tool for extracting data from the web, but it can be daunting for those without a technical background. Designed for novices, this book will help you grasp the fundamentals of web scraping and Python programming, even if you have no prior experience. Adopting a practical, hands-on approach, this updated edition of Hands-On Web Scraping with Python uses real-world examples and exercises to explain key concepts. Starting with an introduction to web scraping fundamentals and Python programming, you’ll cover a range of scraping techniques, including requests, lxml, pyquery, Scrapy, and Beautiful Soup. You’ll also get to grips with advanced topics such as secure web handling, web APIs, Selenium for web scraping, PDF extraction, regex, data analysis, EDA reports, visualization, and machine learning. This book emphasizes the importance of learning by doing. Each chapter integrates examples that demonstrate practical techniques and related skills. By the end of this book, you’ll be equipped with the skills to extract data from websites, a solid understanding of web scraping and Python programming, and the confidence to use these skills in your projects for analysis, visualization, and information discovery.
Table of Contents (20 chapters)

  • Part 1: Python and Web Scraping
  • Part 2: Beginning Web Scraping
  • Part 3: Advanced Scraping Concepts
  • Part 4: Advanced Data-Related Concepts
  • Part 5: Conclusion

Understanding the latest web technologies

A web page is not only a document or container of content. The rapid development in computing and web-related technologies today has transformed the web, with more security features being implemented and the web becoming a dynamic, real-time source of information. Many scraping communities gather historic data; some analyze hourly data or the latest obtained data.

At our end, we (users) use web browsers (such as Google Chrome, Mozilla Firefox, and Safari) as an application to access information from the web. Web browsers provide various document-based functionalities to users and contain application-level features that are often useful to web developers.

Web pages that users view or explore through their browsers are not just single documents. Various technologies exist that can be used to develop websites or web pages. A web page is a document that contains blocks of HTML tags. Most of the time, it is built with various sub-blocks linked as dependent or independent components from various interlinked technologies, including JavaScript and Cascading Style Sheets (CSS).

An understanding of the general concepts of web pages and the techniques of web development, along with the technologies found inside web pages, will provide more flexibility and control in the scraping process. A lot of the time, a developer can also employ reverse-engineering techniques.

Reverse engineering is an activity that involves breaking down and examining the concepts that were required to build certain products. For more information on reverse engineering, please refer to the GlobalSpec article How Does Reverse Engineering Work?, available at https://insights.globalspec.com/article/7367/how-does-reverse-engineering-work.

Here, we will introduce and explore a few of the available web technologies that can help and guide us in the process of data extraction.

HTTP

Hypertext Transfer Protocol (HTTP) is an application protocol that transfers web-based resources, such as HTML documents, between a client and a web server. HTTP is a stateless protocol that follows the client-server model. Clients (web browsers) and web servers exchange information using HTTP requests and HTTP responses, as seen in Figure 1.2:

Figure 1.2: HTTP (client and server or request-response communication)

Requests and responses are cyclic in nature – they are like questions and answers from clients to the server, and vice versa.

Hypertext Transfer Protocol Secure (HTTPS) is the encrypted, more secure version of HTTP. It uses Transport Layer Security (TLS) (learn more about TLS at https://developer.mozilla.org/en-US/docs/Glossary/TLS), the successor to the now-deprecated Secure Sockets Layer (SSL) (learn more about SSL at https://developer.mozilla.org/en-US/docs/Glossary/SSL), to encrypt the content exchanged between a client and a server. This encryption allows clients to exchange sensitive data with a server safely. Activities such as banking, online shopping, and e-payment use HTTPS to keep sensitive data from being exposed.

Important note

An HTTP request URL begins with http://, for example, http://www.packtpub.com, and an HTTPS request URL begins with https://, such as https://www.packtpub.com.

You have now learned a bit about HTTP. In the next section, you will learn about HTTP requests (or HTTP request methods).

HTTP requests (or HTTP request methods)

Web browsers or clients submit their requests to the server. Requests are forwarded to the server using various methods (commonly known as HTTP request methods), such as GET and POST:

  • GET: This is the most common method for requesting information. It is considered a safe method as the resource state is not altered here. Also, it is used to provide query strings, such as https://www.google.com/search?q=world%20cup%20football&source=hp, which is requesting information from Google based on the q (world cup football) and source (hp) parameters sent with the request. Information or queries (q and source in this example) with values are displayed in the URL.
  • POST: Used to send data to the server. The requested resource state can be altered. Data posted to the requested URL is not visible in the URL but is transferred in the request body, which makes POST the usual choice for submitting information such as logins and user registrations. Note that the body is only encrypted in transit when the request is made over HTTPS.
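
The difference between the two methods can be sketched with Python's standard urllib module. The snippet only builds the requests and never sends them, so it runs without network access. The login URL and the form field names are hypothetical placeholders, not taken from a real site.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# GET: parameters travel in the URL as a query string.
params = {"q": "world cup football", "source": "hp"}
query_string = urlencode(params, quote_via=quote)  # spaces become %20
get_request = Request("https://www.google.com/search?" + query_string)

# POST: parameters travel in the request body, not in the URL.
# The URL and field names below are hypothetical placeholders.
form_data = {"username": "demo_user", "password": "demo_pass"}
post_request = Request(
    "https://example.com/login",
    data=urlencode(form_data).encode("utf-8"),
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
print("POST body:", post_request.data)
```

A Request object that carries data reports POST as its method, while one without data reports GET, which mirrors the way the two methods are described in the preceding list.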

We will explore more about HTTP methods in the Implementing HTTP methods section of Chapter 2.

There are two main parts to HTTP communication, as seen in Figure 1.2. With a basic idea about HTTP requests, let’s explore HTTP responses in the next section.

HTTP responses

The server processes the request, along with any HTTP headers sent with it, and then returns its response to the browser. Most of the time, responses are in HTML format, but they can also arrive as JavaScript, JavaScript Object Notation (JSON), or other document types.

A response carries a status code, which can be inspected using Developer Tools (DevTools). The following list contains a few status codes along with some brief information about what they mean:

  • 200: OK, request succeeded
  • 404: Not found, requested resource cannot be found
  • 500: Internal server error
  • 204: No content to be sent
  • 401: Unauthorized request was made to the server

There are also some groups of responses that can be identified from a range of HTTP response statuses:

  • 100–199: Informational responses
  • 200–299: Successful responses
  • 300–399: Redirection responses
  • 400–499: Client error
  • 500–599: Server error
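
The grouping above can be expressed as a small helper. This is a minimal sketch that relies on the standard library's http.HTTPStatus enumeration for the official phrases; the describe_status function name is our own and is used only for illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the response group and standard phrase for a status code."""
    groups = {
        1: "Informational response",
        2: "Successful response",
        3: "Redirection response",
        4: "Client error",
        5: "Server error",
    }
    group = groups.get(code // 100, "Unknown group")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Raised for codes that the standard library does not define.
        phrase = "Unrecognized status code"
    return f"{code}: {group} ({phrase})"


for status_code in (200, 204, 401, 404, 500):
    print(describe_status(status_code))
```

In a scraper, a check of this kind is typically used to decide whether a response is worth parsing, should be retried, or should be skipped.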

Important note

For more information on cookies, HTTP, HTTP responses, and status codes, please consult the official documentation at https://www.w3.org/Protocols/ and https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

Now that we have a basic idea about HTTP responses and requests, let us explore HTTP cookies (one of the most important factors in web scraping).

HTTP cookies

HTTP cookies are data sent by the server to the browser. This data is generated and stored by websites on your system or computer. It helps to identify HTTP requests from the user to the website. Cookies contain information regarding session management, user preferences, and user behavior.

The server identifies and communicates with the browser based on the information stored in the cookies. Data stored in cookies helps a website to access and transfer certain saved values, such as the session ID and expiration date and time, providing a quick interaction between the web request and response.
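
To make the idea of values saved in a cookie more concrete, the following sketch parses a cookie with Python's standard http.cookies module. The header value is a hypothetical example of what a server might send, not one captured from a real website.

```python
from http.cookies import SimpleCookie

# A hypothetical Set-Cookie header value, as a server might send it.
raw_header = "sessionid=abc123; Path=/; Max-Age=3600"

cookie = SimpleCookie()
cookie.load(raw_header)

session = cookie["sessionid"]
print(session.value)       # the stored session ID
print(session["path"])     # the URL path the cookie applies to
print(session["max-age"])  # lifetime in seconds, kept as a string

# The value a client would send back in its Cookie request header.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)
print(cookie_header)
```

Only the name and value are sent back to the server; attributes such as Path and Max-Age tell the client where and for how long the cookie applies.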

Figure 1.3 displays the list of request cookies from https://www.fifa.com/fifaplus/en, collected using Chrome DevTools:

Figure 1.3: Request cookies

We will explore and collect more information about and from browser-based DevTools in the upcoming sections and Chapter 3.

Important note

For more information about cookies, please visit About Cookies at http://www.aboutcookies.org/ and All About Cookies at http://www.allaboutcookies.org/.

Similar to the role of cookies, HTTP proxies are also quite important in scraping. We will explore more about proxies in the next section, and also in some later chapters.

HTTP proxies

A proxy server acts as an intermediate server between a client and the main web server. The web browser sends requests to the server that are actually passed through the proxy, and the proxy returns the response from the server to the client.

Proxies are often used for monitoring/filtering, performance improvement, translation, and security for internet-related resources. Proxies can also be bought as a service, which may also be used to deal with cross-domain resources. There are also various forms of proxy implementation, such as web proxies (which can be used to bypass IP blocking), CGI proxies, and DNS proxies.

You can buy or have a contract with a proxy seller or a similar organization. They will provide you with various types of proxies according to the country in which you are operating. Proxy switching during crawling is done frequently – a proxy allows us to bypass restricted content too. Normally, if a request is routed through a proxy, our IP is somewhat safe and not revealed as the receiver will just see the third-party proxy in their detail or server logs. You can even access sites that aren’t available in your location (that is, you see an access denied in your country message) by switching to a different proxy.
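
As a rough illustration of routing requests through a proxy, the sketch below configures one with Python's standard urllib module. The proxy address is a hypothetical placeholder from an IP range reserved for documentation; a real address would come from your proxy provider. No request is sent, so the code runs without network access.

```python
import urllib.request

# Hypothetical proxy address from a documentation-only IP range.
PROXY_URL = "http://203.0.113.10:8080"

proxy_handler = urllib.request.ProxyHandler(
    {"http": PROXY_URL, "https": PROXY_URL}
)
opener = urllib.request.build_opener(proxy_handler)

# Requests made with opener.open(url) would be sent to the proxy first.
# No request is made here, so the sketch runs without network access.
print(proxy_handler.proxies)
```

Switching proxies during a crawl then amounts to building a new opener with a different address.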

Passing cookie-based parameters in HTTP GET requests, submitting HTML form data in HTTP POST requests, and modifying or adapting headers are all crucial to managing code (that is, scripts) and accessing content during the web scraping process.

Important note

Details on HTTP, headers, cookies, and so on will be explored more in an upcoming section, Data-finding techniques used in web pages. Please visit the HTTP page in the MDN web docs (https://developer.mozilla.org/en-US/docs/Web/HTTP) for more detailed information on HTTP and related concepts. Please visit https://www.softwaretestinghelp.com/best-proxy-server/ for information on the best proxy server.

You now understand general concepts regarding HTTP (including requests, responses, cookies, and proxies). Next, we will understand the technology that is used to create web content or make content available in some predefined formats.

HTML

Websites are made up of pages or documents containing text, images, style sheets, and scripts, among other things. They are often built with markup languages such as Hypertext Markup Language (HTML) and Extensible Hypertext Markup Language (XHTML).

HTML is often referred to as the standard markup language used for building a web page. Since the early 1990s, HTML has been used independently as well as in conjunction with server-based scripting languages, such as PHP, ASP, and JSP. XHTML is a stricter reformulation of HTML, the primary markup language for web documents, written as an application of Extensible Markup Language (XML).

HTML defines and contains the content of a web page. Data that can be extracted, and any information-revealing data sources, can be found inside HTML pages within a predefined instruction set or markup elements called tags. HTML tags are normally a named placeholder carrying certain predefined attributes, for example, <a>, <b>, <table>, <img>, and <script>.

HTML is a container or type of markup language. Various factors are involved in building HTML; the next section defines these factors with some examples.

HTML elements and attributes

HTML elements (also referred to as document nodes) are the building blocks of web documents. HTML elements are built with a start tag, <..>, and an end tag, </..>, with certain content inside them. An HTML element can also contain attributes, usually defined as attribute-name = attribute-value, which provide additional information to the element:

<p>normal paragraph tags</p>
<h1>heading tags there are also h2, h3, h4, h5, h6</h1>
<a href="https://www.google.com">Click here for Google.com</a>
<img src="myphoto1.jpg" width="300" height="300" alt="Picture" />
<br />

The preceding code can be broken down as follows:

  • <p> and <h1> are HTML elements containing general text information (element content).
  • <a> is defined with an href attribute that contains the actual link that will be processed when the text Click here for Google.com is clicked. The link refers to https://www.google.com/.
  • The <img> image tag also contains a few attributes, such as src and alt, along with their respective values. src holds the resource, which means the image address or image URL, as a value, whereas alt holds the value for alternative text (mostly displayed when there is a slow connection or the image is not able to load) for <img>.
  • <br/> represents a line break in HTML and has no attributes or text content. It is used to insert a new line in the layout of the document.
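
The elements and attributes described above can be read programmatically. The following is a minimal sketch using html.parser from Python's standard library; the AttributeCollector class name is our own, and the libraries introduced in later chapters offer more convenient ways to do the same thing.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect link targets and image details from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attrs is a list of (name, value) pairs
        if tag == "a":
            self.links.append(attributes.get("href"))
        elif tag == "img":
            self.images.append((attributes.get("src"), attributes.get("alt")))


html = """
<p>normal paragraph tags</p>
<a href="https://www.google.com">Click here for Google.com</a>
<img src="myphoto1.jpg" width="300" height="300" alt="Picture" />
"""

collector = AttributeCollector()
collector.feed(html)
print(collector.links)
print(collector.images)
```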

HTML elements can also be nested in a tree-like structure with a parent-child hierarchy, as follows:

<div class="article">
  <p id="mainContent" class="content">
    <b>Paragraph Content</b>
    <img src="mylogo.png" id="pageLogo" alt="Logo" class="logo"/>
  </p>
  <p>
    <h3> Paragraph Title: Web Scraping</h3>
  </p>
</div>

As seen in the preceding code, two <p> child elements are found inside an HTML <div> block. The first <p> carries the id and class attributes, and both contain child elements of their own as content. Normally, HTML documents are built with this kind of nested structure.
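
The parent-child hierarchy can be walked in code. Because the nested snippet happens to be well formed, this sketch treats it as XML and reads it with the standard library's xml.etree.ElementTree. That is an assumption made for illustration only: a real HTML parser may restructure the markup (for example, by closing <p> before the <h3>), so the tree a browser builds can differ.

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="article">
  <p id="mainContent" class="content">
    <b>Paragraph Content</b>
    <img src="mylogo.png" id="pageLogo" alt="Logo" class="logo"/>
  </p>
  <p>
    <h3> Paragraph Title: Web Scraping</h3>
  </p>
</div>
"""

root = ET.fromstring(snippet)

# Walk one level at a time: parent -> children -> grandchildren.
for child in root:
    grandchildren = [grandchild.tag for grandchild in child]
    print(root.tag, ">", child.tag, child.attrib, ">", grandchildren)

# Jump straight to a nested element by attribute value.
logo = root.find(".//img[@id='pageLogo']")
print(logo.get("src"))
```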

The nested HTML example also contains a few extra key-value pairs, such as id and class. The next section explores these.

Global attributes

HTML elements can contain some additional information, such as key-value pairs. These are also known as HTML element attributes. Attributes hold values and provide identification, or contain additional information that can be helpful in many aspects during scraping activities, such as identifying exact web elements and extracting values or text from them and traversing (moving along) elements.

There are certain attributes that are common to HTML elements or can be applied to all HTML elements. The following list mentions some of the attributes that are identified as global attributes (https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes):

  • id: This attribute’s values should be unique to the element they are applied to
  • class: This attribute’s values are mostly used with CSS to apply the same formatting to a group of elements, and the same value can be used with multiple elements
  • style: This specifies inline CSS styles for an element
  • lang: This helps to identify the language of the text

Important note

The id and class attributes are mostly used to identify or format individual elements or groups of them. These attributes can also be managed by CSS and other scripting languages. In CSS, and while traversing or parsing a document, they are targeted by placing # and ., respectively, in front of the attribute value (for example, #mainContent and .content).

HTML element attributes can also be overwritten or implemented dynamically using scripting languages. As displayed in the following example, itemprop attributes are used to add properties to an element, whereas data-* is used to store data that is native to the element itself:

<div itemscope itemtype="http://schema.org/Place">
   <h1 itemprop="university">University of Helsinki</h1>
   <span>Subject: <span itemprop="subject1">Artificial Intelligence</span></span>
   <span itemprop="subject2">Data Science</span>
</div>
<img class="dept" src="logo.png" data-course-id="324"
   data-title="Predictive Analysis" data-x="12345" data-y="54321"
   data-z="56743" onclick="schedule.load()"/>

HTML tags and attributes are very helpful when extracting data.
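
As a small illustration of why these attributes matter, the sketch below gathers every data-* attribute from a tag using html.parser from the standard library. The DataAttributeCollector class name is our own, and the input is a shortened version of the <img> tag shown earlier.

```python
from html.parser import HTMLParser


class DataAttributeCollector(HTMLParser):
    """Gather data-* attributes from every start tag."""

    def __init__(self):
        super().__init__()
        self.data_attributes = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("data-"):
                self.data_attributes[name] = value


html = (
    '<img class="dept" src="logo.png" data-course-id="324" '
    'data-x="12345" data-y="54321" />'
)

collector = DataAttributeCollector()
collector.feed(html)
print(collector.data_attributes)
```

Values stored this way are often the identifiers a page later sends to the server, which makes them useful starting points when looking for data.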

Important note

Please visit https://www.w3.org or https://www.w3schools.com/html for more detailed information on HTML.

In Chapter 3, we will explore these attributes using different tools. We will also perform various logical operations and use them for extracting or scraping purposes.

We now have some idea about HTML and a few important attributes related to HTML. In the next section, we will learn the basics of XML, a general-purpose markup language on which formats such as XHTML are built.

XML

XML is a markup language used for distributing data over the internet. It defines a set of rules for encoding documents so that they are readable by both people and machines and are easily exchanged between systems. XML files are recognized by the .xml extension.

XML emphasizes the usability of textual data across various formats and systems. It is designed to carry portable data in tags that, unlike HTML tags, are not predefined. In XML documents, tags are created by the document developer or an automated program to describe the content.

The following code displays some example XML content:

<employees>
  <employee>
    <fullName>Shiba Chapagain</fullName>
    <gender>Female</gender>
  </employee>
  <employee>
    <fullName>Aasira Chapagain</fullName>
    <gender>Female</gender>
  </employee>
</employees>

In the preceding code, the <employees> parent node has two <employee> child nodes, which in turn contain the other child nodes of <fullName> and <gender>.
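
The same structure can be read in Python. The following minimal sketch uses xml.etree.ElementTree from the standard library to turn each <employee> node into a dictionary; the variable names are our own.

```python
import xml.etree.ElementTree as ET

xml_text = """
<employees>
  <employee>
    <fullName>Shiba Chapagain</fullName>
    <gender>Female</gender>
  </employee>
  <employee>
    <fullName>Aasira Chapagain</fullName>
    <gender>Female</gender>
  </employee>
</employees>
"""

root = ET.fromstring(xml_text)

# Turn each <employee> node into a dictionary of its child tags and text.
employees = [
    {field.tag: field.text for field in employee}
    for employee in root.findall("employee")
]
print(employees)
```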

XML is an open standard, using the Unicode character set. XML is used to share data across various platforms and has been adopted by various web applications. Many websites use XML data, implementing its contents with the use of scripting languages and presenting it in HTML or other document formats for the end user to view.

Data can also be extracted from XML documents, either to obtain the contents in a desired format or to filter them down to a specific data need. In addition, certain websites expose behind-the-scenes data in XML form.

Important note

Please visit https://www.w3.org/XML/ and https://www.w3schools.com/xml/ for more information on XML.

So far, we have explored content placing and content holding related technologies based on markup languages such as HTML and XML. These technologies are somewhat static in nature. The next section is about JavaScript, which provides dynamism to the web with the help of scripts.

JavaScript

JavaScript (often abbreviated as JS) is a programming language used to program HTML and web applications that run in the browser. It is mostly used for adding dynamic features and providing user-based interaction inside web pages. JavaScript, HTML, and CSS are among the most-used web technologies, and now they are also used with headless browsers (you can read more about headless browsers at https://oxylabs.io/blog/what-is-headless-browser). The client-side availability of the JavaScript engine has also strengthened its usage in application testing and debugging.

The <script> tag contains programming logic with JavaScript variables, operators, functions, arrays, loops, conditions, and events, targeting the HTML Document Object Model (DOM). JavaScript code can be added to HTML inline using <script>, as seen in the following code, or loaded from a separate file:

<!DOCTYPE html>
<html>
<head>
   <script>
      function placeTitle() {
         document.getElementById("innerDiv").innerHTML =
            "Welcome to WebScraping";
      }
   </script>
</head>
<body>
   <div>Press the button: <p id="innerDiv"></p></div>
   <button id="btnTitle" name="btnTitle" type="submit"
      onclick="placeTitle()">
      Load Page Title!
   </button>
</body>
</html>

As seen in the preceding code, the HTML <head> tag contains <script> with the placeTitle() JavaScript function. The function runs as soon as <button> is clicked (through its onclick event) and changes the content of <p> with id="innerDiv" (this element is initially empty) to display the text Welcome to WebScraping.
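
From a scraping point of view, this behavior has a practical consequence: a script that downloads the page source without running JavaScript sees the element as it was delivered, that is, empty. The sketch below illustrates this with html.parser from the standard library. The TextById class is a hypothetical helper written for this example, and it assumes the target element contains no nested tags.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text found directly inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside_target = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside_target = True

    def handle_endtag(self, tag):
        # Simplification: the target has no nested tags,
        # so any end tag closes it.
        self.inside_target = False

    def handle_data(self, data):
        if self.inside_target:
            self.text_parts.append(data)


page_source = """
<div>Press the button: <p id="innerDiv"></p></div>
<button id="btnTitle" onclick="placeTitle()">Load Page Title!</button>
"""


def text_of(element_id):
    parser = TextById(element_id)
    parser.feed(page_source)
    return "".join(parser.text_parts)


print(repr(text_of("innerDiv")))  # empty until JavaScript fills it in
print(repr(text_of("btnTitle")))  # static text is available immediately
```

Content that only appears after a script runs usually calls for a tool that can execute JavaScript, a topic covered later in the book.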

Important note

The HTML DOM is a standard for how to get, change, add, or delete HTML elements. Please visit the page on JavaScript HTML DOM on W3Schools (https://www.w3schools.com/js/js_htmldom.asp) for more detailed information.

The dynamic manipulation of HTML content, elements, attribute values, CSS, and HTML events with accessible internal functions and programming features makes JavaScript very popular in web development. There are many web-based technologies related to JavaScript, including JSON, jQuery, AngularJS, and Asynchronous JavaScript and XML (AJAX), among many more. Some of these will be discussed in the following subsections.

jQuery

jQuery is a JavaScript library for querying and manipulating the DOM. It addresses incompatibilities across browsers, providing API features to handle the HTML DOM, events, and animations. jQuery has been acclaimed globally for bringing interactivity to the web and for changing the way JavaScript code is written. It is lightweight in comparison to full JavaScript frameworks, easy to implement, and encourages short, readable code.

jQuery is a huge topic and requires adequate knowledge of JavaScript before embarking on it. We will use a jQuery-like Python library in Chapter 4.

Important note

For more information on jQuery, please visit https://www.w3schools.com/jquery/ and http://jquery.com/.

jQuery is mostly used for DOM-based activities, as discussed in this section, whereas AJAX is a collection of technologies, which we are going to learn about in the next section.

AJAX

AJAX is a web development technique that uses a group of web technologies on the client side to create asynchronous web applications.

JavaScript XMLHttpRequest (XHR) objects are used to execute AJAX on web pages and load page content without refreshing or reloading the page. Please visit the AJAX page on W3Schools (https://www.w3schools.com/js/js_ajax_intro.asp) for more information on AJAX. From a scraping point of view, a basic overview of JavaScript functionality will be valuable to understand how a page is built or manipulated, as well as to identify the dynamic components used.

Important note

Please visit https://developer.mozilla.org/en-US/docs/Web/JavaScript, https://www.javascript.com/, https://www.w3schools.com/js/js_intro.asp, and https://www.w3schools.com/js/js_ajax_intro.asp for more information on JavaScript and AJAX.

We have learned about a few JavaScript-based techniques and technologies that are commonly deployed in web development today. In the next section, we will learn about data-storing objects.

JSON

JSON is a format used for storing data and transporting it, for example, from a server to a web page. It is language-independent and preferred in web-based data interchange due to its compact size and readability. JSON files have the .json extension.

JSON data is normally formatted as name:value pairs, using a syntax derived from JavaScript object notation. JSON and XML are often compared, as they both carry and exchange data between various web resources. JSON is usually ranked higher than XML for its structure, which is simple, readable, self-descriptive, and easy to process.

For web applications using JavaScript, AJAX, or RESTful services, JSON is preferred over XML due to its fast and easy operation. JSON text converts readily to and from JavaScript objects. JSON is not a markup language, and it doesn’t contain any tags or attributes. Instead, it is a text-only format that can be sent by a server and processed by any programming language.

JSON data can combine objects (dictionaries) and arrays (lists), as in the following example:

{"mymembers":[
{ "firstName":"Aasira", "lastName":"Chapagain","cityName":"Kathmandu"},
{ "firstName":"Rakshya", "lastName":"Dhungel","cityName":"New Delhi"},
{ "firstName":"Shiba", "lastName":"Paudel","cityName":"Biratnagar"}
]}
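
In Python, data of this shape can be loaded with the standard json module, as sketched below. The variable names are our own. Python's parser follows the JSON grammar strictly, so the last item in an array must not be followed by a comma.

```python
import json

json_text = """
{"mymembers": [
  {"firstName": "Aasira", "lastName": "Chapagain", "cityName": "Kathmandu"},
  {"firstName": "Rakshya", "lastName": "Dhungel", "cityName": "New Delhi"},
  {"firstName": "Shiba", "lastName": "Paudel", "cityName": "Biratnagar"}
]}
"""

data = json.loads(json_text)  # JSON object -> dict, JSON array -> list

members = data["mymembers"]
cities = [member["cityName"] for member in members]
print(type(data).__name__, type(members).__name__)
print(cities)

# Serialize a Python object back into a JSON string.
print(json.dumps(members[0]))
```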

You have learned about JSON, which is a content holder. In the following section, we will discuss HTML styling using CSS and providing HTML tags with extra identification.

Important note

In Python, JSON maps naturally onto a mixture of dictionary and list objects. JSON is written as a string, and plenty of websites validate and format JSON strings, for example, https://jsonformatter.org/, https://jsonlint.com/, and https://www.freeformatter.com/json-formatter.html.

Please visit http://www.json.org/, https://jsonlines.org/, and https://www.w3schools.com/js/js_json_intro.asp for more information regarding JSON and JSON Lines.

CSS

The web-based technologies we have introduced so far deal with content, including binding, development, and processing. CSS describes the display properties of HTML elements and the appearance of web pages. CSS is used for styling and providing the desired appearance and presentation of HTML elements.

By using CSS, developers/designers can control the layout and presentation of a web document. CSS can be applied directly to an individual element in a page, embedded in the page using the <style> tag, or linked from a separate document.

The <style> tag can contain rules targeting individual elements as well as groups of repeated elements. As seen in the following code, multiple <a> elements exist, and some of them carry the class and id global attributes:

<html>
<head>
<style>
a{color:blue;}
h1{color:black; text-decoration:underline;}
#idOne{color:red;}
.classOne{color:orange;}
</style>
</head>
<body>
<h1> Welcome to Web Scraping </h1>Links:<a href="https://www.google.com"> Google </a> &nbsp;
<a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
<a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
</body>
</html>

The CSS rules defined inside the <style> tag in the preceding code block style the matching elements, resulting in the output shown in Figure 1.4:

Figure 1.4: Output of the HTML code using CSS

Although CSS is used to manage the appearance of HTML elements, CSS selectors (patterns used to select elements or the position of elements) often play a major role in the scraping process. We will be exploring CSS selectors in detail in Chapter 3.
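
To give a first impression of how selectors pick out elements, the sketch below mimics the three simplest selector forms (tag name, #id, and .class) over a well-formed snippet using only the standard library. The select function is a hypothetical helper written for illustration; it is not a CSS engine, and real scraping libraries provide far more complete selector support.

```python
import xml.etree.ElementTree as ET


def select(root, selector):
    """Match a simple selector: tag name, #id, or .class (no combinators)."""
    matches = []
    for element in root.iter():
        if selector.startswith("#"):
            matched = element.get("id") == selector[1:]
        elif selector.startswith("."):
            matched = selector[1:] in element.get("class", "").split()
        else:
            matched = element.tag == selector
        if matched:
            matches.append(element)
    return matches


body = ET.fromstring("""
<body>
  <h1>Welcome to Web Scraping</h1>
  <a href="https://www.google.com">Google</a>
  <a class="classOne" href="https://www.yahoo.com">Yahoo</a>
  <a id="idOne" href="https://www.wikipedia.org">Wikipedia</a>
</body>
""")

print([a.text for a in select(body, "a")])
print([a.get("href") for a in select(body, ".classOne")])
print([a.get("href") for a in select(body, "#idOne")])
```

The same patterns that a style sheet uses to decide what to color can therefore be reused to decide what to extract.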

Important note

Please visit https://www.w3.org/Style/CSS/ and https://www.w3schools.com/css/ for more detailed information on CSS.

In this section, you were introduced to some of the technologies that can be used for web scraping. In the upcoming section, you will learn about data-finding techniques. Most of them are built with web technologies you have already been introduced to.