R Web Scraping Quick Start Guide

By Olgun Aydin

Overview of this book

Web scraping is a technique for extracting data from websites. It simulates the behavior of a website user to turn the website itself into a web service from which data can be retrieved or through which new data can be introduced. This book gives you all you need to get started with scraping web pages using the R programming language. You will learn about the rules of RegEx and XPath, the key components for scraping website data, and we will show you web scraping techniques, methodologies, and frameworks. With this book's guidance, you will become comfortable with the tools used to write and test RegEx and XPath rules. We will focus on examples of scraping data from dynamic websites and how to implement the techniques learned. You will learn how to collect URLs and then create XPath rules for your first web scraping script using the rvest library. From the data you collect, you will be able to calculate statistics and create R plots to visualize them. Finally, you will discover how to use Selenium drivers with R for more sophisticated scraping. You will create AWS instances and use R to connect to a PostgreSQL database hosted on AWS. By the end of the book, you will be sufficiently confident to create end-to-end web scraping systems using R.

Learning about data on the internet

Data is an essential part of any research, whether it be academic, marketing, or scientific. The World Wide Web (WWW) contains all kinds of information from different sources, including social, financial, security, and academic resources, all accessible via the internet.

People often want to collect and analyse data from multiple websites. Websites belonging to the same category may still display information in different formats. Even on a single website, you may not be able to see all the data at once; it may be spread across multiple pages under various sections.

Most websites do not allow you to save a copy of their data to local storage. Often, the only option is to manually copy and paste the data shown by the website into a local file on your computer, which is a very tedious process that can take a lot of time.

Web scraping is a technique by which you can extract data from multiple websites into a single spreadsheet or database, making it easier to analyse or even visualize the data. It is used to transform unstructured data on the web into structured data in a centralized local database.
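The following is a minimal sketch of this idea, assuming the rvest package is installed; the URL https://example.com and the h1 selector are placeholders for illustration, not examples from the book:

```r
# A minimal sketch, assuming rvest is installed; the URL and the "h1"
# CSS selector are placeholders, not from the book.
library(rvest)

page   <- read_html("https://example.com")    # download and parse the page
titles <- html_text(html_nodes(page, "h1"))   # extract the text of every <h1> element

# Store the scraped values in a single local file, ready for analysis
write.csv(data.frame(title = titles), "titles.csv", row.names = FALSE)
```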

Well-known companies, including Google, Amazon, Wikipedia, Facebook, and many more, provide APIs (Application Programming Interfaces) with object classes that facilitate interaction with variables, data structures, and other software components. In this way, data collection from those websites is fast and can be performed without any web scraping software.
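As a rough sketch of what API-based collection can look like from R, the code below sends a plain HTTP request and parses the JSON reply; the endpoint https://api.example.com/v1/items is a made-up placeholder, and the httr and jsonlite packages are assumed to be installed:

```r
# A minimal sketch of API-based data collection; the endpoint URL is a
# placeholder, not a real service.
library(httr)
library(jsonlite)

response <- GET("https://api.example.com/v1/items")   # plain HTTP GET request
stop_for_status(response)                             # fail loudly on HTTP errors

# APIs usually return JSON, which parses directly into R data structures
items <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
head(items)
```

Because the response is already structured, no HTML parsing or selector rules are needed.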

One of the most useful abstractions when scraping semi-structured web pages is the labeled rooted tree. In such a tree, the labels are the tags of the HTML markup syntax, and the hierarchy of the tree represents the different nesting levels of the elements that make up the web page. This representation of a web page as an ordered, labeled rooted tree is referred to as the DOM (Document Object Model), a standard maintained by the World Wide Web Consortium (W3C).

The general idea behind the DOM is that an HTML web page is plain text containing HTML tags, keywords defined in the markup language that the browser interprets to render web-specific elements. HTML tags are arranged in a hierarchical structure, and the nodes of the DOM tree correspond to the tags in this hierarchy. We will take a closer look at DOM structures when we focus on XPath rules.
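To make the tree picture concrete, here is a small sketch using the xml2 package; the HTML snippet is invented for illustration, and xml2 is assumed to be installed:

```r
# A minimal sketch of the DOM as a labeled rooted tree; the HTML snippet
# is made up for illustration.
library(xml2)

snippet <- '<html><body><div id="main"><p>First</p><p>Second</p></div></body></html>'
doc     <- read_html(snippet)

xml_name(xml_root(doc))                    # "html": the label of the root node
xml_find_all(doc, "//div[@id='main']/p")   # an XPath rule selecting both <p> children
xml_text(xml_find_first(doc, "//p"))       # "First": the text held by the first <p>
```

Each call walks the same tree: the root is labeled html, the div node sits one nesting level below body, and the two p nodes are its children.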