R Web Scraping Quick Start Guide

By: Olgun Aydin

Overview of this book

Web scraping is a technique for extracting data from websites. It simulates the behavior of a website user, turning the website itself into a web service from which data can be retrieved or into which new data can be introduced. This book gives you all you need to get started with scraping web pages using R. You will learn the rules of RegEx and XPath, which are key components for scraping website data. We will show you web scraping techniques, methodologies, and frameworks. With this book's guidance, you will become comfortable with the tools used to write and test RegEx and XPath rules. We will focus on examples of scraping data from dynamic websites and on how to implement the techniques learned. You will learn how to collect URLs and then create XPath rules for your first web scraping script using the rvest library. From the data you collect, you will be able to calculate statistics and create R plots to visualize them. Finally, you will discover how to use Selenium drivers with R for more sophisticated scraping. You will create AWS instances and use R to connect to a PostgreSQL database hosted on AWS. By the end of the book, you will be confident enough to create end-to-end web scraping systems using R.
Table of Contents (7 chapters)

Advantages and disadvantages of using Selenium for web scraping

Because WebDriver uses a real web browser to access the website, its visits look no different from those of a human browsing the web. When you navigate to a web page using WebDriver, the browser loads all the website resources (JavaScript files, images, CSS files, and so on) and executes all the JavaScript on the page. It also keeps all the cookies created by the websites you visit. This makes it very difficult for a website to determine whether it was accessed by a real person or a robot. With WebDriver, all of this takes only a few simple steps, whereas simulating the same actions in a program that sends handcrafted HTTP requests to the server is really hard.
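The behavior described above can be sketched in R with the RSelenium package. This is only a minimal illustration, assuming RSelenium is installed and a local Firefox browser is available; the function name `fetch_rendered_page`, the port number, and the example URL are placeholders chosen for this sketch, not part of the book's code.

```r
# Minimal sketch: drive a real browser via WebDriver and collect what it loaded.
# Assumptions: the RSelenium package is installed and Firefox is available locally.
# The function name, port, and URL below are hypothetical placeholders.

fetch_rendered_page <- function(url, browser = "firefox", port = 4545L) {
  # Start a Selenium server together with a browser session.
  driver <- RSelenium::rsDriver(browser = browser, port = port, verbose = FALSE)
  client <- driver$client

  # Close the browser and stop the server even if a later step fails.
  on.exit({
    client$close()
    driver$server$stop()
  }, add = TRUE)

  # The browser loads the page resources and runs its JavaScript.
  client$navigate(url)

  list(
    title   = client$getTitle()[[1]],       # page title as the browser sees it
    html    = client$getPageSource()[[1]],  # HTML after scripts have run
    cookies = client$getAllCookies()        # cookies kept by the browser session
  )
}

# Example call with a placeholder URL; runs only in an interactive session.
if (interactive()) {
  page <- fetch_rendered_page("https://example.com")
  cat("Title:", page$title, "\n")
  cat("Cookies stored:", length(page$cookies), "\n")
}
```

The returned `html` string can then be handed to a parser such as rvest, since it reflects the page after the browser has executed its scripts rather than the raw response to a single HTTP request.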

Sometimes, the data to be extracted may not be included in the raw HTML that was received after an HTTP request was made. Although it is possible to receive this data only with HTTP...