Go Web Scraping Quick Start Guide

By: Vincent Smith

Overview of this book

Web scraping is the process of extracting information from the web using tools that perform scraping and crawling. Go is emerging as a language of choice for scraping, with a variety of libraries available. This book quickly explains how to scrape data from various websites using Go libraries such as Colly and Goquery. It starts with an introduction to the use cases for building a web scraper and the main features of the Go programming language, along with setting up a Go environment. It then moves on to HTTP requests and responses and how Go handles them. You will also learn some basic rules of web scraping etiquette. You will be taught how to navigate through a website using a breadth-first and then a depth-first search, and how to find and follow links. You will learn ways to track history in order to avoid loops, and how to protect your web scraper using proxies. Finally, the book covers the Go concurrency model and how to run scrapers in parallel, along with large-scale distributed web scraping.
Table of Contents (10 chapters)

The Request/Response Cycle

Before you can build a web scraper, you must take a moment to think about how the internet works. At its core, the internet is a network of computers connected together, discoverable through Domain Name System (DNS) servers. When you want to visit a website, your browser sends the domain name from the website's URL to a DNS server, which translates it into an IP address, and your browser then sends a request to the machine at that IP address. The machine, called a web server, receives and inspects the request, and decides what to send back to your browser. Your browser then parses the information sent by the server and displays content on your screen depending on the format of the data. The web server and browser are able to communicate because both adhere to a global set of rules called the Hypertext Transfer Protocol (HTTP). In this chapter, you will learn some of the key points...