Go Web Scraping Quick Start Guide

By: Vincent Smith

Overview of this book

Web scraping is the process of extracting information from the web using tools that perform scraping and crawling. Go is emerging as the language of choice for scraping, with a variety of libraries available. This book quickly explains how to scrape data from various websites using Go libraries such as Colly and Goquery. It starts with an introduction to the use cases for building a web scraper and the main features of the Go programming language, along with setting up a Go environment. It then moves on to HTTP requests and responses and how Go handles them. You will also learn basic web scraping etiquette. You will be taught how to navigate a website using a breadth-first and then a depth-first search, and how to find and follow links. You will learn how to track history in order to avoid loops, and how to protect your web scraper using proxies. Finally, the book covers the Go concurrency model and how to run scrapers in parallel, along with large-scale distributed web scraping.
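
To make the idea of breadth-first navigation with history tracking concrete, here is a minimal sketch using only the Go standard library. It is not taken from the book: the function name crawlBFS and the in-memory link map are assumptions made for illustration, standing in for fetching a page and extracting its links.

```go
package main

import "fmt"

// crawlBFS walks a link graph breadth-first from start and returns the
// pages in the order they were visited. The links map is a hypothetical
// stand-in for "fetch this page and extract the links it contains".
func crawlBFS(links map[string][]string, start string) []string {
	// visited is the crawl history; it prevents revisiting a page,
	// which is what keeps cyclic links from causing an endless loop.
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string

	for len(queue) > 0 {
		// Take the oldest queued page first (FIFO gives breadth-first order).
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)

		for _, next := range links[page] {
			if visited[next] {
				continue // already seen, skip to avoid loops
			}
			visited[next] = true
			queue = append(queue, next)
		}
	}
	return order
}

func main() {
	// A small made-up site with cycles: /a links back to /, and /c to /a.
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c", "/"},
		"/b": {"/c"},
		"/c": {"/a"},
	}
	fmt.Println(crawlBFS(site, "/")) // prints [/ /a /b /c]
}
```

Treating the slice as a stack instead (taking the most recently added page rather than the oldest) would turn the same loop into a depth-first traversal. The visited map plays the same role in both cases.
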
Table of Contents (10 chapters)

Summary

In this chapter, we looked under the hood at the components that make a solid web scraping system. We used colly to scrape HTML pages that did not require JavaScript. We used chrome-protocol to drive web browsers to scrape sites that do require JavaScript. Finally, we examined dataflowkit and saw how its architecture opens the door for building distributed web crawlers. There is more to learn and do when it comes to building distributed systems in Go, but this is where the scope of this book ends. I hope you check out some other publications on building applications in Go and continue to hone your skills!