
Building a web spider using goroutines and channels


Let's take the largely useless capitalization application and do something practical with it. Here, our goal is to build a rudimentary spider. In doing so, we'll accomplish the following tasks:

  • Read five URLs

  • Read those URLs and save the contents to a string

  • Write that string to a file when all URLs have been scanned and read

These kinds of applications are written every day, and they're the ones that benefit the most from concurrency and non-blocking code.

It probably goes without saying, but this is not a particularly elegant web scraper. For starters, it only knows a few start points—the five URLs that we supply it. Also, it's neither recursive nor thread-safe in terms of data integrity.
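If data integrity did matter, one common remedy is to guard shared state, such as the accumulated page text, behind a mutex. The following is a minimal, self-contained sketch of that idea; the scrapedText type and its append method are hypothetical and are not part of the spider we build in this section:

package main

import (
  "fmt"
  "sync"
)

// scrapedText is a hypothetical holder for shared state; the mutex ensures
// that only one goroutine appends to the string at any given time.
type scrapedText struct {
  mu   sync.Mutex
  text string
}

func (s *scrapedText) append(chunk string) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.text += chunk
}

func main() {
  s := &scrapedText{}
  var wg sync.WaitGroup

  for i := 0; i < 5; i++ {
    wg.Add(1)
    go func(n int) {
      defer wg.Done()
      // In a real spider, this chunk would be a page body.
      s.append(fmt.Sprintln("chunk", n))
    }(i)
  }

  wg.Wait()
  fmt.Println(s.text)
}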

That said, the following code works and demonstrates how we can use channels and the select statement:

package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "sync"
  "time"
)

var applicationStatus bool
var urls []string
var urlsProcessed int
var foundUrls []string
var fullText string
var totalURLCount int
var wg sync.WaitGroup

var v1 int

First, we have our most basic global variables that we'll use for the application state. The applicationStatus variable tells us that our spider process has begun and urls is our slice of simple string URLs. The rest are idiomatic data storage variables and/or application flow mechanisms. The following code snippet is our function to read the URLs and pass them across the channel:

func readURLs(statusChannel chan int, textChannel chan string) {

  time.Sleep(time.Millisecond * 1)
  fmt.Println("Grabbing", len(urls), "urls")
  for i := 0; i < totalURLCount; i++ {

    fmt.Println("Url", i, urls[i])
    resp, err := http.Get(urls[i])
    if err != nil {
      fmt.Println("Could not fetch", urls[i])
      statusChannel <- 0
      continue
    }

    text, err := ioutil.ReadAll(resp.Body)
    resp.Body.Close()
    if err != nil {
      fmt.Println("No HTML body")
      statusChannel <- 0
      continue
    }

    textChannel <- string(text)

    statusChannel <- 0

  }

}

The readURLs function takes statusChannel and textChannel for communication and loops through the urls slice, sending each page's text on textChannel and a simple ping on statusChannel. Next, let's look at the function that will append scraped text to the full text:

func addToScrapedText(textChannel chan string, processChannel chan bool) {

  for {
    select {
    case pC := <-processChannel:
      if pC == true {
        // hang on
      }
      if pC == false {
        // kill signal: close both channels and stop this goroutine
        close(textChannel)
        close(processChannel)
        return
      }
    case tC := <-textChannel:
      fullText += tC

    }

  }

}

We use the addToScrapedText function to accumulate processed text and add it to a master text string. We also close our two primary channels when we get a kill signal on our processChannel. Let's take a look at the evaluateStatus() function:

func evaluateStatus(statusChannel chan int, textChannel chan string, processChannel chan bool) {

  for {
    select {
    case status := <-statusChannel:

      fmt.Print(urlsProcessed, totalURLCount)
      urlsProcessed++
      if status == 0 {

        fmt.Println("Got url")

      }
      if status == 1 {

        close(statusChannel)
      }
      if urlsProcessed == totalURLCount {
        fmt.Println("Read all top-level URLs")
        processChannel <- false
        applicationStatus = false

      }
    }

  }
}

At this juncture, all that the evaluateStatus function does is determine what's happening in the overall scope of the application. When we send a 0 (our aforementioned ping) through this channel, we increment our urlsProcessed variable. When we send a 1, it's a message that we can close the channel. Finally, let's look at the main function:

func main() {
  applicationStatus = true
  statusChannel := make(chan int)
  textChannel := make(chan string)
  processChannel := make(chan bool)
  totalURLCount = 0

  urls = append(urls, "http://www.mastergoco.com/index1.html")
  urls = append(urls, "http://www.mastergoco.com/index2.html")
  urls = append(urls, "http://www.mastergoco.com/index3.html")
  urls = append(urls, "http://www.mastergoco.com/index4.html")
  urls = append(urls, "http://www.mastergoco.com/index5.html")

  fmt.Println("Starting spider")

  urlsProcessed = 0
  totalURLCount = len(urls)

  go evaluateStatus(statusChannel, textChannel, processChannel)

  go readURLs(statusChannel, textChannel)

  go addToScrapedText(textChannel, processChannel)

  for {
    if applicationStatus == false {
      fmt.Println(fullText)
      fmt.Println("Done!")
      break
    }
    select {
    case sC := <-statusChannel:
      fmt.Println("Message on StatusChannel", sC)

    }
  }

}

This is a basic extrapolation of our earlier capitalization application. However, each piece here is responsible for some aspect of reading URLs or appending its respective content to a larger variable.

In the following code, we create a sort of master loop that lets you know when a URL has been grabbed on statusChannel:

  for {
    if applicationStatus == false {
      fmt.Println(fullText)
      fmt.Println("Done!")
      break
    }
    select {
      case sC := <- statusChannel:
        fmt.Println("Message on StatusChannel",sC)

    }
  }

Often, you'll see this loop wrapped in a go func() and coordinated with a sync.WaitGroup, or not wrapped at all, depending on the type of feedback you require.
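As a rough, self-contained sketch of that first variation (and not the code we use in this chapter), the feedback loop could run in its own goroutine while main() blocks on a sync.WaitGroup until the status channel is closed:

package main

import (
  "fmt"
  "sync"
)

func main() {
  statusChannel := make(chan int)
  var wg sync.WaitGroup

  // Wrap the monitoring loop in a goroutine; the WaitGroup lets main()
  // block until the status channel is closed and the loop drains.
  wg.Add(1)
  go func() {
    defer wg.Done()
    for sC := range statusChannel {
      fmt.Println("Message on StatusChannel", sC)
    }
  }()

  // Stand-in for the spider: send a few status pings, then close.
  go func() {
    for i := 0; i < 5; i++ {
      statusChannel <- 0
    }
    close(statusChannel)
  }()

  wg.Wait()
  fmt.Println("Done!")
}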

The control flow, in this case, is evaluateStatus, which works as a channel monitor that lets us know when data crosses each channel and ends execution when it's complete. The readURLs function immediately begins reading our URLs, extracting the underlying data and passing it on to textChannel. At this point, our addToScrapedText function takes each sent HTML file and appends it to the fullText variable. When evaluateStatus determines that all URLs have been read, it sets applicationStatus to false. At this point, the infinite loop at the bottom of main() quits.

As mentioned, a crawler can hardly get more rudimentary than this, but seeing a real-world example of how goroutines can work in concert will set us up for safer and more complex examples in the coming chapters.