Parallel Programming with Python

By: Jan Palach, Jan Palach V Cruz da Silva

Crawling the Web using ProcessPoolExecutor


Just as the concurrent.futures module offers ThreadPoolExecutor, which facilitates the creation and management of a pool of threads, it offers ProcessPoolExecutor to do the same for processes. The ProcessPoolExecutor class, also part of the concurrent.futures package, was used to implement our parallel Web crawler. To implement this case study, we created a Python module named process_pool_executor_web_crawler.py.
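As a rough illustration of how ProcessPoolExecutor hands work to a pool of processes, the following minimal sketch submits one task per page and collects the results as they complete. The extract_links function, the HTML_LINK_PATTERN pattern, and the SAMPLE_PAGES data are hypothetical stand-ins chosen for this example; they are not taken from process_pool_executor_web_crawler.py, and no network access is performed.

```python
import re
from concurrent.futures import ProcessPoolExecutor, as_completed

# Hypothetical pattern and in-memory pages, used only for illustration.
HTML_LINK_PATTERN = r'href="(https?://[^"]+)"'

SAMPLE_PAGES = {
    "page_a": '<a href="http://example.com/1">one</a>',
    "page_b": '<a href="http://example.com/2">two</a> '
              '<a href="https://example.com/3">three</a>',
}


def extract_links(name, html, pattern):
    """Return the page name together with the links found in its HTML."""
    return name, re.findall(pattern, html)


def main():
    # Each submit() call sends the function and its arguments to a worker
    # process and returns a Future that will hold the result.
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [
            executor.submit(extract_links, name, html, HTML_LINK_PATTERN)
            for name, html in SAMPLE_PAGES.items()
        ]
        # as_completed() yields futures in the order they finish,
        # which may differ from the submission order.
        for future in as_completed(futures):
            name, links = future.result()
            print(name, links)


if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__":` guard matters here: on platforms where worker processes are started by re-importing the main module, it keeps the workers from creating pools of their own.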

The code begins with the imports familiar from the previous examples, such as requests, Manager (from the multiprocessing module), and so on. The task definitions have changed little compared with the thread-based example from the previous chapter, except that we now pass the data to be manipulated as function arguments; refer to the following signatures:

The group_urls_task function is defined as follows:

def group_urls_task(urls, result_dict, html_link_regex)

The crawl_task function is defined as...
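The idea of passing the data to be manipulated as function arguments can be sketched as follows. This is an assumption-laden illustration, not the book's code: record_links_task is a hypothetical task invented for this example, the pages are in-memory stand-ins rather than real HTTP responses, and only the parameter names result_dict and html_link_regex are borrowed from the group_urls_task signature above. The shared dictionary comes from multiprocessing.Manager, whose proxy objects can be passed to worker processes.

```python
import re
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager


def record_links_task(url, html, result_dict, html_link_regex):
    """Hypothetical task: store the links found in `html` under `url`."""
    result_dict[url] = html_link_regex.findall(html)
    return url


def main():
    # Stand-in pages so the sketch runs without network access.
    pages = {
        "http://example.com/a": '<a href="http://example.com/b">b</a>',
        "http://example.com/b": '<a href="http://example.com/a">a</a>',
    }
    html_link_regex = re.compile(r'href="(https?://[^"]+)"')

    with Manager() as manager:
        # A Manager dict is a proxy object; writes made in worker
        # processes are visible in the parent.
        result_dict = manager.dict()
        with ProcessPoolExecutor(max_workers=2) as executor:
            futures = [
                executor.submit(record_links_task, url, html,
                                result_dict, html_link_regex)
                for url, html in pages.items()
            ]
            for future in futures:
                future.result()  # re-raises any exception from the worker
        for url, links in sorted(result_dict.items()):
            print(url, links)


if __name__ == "__main__":
    main()
```

Because each worker is a separate process with its own memory, everything a task needs (the shared dictionary and the compiled regular expression included) has to reach it through its arguments rather than through module-level state modified at run time.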