Just as the concurrent.futures module offers ThreadPoolExecutor, which facilitates the creation and manipulation of multiple threads, it offers ProcessPoolExecutor for working with processes. The ProcessPoolExecutor class, also featured in the concurrent.futures package, was used to implement our parallel Web crawler. To implement this case study, we created a Python module named process_pool_executor_web_crawler.py.
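Before diving into the crawler module, a minimal sketch may help illustrate the ProcessPoolExecutor API; the worker function and inputs here are illustrative only, not part of the case study:

```python
from concurrent.futures import ProcessPoolExecutor


def square(n):
    # A trivial CPU-bound task; each call may run in a separate process.
    return n * n


if __name__ == "__main__":
    # The executor manages a pool of worker processes; map() distributes
    # the inputs across them and returns results in input order.
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(square, range(5)))
    print(results)  # [0, 1, 4, 9, 16]
```

Note that the executor's interface mirrors ThreadPoolExecutor exactly, which is why so little of the threaded crawler needs to change.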
The code begins with the imports known from the previous examples, such as requests, the Manager module, and so on. Regarding the definition of the tasks, little has changed compared with the threaded example from the previous chapter, except that data to be manipulated is now passed in through function arguments; refer to the following signatures:
The group_urls_task
function is defined as follows:
def group_urls_task(urls, result_dict, html_link_regex)
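A plausible sketch of this function follows, assuming urls is a Manager queue of URLs to crawl and result_dict is a Manager dictionary shared between the worker processes (the exact body in process_pool_executor_web_crawler.py may differ):

```python
import queue
from multiprocessing import Manager


def group_urls_task(urls, result_dict, html_link_regex):
    """Pull one URL from the shared queue and register it in the shared
    result dictionary; html_link_regex is carried along for the crawl
    stage. Sketch only -- one plausible implementation."""
    try:
        # Block for up to 50 ms waiting for a URL to become available.
        url = urls.get(True, 0.05)
        result_dict[url] = None  # seed an empty entry for this URL
    except queue.Empty:
        pass  # nothing left to group


if __name__ == "__main__":
    manager = Manager()
    urls = manager.Queue()
    result_dict = manager.dict()
    urls.put("http://example.com")
    group_urls_task(urls, result_dict, r'href="(.*?)"')
    print(dict(result_dict))  # {'http://example.com': None}
```

Because Manager proxies are picklable, these arguments can be passed directly to tasks submitted to a ProcessPoolExecutor, which is exactly why the data now travels as function arguments rather than shared globals.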