The data flow in Scrapy is controlled by the execution engine and goes like this:
The process starts with locating the chosen spider and opening the first URL from its start_urls list. That URL is then scheduled as a request in the scheduler; this scheduling step is internal to Scrapy.
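Conceptually, the scheduler keeps the pending requests in a queue. The sketch below mimics that step with a plain FIFO queue; the URLs are made up for illustration, and Scrapy's real scheduler additionally handles request deduplication and priorities.

```python
from collections import deque

# Illustrative stand-in for the scheduler: each start URL is
# queued as a pending request, and the engine later pops the
# next one to crawl.
start_urls = ["https://example.com/page1", "https://example.com/page2"]

scheduler = deque()
for url in start_urls:
    scheduler.append(url)           # each URL becomes a queued request

next_request = scheduler.popleft()  # the engine asks for the next request
```

This FIFO behavior is only a simplification; Scrapy actually defaults to LIFO (depth-first) ordering, which can be changed through queue settings.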
The engine then asks the scheduler for the next requests to crawl. The scheduler returns them to the engine, which forwards each request to the downloader through the downloader middlewares. These middlewares are where we configure things like proxies and user-agent settings.
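A downloader middleware that sets a user-agent header might look like the sketch below. The process_request hook name matches Scrapy's downloader-middleware interface, but the Request class here is a minimal stand-in so the example runs without Scrapy installed, and the user-agent string is an arbitrary example.

```python
class Request:
    """Minimal stand-in for scrapy.Request, just enough to demo headers."""
    def __init__(self, url):
        self.url = url
        self.headers = {}

class UserAgentMiddleware:
    """Attach a User-Agent header before the request reaches the downloader."""
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "my-crawler/1.0"
        return None  # returning None lets the engine continue processing
```

In a real project, such a class would be registered under the DOWNLOADER_MIDDLEWARES setting so the engine calls it for every outgoing request.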
The downloader fetches the page and passes the response to the spider, where the parse method selects specific elements from the response.
The spider then sends the scraped items back to the engine. The engine forwards them to the item pipeline, where we can add post-processing such as cleaning or validation.
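An item pipeline receives each item the spider yields. The sketch below follows Scrapy's process_item method signature, but is written as a plain class so it runs on its own; the whitespace-stripping logic is just an example of post-processing.

```python
class StripWhitespacePipeline:
    """Example post-processing step: trim whitespace from string fields."""
    def process_item(self, item, spider):
        # Return a cleaned copy of the item; non-string values pass through.
        return {key: value.strip() if isinstance(value, str) else value
                for key, value in item.items()}
```

In a real project this class would be enabled through the ITEM_PIPELINES setting, and Scrapy would call process_item once per scraped item.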
The same process repeats for each URL until no requests remain in the scheduler.