Let's review a simple web scraper architecture:
The web is a huge, dynamic beast, and it changes often. The scheduler is responsible for making sure that the data the scraper holds stays fresh rather than stale. It does so by deciding, per website or per page, at what rate to scrape; in other words, when the next scraping run will happen.
In reality, you would want the scheduler to feed from a persistent data store that holds all sources and their upcoming scraping times.
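A minimal sketch of such a scheduler, assuming an in-memory priority queue stands in for the persistent store (the `ScheduleEntry` record shape and method names are illustrative, not from the original architecture):

```python
import heapq
import time
from dataclasses import dataclass, field


# Hypothetical record: one entry per source, holding its scrape
# interval and the next time it is due.
@dataclass(order=True)
class ScheduleEntry:
    next_run: float
    url: str = field(compare=False)
    interval_s: float = field(compare=False)


class Scheduler:
    def __init__(self):
        # Min-heap ordered by next_run, so the most overdue source
        # is always at the front.
        self._queue = []

    def add_source(self, url, interval_s, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._queue, ScheduleEntry(now, url, interval_s))

    def pop_due(self, now):
        """Return all URLs whose next scrape time has arrived,
        rescheduling each one a full interval ahead."""
        due = []
        while self._queue and self._queue[0].next_run <= now:
            entry = heapq.heappop(self._queue)
            due.append(entry.url)
            heapq.heappush(
                self._queue,
                ScheduleEntry(entry.next_run + entry.interval_s,
                              entry.url, entry.interval_s),
            )
        return due
```

A caller would register `acme.org` with a 300-second interval, then poll `pop_due` in a loop and hand the returned URLs to the scraping workers; a real system would back the queue with a durable store so schedules survive restarts.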
For example, you could hold a record specifying that the website acme.org has to be scraped once every 5 minutes. You could even add more sophistication: state that acme.org has to be scraped every 5 minutes during the day, but at night, in order to save resources, a 30-minute cycle is good enough.
Whatever your scheduling policy is, it is encapsulated within the Scheduler domain.
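The day/night policy above can be sketched as a small function the scheduler consults when computing a source's next scrape time; the window boundaries here (07:00 to 22:00) are assumptions for illustration:

```python
from datetime import datetime, time as dtime

# Assumed day window; a real policy would make these per-source settings.
DAY_START = dtime(7, 0)
DAY_END = dtime(22, 0)


def scrape_interval_seconds(now):
    """Return the seconds to wait before the next scrape: a 5-minute
    cycle during the day window, a 30-minute cycle at night."""
    if DAY_START <= now.time() < DAY_END:
        return 5 * 60
    return 30 * 60
```

Because the policy is a plain function, swapping in a different rule (per-source windows, weekend schedules) touches only the Scheduler domain, which is exactly the encapsulation the text describes.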