All scrapers can be split into two categories that require different infrastructure and techniques:

  1. Generic scrapers. A good example is Googlebot: it wanders around the internet, downloads all the pages it can reach, and indexes them for the Google search engine. There are quite a lot of articles and books on generic scraping, so I’m not going to go into detail about it here.
  2. Targeted scrapers. These are used when you need to download specific pages on a website and extract structured, valuable information. Examples of such scrapers are news scrapers, competitor price analysis, and making a local copy of publicly available data (e.g. US patents). If you need this data just once, it’s fine to write a simple script and run it on your local machine, but if the data needs to be updated regularly, you’ll soon want to make sure it runs reliably and that you have observability over the current system status.

Now let’s try to design a system that can execute targeted scrapers regularly. The system only has one functional requirement: it must be able to execute arbitrary scrapers on schedule.

What about non-functional requirements?

  1. It must reliably schedule and execute scrapers
  2. It should not DDoS the domains being scraped
  3. It should be highly available
  4. It should properly handle stuck scrapers (infinite loops, lost workers, etc.)
  5. It should provide monitoring for itself and for the scrapers being executed

Here’s an example of how such a system could look.

As a result, we have the following components:

Now let’s see how everything works together:

  1. Scheduler loads all scraper configs and starts scheduling the jobs.

  2. Scheduler registers a job in the Scrapers jobs storage with the status Queued

  3. Scheduler enqueues the registered job in the Jobs queue

  4. Worker dequeues a scraper job from the Jobs queue and marks it as Started

  5. Worker spawns a Scraper runner that executes the scraper logic. It also starts a thread that continuously reports (heartbeats) that the job is still active, so the Scheduler doesn’t enqueue another job

  6. Scraper runner executes scraper logic.

  7. All of the requests made by the scraper logic are intercepted by the Requests middleware (see the sketch after this list), which can be used for:

    1. Rate limiting requests to certain domains
    2. Automatically checking if the request is blocked by robots.txt
    3. Using proxies
    4. Modifying the request (e.g. adding some headers)
    5. Logging
  8. Scraper logic may save its results somewhere or call some APIs based on data it has scraped

  9. When the scraper logic has finished executing, the Scraper runner marks the job as Succeeded or Failed.
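To make the Requests middleware a bit more concrete, here is a minimal sketch of what it could look like for scrapers built on the requests library. The class name, the one-request-per-second limit, and the user agent are illustrative assumptions, not part of any existing library.

import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

class MiddlewareSession(requests.Session):
    """A sketch of a requests middleware: rate limiting + robots.txt checks."""

    def __init__(self, min_interval_s: float = 1.0, user_agent: str = "my-scraper"):
        super().__init__()
        self.min_interval_s = min_interval_s  # minimal delay between requests per domain
        self.user_agent = user_agent
        self.last_request_at: dict[str, float] = {}
        self.robots: dict[str, urllib.robotparser.RobotFileParser] = {}

    def request(self, method, url, **kwargs):
        domain = urlparse(url).netloc

        # Check robots.txt (fetched and cached per domain).
        if domain not in self.robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{domain}/robots.txt")
            parser.read()
            self.robots[domain] = parser
        if not self.robots[domain].can_fetch(self.user_agent, url):
            raise PermissionError(f"Blocked by robots.txt: {url}")

        # Naive per-domain rate limiting.
        elapsed = time.monotonic() - self.last_request_at.get(domain, 0.0)
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self.last_request_at[domain] = time.monotonic()

        # Modify the request (headers) and log it; a proxy could also be set via self.proxies.
        kwargs.setdefault("headers", {})["User-Agent"] = self.user_agent
        print(f"[middleware] {method} {url}")
        return super().request(method, url, **kwargs)

A scraper would then call MiddlewareSession().get(...) instead of requests.get(...), or the platform could inject such a session into every scraper it runs.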

Scraper runner

How does the Scraper runner execute the scraper logic? There are multiple options. Let’s discuss how we can solve this problem in Python, since scrapers are most commonly written in it.

One approach, used in Scrapyd, is to deploy eggified Python packages. Basically, you package your application in the egg format and then Scrapyd spawns a process that loads the provided egg file and executes it. This gives us several advantages:

  1. Packages may include dependencies in the egg file and different scrapers may use different versions of the same dependency. This can be especially useful if your system is multi-tenant.
  2. It’s possible to dynamically deploy scrapers.

In this case, scraper developers and platform owners are decoupled: deploying a new scraper version is just an API call to the running platform, as sketched below.
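For illustration, here is roughly what deploying and scheduling look like against Scrapyd’s HTTP API (the addversion.json and schedule.json endpoints). The URL, project name, version, and file path are assumptions made for this sketch.

import requests

SCRAPYD_URL = "http://localhost:6800"  # assumption: a Scrapyd instance running locally

# Upload an egg (built e.g. with `python setup.py bdist_egg`) as a new project version.
with open("dist/my_scrapers-1.0.0-py3.11.egg", "rb") as egg:
    response = requests.post(
        f"{SCRAPYD_URL}/addversion.json",
        data={"project": "my_scrapers", "version": "1.0.0"},
        files={"egg": egg},
    )
    print(response.json())  # e.g. {"status": "ok", ...}

# Schedule one of the spiders from the freshly deployed version.
response = requests.post(
    f"{SCRAPYD_URL}/schedule.json",
    data={"project": "my_scrapers", "spider": "sample_spider"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}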

Another approach can be used when the scraper developers and the platform owners don’t need to be decoupled, e.g. when a single team both develops the scrapers and maintains the scraping platform.

In this case, the scraper runner implementation can simply import all the scraper implementations and execute them based on what it dequeues from the queue. Here’s an example implementation:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, override  # override requires Python 3.12+ (or typing_extensions)

import requests

# Scraper implementation interface
class ScraperABC(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def execute(self, config: Any) -> Any:
        ...

# Scraper config that is passed to the Scraper implementation
@dataclass
class ScraperConfig:
    ...

# Sample Scraper implementation
class SampleScraper(ScraperABC):
    @property
    @abstractmethod
    def name(self):
        return "sample_scraper"

    @override
    def execute(config: ScraperConfig) -> Any:
        # <do the actual scrapping>
        text = requests.get("https://example.com").text
        return text[:100]


class Worker:
    def __init__(self, scrapers: list[ScraperABC]):
        self.queue = ...
        self.scrapers = {
          scraper.name: scraper 
          for scraper in scrapers
        }

    # Scraper runner implementation
    def execute_scraper(self, name: str, config: Any) -> Any:
        if name not in self.scrapers:
            raise Exception(f"Unknown scraper {name}")
        return self.scrapers[name].execute(config)

    # Implement consuming the queue
    def start(self):
        while True:
            task = self.queue.dequeue()
            self.execute_scraper(
                task.scraper_name, 
                task.scraper_config
            )

if __name__ == "__main__":
    Worker(
        scrapers=[
            SampleScraper()
        ]
    ).start()

This approach is simpler to maintain and has fewer caveats, although it has some limitations:

  1. All of the scrapers share dependencies, which may not be a big deal since this platform is not intended to be multi-tenant.
  2. Deploying new scrapers is not dynamic and is coupled with deploying the platform, which may slow down the release cycle.

Availability and Scalability of the platform

The availability and scalability of the platform come down to the availability and scalability of its components.

To get a clearer understanding of which components to choose, let’s do some back-of-the-envelope calculations. Imagine we have:

  1. 1000 scrapers that run every hour
  2. Each scraper on average runs for 15 minutes (therefore we’ll have around 250 running scrapers at any point in time)
  3. Each scraper needs around 0.1 CPU (they mostly do IO, so we don’t expect much CPU usage) and 512 MB of memory (so in total, just for running scrapers, we’ll need 25 CPU and 125 GB of memory; see the quick calculation after this list)
  4. On average 10 scrapers’ configs are changed within a day
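The totals above can be double-checked with a quick calculation (all inputs are the assumed numbers from the list):

scrapers = 1000                  # each runs once per hour
avg_runtime_min = 15
cpu_per_scraper = 0.1            # cores
mem_per_scraper_mb = 512

concurrent = scrapers * avg_runtime_min / 60             # ~250 scrapers running at once
total_cpu = concurrent * cpu_per_scraper                 # ~25 cores
total_mem_gb = concurrent * mem_per_scraper_mb / 1024    # ~125 GB

print(concurrent, total_cpu, total_mem_gb)  # 250.0 25.0 125.0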

As a result, the final setup can look something like this:

Monitoring

There are two parts that need to be monitored: the platform itself and the executed scrapers.

Let’s start with the platform monitoring. We want to have observability over the following things:

  1. Resource utilisation: CPU, Memory, Disk usage of all the components

  2. Queue metrics:

    1. Number of pending jobs
    2. Processing lag (how long the oldest pending item has been sitting in the queue)
    3. Average time spent in the queue
    4. Enqueue/Dequeue latency
  3. Scheduler metrics:

    1. Number of active scheduler replicas
    2. Difference between expected scheduled time and actual scheduled time (scheduler delay)
  4. Worker metrics:

    1. Number of active worker replicas
    2. Difference between expected scheduled time and when the task processing started (end-to-end platform delay)
    3. Success rate of the executed tasks
    4. Number of timed out tasks

All these metrics should tell us how well the platform performs overall.
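As an illustration of how a worker could expose some of these, here is a minimal sketch using the prometheus_client library; the metric names, label values, and port are made up for this example, and the real set of metrics would mirror the lists above.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Platform-level metrics a worker could expose (names are illustrative).
PENDING_JOBS = Gauge(
    "scraping_platform_pending_jobs",
    "Number of jobs currently waiting in the queue",
)
TASKS_TOTAL = Counter(
    "scraping_platform_tasks_total",
    "Executed scraper tasks by outcome",
    ["outcome"],  # succeeded / failed / timed_out
)
SCHEDULING_DELAY = Histogram(
    "scraping_platform_scheduling_delay_seconds",
    "Time between the expected schedule time and when processing actually started",
)

def report_task_finished(outcome: str, delay_seconds: float) -> None:
    TASKS_TOTAL.labels(outcome=outcome).inc()
    SCHEDULING_DELAY.observe(delay_seconds)

if __name__ == "__main__":
    # Expose the metrics on :8000/metrics for Prometheus to scrape.
    # The component that owns the queue would periodically call PENDING_JOBS.set(...).
    start_http_server(8000)

The Scheduler and the Jobs queue would expose their own counterparts of these metrics.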

As a platform, we could also expose some metrics to the scraper developers so they can keep their scrapers up to date easily. Here are some metrics that come to mind (a sketch of computing a couple of them follows the list):

  1. Task success rate over different periods of time - to tell if the scraper implementation is flaky or completely broken.
  2. Number of consecutive failures - this tells us that the scraper is most likely completely broken and should be fixed as soon as possible.
  3. Time since the last succeeded task - this ensures we execute scrapers within SLAs (e.g. we may have a requirement to provide the data within 1 day of its release)
  4. Execution time - this tells us whether we are able to process the data as soon as it arrives and don’t fall behind (e.g. if scraping data that covers a 1-day period takes 2 days, we’ll never catch up to the current point in time, hence the scraper needs to be optimised)
  5. Resource consumption (if we spawn a process for each scraper that should be fairly easy to measure)
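Most of these per-scraper metrics can be derived from the records in the Scrapers jobs storage. Here is a rough sketch of computing two of them; the JobRecord shape is an assumption, and the real jobs storage schema may differ.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class JobRecord:
    scraper_name: str
    finished_at: datetime  # assumed to be timezone-aware (UTC)
    status: str            # "Succeeded" or "Failed"

def consecutive_failures(jobs: list[JobRecord]) -> int:
    """Count failures since the most recent success."""
    count = 0
    for job in sorted(jobs, key=lambda j: j.finished_at, reverse=True):
        if job.status == "Succeeded":
            break
        count += 1
    return count

def time_since_last_success(jobs: list[JobRecord]) -> float | None:
    """Seconds since the last successful run, or None if the scraper never succeeded."""
    successes = [job.finished_at for job in jobs if job.status == "Succeeded"]
    if not successes:
        return None
    return (datetime.now(timezone.utc) - max(successes)).total_seconds()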

Existing solutions

There are actually not that many existing solutions that provide a full-fledged scraping experience.

The most popular is Scrapy - an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. There are multiple ways to host it, e.g. Scrapyd (discussed above) or a managed service such as Zyte’s Scrapy Cloud.
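For reference, a minimal Scrapy spider looks like this (the target site and selectors are just an example, taken from Scrapy’s demo site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract structured items from the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination links.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Such a spider can be run locally with scrapy crawl quotes or deployed to Scrapyd as described in the Scraper runner section.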

Crawlee is another popular solution for creating and running scrapers that use headless browsers such as Playwright or Selenium. However, there are no built-in scheduling and monitoring capabilities, so you’ll either need to implement them yourself or use Apify to host the scrapers.

Another notable option is to use no-code solutions like ParseHub or Octoparse, which allow you to train scrapers by clicking on what you want to scrape.