All scrapers can be split into two categories that require different infrastructure and techniques:

  1. Generic scrapers. A good example is Googlebot: it wanders around the internet, downloads all the pages it can reach, and indexes them for the Google search engine. There are quite a lot of articles and books on generic scraping, so I’m not going to go into detail about it here.
  2. Targeted scrapers. These are used when you need to download specific pages on a website and extract structured, valuable information. Examples of such scrapers are news scrapers, competitor price analysis, and making a local copy of publicly available data (e.g. US patents). If you need this data just once, it’s fine to write a simple script and run it on your local machine, but if the data needs to be updated regularly, you’ll soon want to make sure it runs reliably and that you have observability over the current system status.

Now let’s try to design a system that can execute targeted scrapers regularly. The system only has one functional requirement: it must be able to execute arbitrary scrapers on schedule.

What about non-functional requirements?

  1. It must reliably schedule and execute scrapers
  2. It should not DDoS the domains being scraped
  3. It should be highly available
  4. It should properly handle stuck scrapers (infinite loops, lost workers, etc.)
  5. It should provide monitoring for itself and for the scrapers being executed

Here’s an example of how such a system could look.

As a result, we have the following components:

Now let’s see how everything works together:

  1. Scheduler loads all scraper configs and starts scheduling the jobs.

  2. Scheduler registers a job in the Scrapers jobs storage with the status Queued

  3. Scheduler enqueues the registered job in the Jobs queue

  4. Worker dequeues a scraper job from the Jobs queue and marks it as Started

  5. Worker spawns a Scraper runner that executes the scraper logic. It also starts a thread that continuously reports (heartbeats) that the job is still active, so the Scheduler doesn’t enqueue another job

  6. Scraper runner executes scraper logic.

  7. All of the requests made by the scraper logic are intercepted by the Requests middleware (see the sketch after this list), which can be used for:

    1. Rate limiting requests to certain domains
    2. Automatically checking if the request is blocked by robots.txt
    3. Using proxies
    4. Modifying the request (e.g. adding some headers)
    5. Logging
  8. Scraper logic may save its results somewhere or call some APIs based on data it has scraped

  9. When the scraper logic has finished executing, the Scraper runner marks the job as Succeeded or Failed.
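To make the Requests middleware a bit more concrete, here is a minimal sketch of what it could look like for scrapers built on the requests library. The class name, the one-request-per-second limit, and the user agent are illustrative assumptions, not part of any existing library.

import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

class MiddlewareSession(requests.Session):
    """A sketch of a requests middleware: rate limiting + robots.txt checks."""

    def __init__(self, min_interval_s: float = 1.0, user_agent: str = "my-scraper"):
        super().__init__()
        self.min_interval_s = min_interval_s  # minimal delay between requests per domain
        self.user_agent = user_agent
        self.last_request_at: dict[str, float] = {}
        self.robots: dict[str, urllib.robotparser.RobotFileParser] = {}

    def request(self, method, url, **kwargs):
        domain = urlparse(url).netloc

        # Check robots.txt (fetched and cached per domain).
        if domain not in self.robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{domain}/robots.txt")
            parser.read()
            self.robots[domain] = parser
        if not self.robots[domain].can_fetch(self.user_agent, url):
            raise PermissionError(f"Blocked by robots.txt: {url}")

        # Naive per-domain rate limiting.
        elapsed = time.monotonic() - self.last_request_at.get(domain, 0.0)
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self.last_request_at[domain] = time.monotonic()

        # Modify the request (headers) and log it; a proxy could also be set via self.proxies.
        kwargs.setdefault("headers", {})["User-Agent"] = self.user_agent
        print(f"[middleware] {method} {url}")
        return super().request(method, url, **kwargs)

A scraper would then call MiddlewareSession().get(...) instead of requests.get(...), or the platform could inject such a session into every scraper it runs.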

Scraper runner

How does the Scraper runner execute the scraper logic? There are multiple options. Let’s discuss how we can solve this problem in Python, since scrapers are most commonly written in it.

One approach, used in Scrapyd, is to deploy eggified Python packages. Basically, you package your application in the egg format and then Scrapyd spawns a process that loads the provided egg file and executes it. This gives us several advantages:

  1. Packages may include dependencies in the egg file and different scrapers may use different versions of the same dependency. This can be especially useful if your system is multi-tenant.
  2. It’s possible to dynamically deploy scrapers.

In this case, scraper developers and platform owners are decoupled: deploying a new scraper version is just an API call to the running platform, as sketched below.
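For illustration, here is roughly what deploying and scheduling look like against Scrapyd’s HTTP API (the addversion.json and schedule.json endpoints). The URL, project name, version, and file path are assumptions made for this sketch.

import requests

SCRAPYD_URL = "http://localhost:6800"  # assumption: a Scrapyd instance running locally

# Upload an egg (built e.g. with `python setup.py bdist_egg`) as a new project version.
with open("dist/my_scrapers-1.0.0-py3.11.egg", "rb") as egg:
    response = requests.post(
        f"{SCRAPYD_URL}/addversion.json",
        data={"project": "my_scrapers", "version": "1.0.0"},
        files={"egg": egg},
    )
    print(response.json())  # e.g. {"status": "ok", ...}

# Schedule one of the spiders from the freshly deployed version.
response = requests.post(
    f"{SCRAPYD_URL}/schedule.json",
    data={"project": "my_scrapers", "spider": "sample_spider"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}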

Another approach can be used when the scraper developers and the platform owners don’t need to be decoupled, e.g. when a single team both develops the scrapers and maintains the scraping platform.

In this case, the scraper runner implementation can simply import all the scraper implementations and execute them based on what it dequeues from the queue. Here’s an example implementation:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, override  # override requires Python 3.12+ (or typing_extensions)

import requests

# Scraper implementation interface
class ScraperABC(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def execute(self, config: Any) -> Any:
        ...

# Scraper config that is passed to the Scraper implementation
@dataclass
class ScraperConfig:
    ...

# Sample Scraper implementation
class SampleScraper(ScraperABC):
    @property
    @abstractmethod
    def name(self):
        return "sample_scraper"

    @override
    def execute(config: ScraperConfig) -> Any:
        # <do the actual scrapping>
        text = requests.get("https://example.com").text
        return text[:100]


class Worker:
    def __init__(self, scrapers: list[ScraperABC]):
        self.queue = ...
        self.scrapers = {
          scraper.name: scraper 
          for scraper in scrapers
        }

    # Scraper runner implementation
    def execute_scraper(self, name: str, config: Any) -> Any:
        if name not in self.scrapers:
            raise Exception(f"Unknown scraper {name}")
        return self.scrapers[name].execute(config)

    # Implement consuming the queue
    def start(self):
        while True:
            task = self.queue.dequeue()
            self.execute_scraper(
                task.scraper_name, 
                task.scraper_config
            )

if __name__ == "__main__":
    Worker(
        scrapers=[
            SampleScraper()
        ]
    ).start()

This approach is simpler to maintain and has fewer caveats, although it has some limitations:

  1. All of the scrapers share dependencies, which may not be a big deal since this platform is not intended to be multi-tenant.
  2. Deploying new scrapers is not dynamic and is coupled with deploying the platform, which may slow down the release cycle.

Availability and Scalability of the platform

The availability and scalability of the platform come down to the availability and scalability of its components.

To get a clearer understanding of which components to choose, let’s do some back-of-the-envelope calculations. Imagine we have:

  1. 1000 scrapers that run every hour
  2. Each scraper on average runs for 15 minutes (therefore we’ll have around 250 running scrapers at any point in time)
  3. Each scraper needs around 0.1 CPU (they mostly do IO, so we don’t expect much CPU usage) and 512 MB of memory (so in total, just for running scrapers, we’ll need 25 CPU and 125 GB of memory; see the quick calculation after this list)
  4. On average 10 scrapers’ configs are changed within a day
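The totals above can be double-checked with a quick calculation (all inputs are the assumed numbers from the list):

scrapers = 1000                  # each runs once per hour
avg_runtime_min = 15
cpu_per_scraper = 0.1            # cores
mem_per_scraper_mb = 512

concurrent = scrapers * avg_runtime_min / 60             # ~250 scrapers running at once
total_cpu = concurrent * cpu_per_scraper                 # ~25 cores
total_mem_gb = concurrent * mem_per_scraper_mb / 1024    # ~125 GB

print(concurrent, total_cpu, total_mem_gb)  # 250.0 25.0 125.0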

As a result, the final setup can look something like this:

Monitoring

There are two parts that need to be monitored: the platform itself and the executed scrapers.

Let’s start with the platform monitoring. We want to have observability over the following things:

  1. Resource utilisation: CPU, Memory, Disk usage of all the components

  2. Queue metrics:

    1. Number of pending jobs
    2. Processing lag (how long the oldest pending item has been sitting in the queue)
    3. Average time spent in the queue
    4. Enqueue/Dequeue latency
  3. Scheduler metrics:

    1. Number of active scheduler replicas
    2. Difference between expected scheduled time and actual scheduled time (scheduler delay)
  4. Worker metrics:

    1. Number of active worker replicas
    2. Difference between expected scheduled time and when the task processing started (end-to-end platform delay)
    3. Success rate of the executed tasks
    4. Number of timed out tasks

All these metrics should tell us how well the platform performs overall.
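As an illustration of how a worker could expose some of these, here is a minimal sketch using the prometheus_client library; the metric names, label values, and port are made up for this example, and the real set of metrics would mirror the lists above.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Platform-level metrics a worker could expose (names are illustrative).
PENDING_JOBS = Gauge(
    "scraping_platform_pending_jobs",
    "Number of jobs currently waiting in the queue",
)
TASKS_TOTAL = Counter(
    "scraping_platform_tasks_total",
    "Executed scraper tasks by outcome",
    ["outcome"],  # succeeded / failed / timed_out
)
SCHEDULING_DELAY = Histogram(
    "scraping_platform_scheduling_delay_seconds",
    "Time between the expected schedule time and when processing actually started",
)

def report_task_finished(outcome: str, delay_seconds: float) -> None:
    TASKS_TOTAL.labels(outcome=outcome).inc()
    SCHEDULING_DELAY.observe(delay_seconds)

if __name__ == "__main__":
    # Expose the metrics on :8000/metrics for Prometheus to scrape.
    # The component that owns the queue would periodically call PENDING_JOBS.set(...).
    start_http_server(8000)

The Scheduler and the Jobs queue would expose their own counterparts of these metrics.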

As a platform, we could also expose some metrics to the scraper developers so they can keep their scrapers up to date easily. Here are some metrics that come to mind (a sketch of computing a couple of them follows the list):

  1. Task success rate over different periods of time - to tell if the scraper implementation is flaky or completely broken.
  2. Number of consecutive failures - this tells us that the scraper is most likely completely broken and should be fixed as soon as possible.
  3. Time since the last succeeded task - this ensures we execute scrapers within SLAs (e.g. we may have a requirement to provide the data within 1 day of its release)
  4. Execution time - this tells us whether we are able to process the data as soon as it arrives and don’t fall behind (e.g. if scraping data that covers a 1-day period takes 2 days, we’ll never catch up to the current point in time, hence the scraper needs to be optimised)
  5. Resource consumption (if we spawn a process for each scraper that should be fairly easy to measure)
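Most of these per-scraper metrics can be derived from the records in the Scrapers jobs storage. Here is a rough sketch of computing two of them; the JobRecord shape is an assumption, and the real jobs storage schema may differ.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class JobRecord:
    scraper_name: str
    finished_at: datetime  # assumed to be timezone-aware (UTC)
    status: str            # "Succeeded" or "Failed"

def consecutive_failures(jobs: list[JobRecord]) -> int:
    """Count failures since the most recent success."""
    count = 0
    for job in sorted(jobs, key=lambda j: j.finished_at, reverse=True):
        if job.status == "Succeeded":
            break
        count += 1
    return count

def time_since_last_success(jobs: list[JobRecord]) -> float | None:
    """Seconds since the last successful run, or None if the scraper never succeeded."""
    successes = [job.finished_at for job in jobs if job.status == "Succeeded"]
    if not successes:
        return None
    return (datetime.now(timezone.utc) - max(successes)).total_seconds()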

Existing solutions

There are actually not that many existing solutions that provide a full-fledged scraping experience.

The most popular is Scrapy - an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. There are multiple ways to host it, e.g. Scrapyd (discussed above) or a managed service such as Zyte’s Scrapy Cloud.
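For reference, a minimal Scrapy spider looks like this (the target site and selectors are just an example, taken from Scrapy’s demo site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract structured items from the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination links.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Such a spider can be run locally with scrapy crawl quotes or deployed to Scrapyd as described in the Scraper runner section.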

Crawlee is another popular solution for creating and running scrapers that use headless browsers such as Playwright or Selenium. However, there are no built-in scheduling and monitoring capabilities, so you’ll either need to implement them yourself or use Apify to host the scrapers.

Another notable option is to use no-code solutions like ParseHub or Octoparse, which allow you to train scrapers by clicking on what you want to scrape.