Hey guys, I’m Cai Shunfeng, a senior data engineer at WhaleOps and a committer and PMC member of the Apache DolphinScheduler community. Today, I will explain how Worker tasks in Apache DolphinScheduler are executed.

This explanation will be divided into three sections:

  1. Introduction to Apache DolphinScheduler
  2. Overview of Apache DolphinScheduler’s overall design
  3. Detailed execution process of Worker tasks

Project Introduction

Apache DolphinScheduler is an open-source, distributed, and easily extensible visual workflow scheduling system, suitable for enterprise-level scenarios.

Through visual operations, it offers a full-lifecycle data processing solution for workflows and tasks, built around the following key functionalities.

Key Features

Overall Design

Project Architecture

Next, let’s introduce the overall design background. Below is the design architecture diagram provided on the official website.

From the architecture diagram, we can see that Apache DolphinScheduler is composed of several main components:

Master and Worker Interaction Process

The interaction process between the master and worker is as follows:

Worker Task Reception

When the worker receives a task, the following operations are performed:

The worker first checks whether it is overloaded; if so, it rejects the task. After receiving the dispatch-failure feedback, the master selects another worker according to its dispatch strategy and retries.
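The overload check can be sketched as follows; the threshold values and the class name are illustrative, not DolphinScheduler’s actual configuration keys or API:

```java
// Illustrative sketch of a worker-side overload check. The thresholds are
// hypothetical; DolphinScheduler's real configuration keys differ.
public class WorkerLoadCheck {
    private final double maxCpuUsage;
    private final double maxMemoryUsage;

    public WorkerLoadCheck(double maxCpuUsage, double maxMemoryUsage) {
        this.maxCpuUsage = maxCpuUsage;
        this.maxMemoryUsage = maxMemoryUsage;
    }

    /** Returns true if the worker should reject newly dispatched tasks. */
    public boolean isOverloaded(double cpuUsage, double memoryUsage) {
        return cpuUsage > maxCpuUsage || memoryUsage > maxMemoryUsage;
    }
}
```

When this check returns true, the worker replies with a dispatch failure, and the master picks another candidate worker.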

Worker Execution Process

The specific execution process of worker tasks includes the following steps:

  1. Task Initialization: Initializes the environment and dependencies required for the task.
  2. Task Execution: Executes the specific task logic.
  3. Task Completion: After the task execution is completed, reports the task execution results to the master node.
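The three phases above can be sketched as a template method; the class and method names are illustrative, not DolphinScheduler’s actual task API:

```java
// Sketch of the worker task lifecycle as a template method.
// The names here are hypothetical, not DolphinScheduler's real classes.
public abstract class AbstractTask {
    /** Runs the three lifecycle phases in a fixed order. */
    public final void run() {
        init();     // prepare environment and dependencies
        execute();  // run the concrete task logic
        complete(); // report the result back to the master
    }

    protected abstract void init();
    protected abstract void execute();
    protected abstract void complete();
}
```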

Next, we will detail the specific task execution process.

Before task execution begins, a context is initialized, and the task’s start time is set at this point. To ensure accurate task timing, the clocks of the master and the worker must be synchronized to avoid time drift.
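A minimal sketch of such a drift check, assuming a hypothetical tolerance of five seconds:

```java
// Minimal sketch of a clock-drift check between master and worker.
// The tolerance value is illustrative, not a DolphinScheduler default.
public class ClockDriftCheck {
    static final long MAX_DRIFT_MILLIS = 5_000;

    /** Returns true if the two clocks agree within the tolerance. */
    static boolean withinTolerance(long masterTimeMillis, long workerTimeMillis) {
        return Math.abs(masterTimeMillis - workerTimeMillis) <= MAX_DRIFT_MILLIS;
    }
}
```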

Subsequently, the task status is set to running and fed back to the master to notify that the task has started running.

Since most tasks run on the Linux operating system, tenant and file processing are required:

After processing the tenant, the worker creates the task-specific execution directory. The root of the execution directory is configurable and must be granted appropriate permissions; by default, directories are created with 755 permissions.
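Creating a directory with 755 permissions can be done with the standard NIO API on POSIX systems; the class name and directory layout here are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Sketch of creating a task execution directory with 755 permissions.
// The class name and layout are hypothetical.
public class ExecDirCreator {
    /** Creates <root>/<taskInstanceId> with rwxr-xr-x (755) permissions. */
    public static Path createExecDir(Path root, String taskInstanceId) throws IOException {
        Path dir = root.resolve(taskInstanceId);
        Files.createDirectories(dir);
        // rwxr-xr-x, i.e. 755 — works only on POSIX file systems (Linux, macOS)
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rwxr-xr-x");
        Files.setPosixFilePermissions(dir, perms);
        return dir;
    }
}
```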

During task execution, various resource files may be needed, such as fetching files from AWS S3 or HDFS clusters. The system downloads these files to the worker’s temporary directory for subsequent task use.

In Apache DolphinScheduler, parameter variables can be replaced. The main categories include:

Through the above steps, the task’s execution environment and required resources are ready, and the task can officially start execution.

Different Types of Tasks

In Apache DolphinScheduler, various types of tasks are supported, each applicable to different scenarios and requirements. Below, we introduce several major task types and their specific components.

These components are commonly used to execute script files, suitable for various scripting languages and protocols:

The commercial version (WhaleScheduler) also supports running Java applications by executing JAR packages.

Logic Task Components

These components are used to implement logical control and workflow management:

Big Data Components

These components are mainly used for big data processing and analysis:

Container Components

These components are used to run tasks in a container environment:

Data Quality Components

Used to ensure data quality:

Interactive Components

These components are used to interact with data science and machine learning environments:

Machine Learning Components

These components are used for the management and execution of machine learning tasks:

Overall, Apache DolphinScheduler supports three to four dozen components, covering areas ranging from script execution and big data processing to machine learning. For more information, please visit the official website to view the detailed documentation.

Task Type Abstraction

In Apache DolphinScheduler, task types are abstracted into multiple processing modes to suit various runtime environments and needs.

Below we introduce the abstraction and execution process of task types in detail.

The worker is a JVM service deployed on a server. For script components (such as Shell and Python) and locally run tasks (such as Spark Local), the worker starts a separate process to run them.

At this point, the worker interacts with these tasks through the process ID (PID).
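Launching a script task in a separate process and obtaining its PID can be sketched with ProcessBuilder; the class name is illustrative, and in the real worker the wait is asynchronous:

```java
import java.io.IOException;

// Sketch of launching a script task as a separate OS process and
// tracking it by PID. The class name is hypothetical.
public class ShellTaskLauncher {
    /** Launches a command in a child process and returns its PID. */
    public static long launch(String... command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout for log collection
        Process process = builder.start();
        long pid = process.pid();          // available since Java 9
        process.waitFor();                 // the real worker waits asynchronously
        return pid;
    }
}
```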

Different data sources may require different adaptations. For SQL and stored procedure tasks, we have abstracted handling for different data sources, such as MySQL, PostgreSQL, AWS Redshift, etc. This abstraction allows for flexible adaptation and expansion of different database types.
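The idea of a per-database abstraction can be sketched like this; the interface and class names are hypothetical, not DolphinScheduler’s actual plugin API:

```java
// Hypothetical sketch of a per-database abstraction for SQL tasks:
// each database type supplies its own connection details behind a
// common interface, so new types can be added without touching callers.
interface DataSourceClient {
    String jdbcUrl(String host, int port, String database);
}

class MySqlClient implements DataSourceClient {
    public String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }
}

class PostgreSqlClient implements DataSourceClient {
    public String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }
}
```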

Remote tasks refer to tasks that are executed on remote clusters, such as AWS EMR, SeaTunnel clusters, Kubernetes clusters, etc. The Worker does not execute these tasks locally; instead, it submits them to the remote clusters and monitors their status and messages. This mode is particularly suited for cloud environments where scalability is required.
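The submit-then-poll pattern for remote tasks can be sketched as follows; RemoteCluster and its status strings are hypothetical stand-ins for, say, an EMR client:

```java
// Illustrative sketch of the remote-task pattern: submit a job to a
// remote cluster, then poll its status until it finishes.
// RemoteCluster and the status strings are hypothetical.
interface RemoteCluster {
    String submit(String jobDefinition); // returns a remote job ID
    String status(String jobId);         // e.g. "RUNNING", "SUCCEEDED", "FAILED"
}

class RemoteTaskRunner {
    /** Polls the remote cluster until the job reaches a terminal state. */
    static String waitForCompletion(RemoteCluster cluster, String jobId,
                                    long pollIntervalMillis) throws InterruptedException {
        while (true) {
            String status = cluster.status(jobId);
            if (status.equals("SUCCEEDED") || status.equals("FAILED")) {
                return status;
            }
            Thread.sleep(pollIntervalMillis);
        }
    }
}
```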

Task Execution

Log Collection

Different plugins use different processing modes, and therefore, log collection varies accordingly:

Parameter Variable Substitution

The system scans the task logs to identify any parameter variables that need to be dynamically replaced. For example, Task A in the DAG may generate some output parameters that need to be passed to downstream Task B.

During this process, the system reads the logs and substitutes the parameter variables as required.
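Scanning log lines for output parameters can be sketched like this. The ${setValue(key=value)} convention follows DolphinScheduler’s shell-task output parameters, but treat the scanner class itself as illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of scanning task log lines for output parameters written as
// ${setValue(key=value)}. The class name is hypothetical.
public class OutputParamScanner {
    private static final Pattern SET_VALUE =
            Pattern.compile("\\$\\{setValue\\((\\w+)=([^)]*)\\)\\}");

    /** Collects key/value pairs found in the given log lines. */
    public static Map<String, String> scan(Iterable<String> logLines) {
        Map<String, String> params = new HashMap<>();
        for (String line : logLines) {
            Matcher m = SET_VALUE.matcher(line);
            while (m.find()) {
                params.put(m.group(1), m.group(2));
            }
        }
        return params;
    }
}
```

The collected parameters can then be handed to downstream tasks such as Task B in the DAG example above.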

Retrieving Task ID

During execution, the worker records the IDs of submitted tasks, for example the application IDs of jobs running on remote clusters. Holding these task IDs allows for further data queries and remote task operations. For instance, when a workflow is stopped, the corresponding cancel API can be called with the task ID to terminate the running task.
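The idea can be sketched as a registry of running tasks that a stop request is routed through; the names are hypothetical, not DolphinScheduler’s actual classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: the worker keeps a registry of running tasks so
// that a stop request from the master can be routed to the right
// cancel call. Class and method names are hypothetical.
public class TaskRegistry {
    public interface Cancellable { void cancel(); }

    private final Map<String, Cancellable> running = new ConcurrentHashMap<>();

    public void register(String taskId, Cancellable task) {
        running.put(taskId, task);
    }

    /** Returns true if a running task with this ID was found and cancelled. */
    public boolean cancel(String taskId) {
        Cancellable task = running.remove(taskId);
        if (task == null) return false;
        task.cancel();
        return true;
    }
}
```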

Fault Tolerance Handling

Task Execution Completion

After a task has been executed, several completion actions are required:

Through these steps, the entire execution process of a task instance is completed.

Community Contribution

If you are interested in Apache DolphinScheduler and want to contribute to the open-source community, you are welcome to refer to our contribution guidelines.

The community encourages active contributions, including but not limited to:

Guide for New Contributors

For new contributors, you can search the community’s GitHub issues for those labeled “good first issue”. These issues are generally simpler and well suited for a first contribution.

In summary, we have learned about the overall design of Apache DolphinScheduler and the detailed execution process of Worker tasks.

I hope this content helps you better understand and use Apache DolphinScheduler. If you have any questions, feel free to reach out to me in the comment section.