Introduction

Cisco Webex has been using Apache DolphinScheduler to build its big data scheduling platform for nearly three years, starting from version 2.0.3.

Recently, the Webex team shared their experience with Apache DolphinScheduler 3.1.1, including new features they developed that are not in the community version. The sharing mainly covers how they use Apache DolphinScheduler to build a big data platform and deploy tasks to AWS, the challenges they encountered, and the solutions they implemented.

A summary of Webex's practice with Apache DolphinScheduler is shown below:

Architecture Design and Adjustments

All our services were initially deployed on Kubernetes (K8s), including components like API, Alert, Zookeeper (ZK), Master, and Worker.

Big Data Processing Tasks

We carried out re-development for tasks such as Spark, ETL, and Flink:

Supporting Jobs on AWS

As our business expanded and data policies required, we faced the challenge of running data tasks across different regions. This necessitated an architecture capable of supporting multiple clusters. Below is a detailed description of our solution and implementation process.

Our current architecture includes a centralized control terminal, which is a single instance of the Apache DolphinScheduler service that manages multiple clusters. These clusters are distributed in different geographical locations, such as the EU and the US, to comply with local data policies and isolation requirements.

Architecture Adjustments

To meet these requirements, we made the following adjustments:

With this design, we can flexibly respond to different business needs and technical challenges while ensuring data isolation and policy compliance.

Next, let me introduce the technical implementation and resource dependencies when running tasks with Apache DolphinScheduler in Cisco Webex DC.

Resource Dependencies and Storage

Since all our tasks run on Kubernetes (K8s), the following points are crucial to us:

Secure Access and Permission Management

To access the S3 Bucket, we need to configure and manage AWS credentials:
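As a rough illustration of the kind of credential wiring involved, the sketch below assembles S3 client settings from environment variables. The variable names, the `AWS` prefix, and the shape of the returned dict are assumptions for illustration; Webex's actual configuration is not described in detail here.

```python
import os

def load_s3_credentials(prefix="AWS"):
    """Assemble S3 client settings from environment variables.

    The variable names and the returned dict's shape are illustrative
    only; they mirror the keyword arguments boto3 clients accept.
    """
    return {
        "aws_access_key_id": os.environ[f"{prefix}_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ[f"{prefix}_SECRET_ACCESS_KEY"],
        "region_name": os.environ.get(f"{prefix}_REGION", "us-east-1"),
    }
```

The resulting dict could then be passed to a client factory such as `boto3.client("s3", **creds)`.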

IAM Account Access Key Expiration Issues and Countermeasures

In the process of managing AWS resources with IAM accounts, we encountered the issue of access key expiration. Here’s how we addressed this challenge.

To handle this, we set tasks to restart periodically and configured corresponding monitoring; if an AWS account has issues before its key expires, the monitoring notifies the relevant developers for handling.
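The monitoring logic can be sketched as an age check against an assumed rotation policy. The 90-day lifetime and 7-day alert lead time below are hypothetical values, not Webex's actual policy:

```python
from datetime import datetime, timedelta, timezone

KEY_MAX_AGE = timedelta(days=90)      # assumed key rotation policy
ALERT_LEAD_TIME = timedelta(days=7)   # notify owners a week in advance

def key_status(created_at, now=None):
    """Classify an IAM access key as 'ok', 'expiring', or 'expired'.

    An 'expiring' result is where the monitoring would page the
    developer who owns the key so it can be rotated before it lapses.
    """
    now = now or datetime.now(timezone.utc)
    age = now - created_at
    if age >= KEY_MAX_AGE:
        return "expired"
    if age >= KEY_MAX_AGE - ALERT_LEAD_TIME:
        return "expiring"
    return "ok"
```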

Supporting AWS EKS

As our business expands to AWS EKS, we need to make a series of adjustments to the existing architecture and security measures.

For example, the Docker image mentioned earlier, which was previously stored in Cisco’s Docker repo, now needs to be pushed to ECR.

Support for Multiple S3 Buckets

Due to the decentralization of AWS clusters and the data isolation requirements of different businesses, we need to support multiple S3 Buckets to meet the data storage needs of different clusters:
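Conceptually, supporting multiple buckets means routing each task's storage to the bucket owned by its region and business line. The mapping below is a toy model with made-up bucket names, not Webex's real layout:

```python
# Hypothetical (region, business) -> bucket mapping; real names differ.
BUCKETS = {
    ("eu", "analytics"): "webex-eu-analytics-data",
    ("eu", "billing"):   "webex-eu-billing-data",
    ("us", "analytics"): "webex-us-analytics-data",
}

def resolve_bucket(region, business):
    """Pick the S3 bucket for a task, keeping data in its home region."""
    try:
        return BUCKETS[(region, business)]
    except KeyError:
        raise ValueError(f"no bucket configured for {region}/{business}")
```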

Password Management Tool Change

To enhance security, we migrated from Cisco’s self-built Vault service to AWS Secrets Manager (ASM):

These adjustments not only enhance the scalability and flexibility of our system but also strengthen the overall security architecture, ensuring efficient and secure operation in the AWS environment. At the same time, they avoid the issue of automatic key expiration requiring a restart.

Implementation of Changes

AWS Resource Management and Permission Isolation

Integration with AWS Secrets Manager (ASM)

We extended Apache DolphinScheduler to support AWS Secrets Manager, allowing users to select keys in different types of clusters:

ASM Function Integration:
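One way to picture the integration is a per-cluster secret naming convention plus a thin fetch wrapper. The `dolphinscheduler/...` naming scheme below is an assumption for illustration; `fetch_secret` expects any client exposing boto3's `get_secret_value` interface:

```python
def secret_id_for(cluster_type, cluster_name, key_name):
    """Build an ASM secret id namespaced per cluster.

    The naming convention here is hypothetical; the point is that
    each cluster type/instance gets its own secret namespace, so
    users can select keys appropriate to the target cluster.
    """
    return f"dolphinscheduler/{cluster_type}/{cluster_name}/{key_name}"

def fetch_secret(secret_id, client):
    """client is expected to expose boto3's get_secret_value API."""
    resp = client.get_secret_value(SecretId=secret_id)
    return resp["SecretString"]
```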

Dynamic Resource Configuration and Initialization Service (Init Container)

To manage and initialize AWS resources more flexibly, we implemented a service called Init Container:
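The core job of such an init step can be sketched as: fetch the resources a task needs and materialize them on the pod's shared volume before the main container starts. The file layout and permissions below are assumptions, not DolphinScheduler's actual convention:

```python
import os
import pathlib

def init_resources(secrets, target_dir):
    """Write fetched secrets as files on a shared volume.

    An init container would run this before the main task container
    starts, so the task finds its credentials ready on disk. Layout
    and 0o600 permissions are illustrative choices.
    """
    target = pathlib.Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, value in secrets.items():
        path = target / name
        path.write_text(value)
        os.chmod(path, 0o600)  # readable by the pod user only
    return sorted(p.name for p in target.iterdir())
```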

Application of Terraform in Resource Creation and Management

We automated the configuration and management process of AWS resources through Terraform, simplifying resource allocation and permission settings:
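A minimal Terraform sketch of this pattern might pair a bucket with a scoped read/write policy. All names below are placeholders, and the real modules certainly cover far more resources:

```hcl
# Illustrative only: bucket and policy names are placeholders.
resource "aws_s3_bucket" "task_data" {
  bucket = "webex-eu-analytics-data"
}

resource "aws_iam_policy" "task_data_rw" {
  name = "task-data-rw"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.task_data.arn,
        "${aws_s3_bucket.task_data.arn}/*",
      ]
    }]
  })
}
```

Keeping the policy scoped to a single bucket is what makes the per-cluster permission isolation described below enforceable at the IAM level.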

Permission Isolation and Security

We ensured the isolation of resources and the management of security risks by implementing fine-grained permission isolation strategies:

Cluster Support and Permission Control Improvements

Cluster Type Expansion

We added a new field called cluster type to support different types of K8s clusters, including standard Webex DC clusters and AWS EKS clusters, as well as clusters with higher security requirements.
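In code, the new field amounts to an enumeration that task scheduling and permission checks can branch on. The member names below are illustrative, not the actual identifiers:

```python
from enum import Enum

class ClusterType(Enum):
    """Kinds of K8s clusters a task can target (names illustrative)."""
    WEBEX_DC = "webex-dc"          # standard Webex DC cluster
    AWS_EKS = "aws-eks"            # cluster running on AWS EKS
    HIGH_SECURITY = "high-security"  # cluster with stricter requirements
```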

Cluster Type Management

Enhanced Permission Control System (Auth System)

We developed an Auth System specifically for fine-grained permission control, including project, resource, and namespace permissions management:

For example, if team A owns namespace A, only certain project jobs are allowed to run in that namespace, and user B can neither see nor run team A's job configurations.
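The namespace check above can be modeled as a grant table keyed by team and project. This is a toy model of the idea, not the Auth System's real storage or identifiers:

```python
# Hypothetical grant table: which namespaces a (team, project) may use.
GRANTS = {
    ("team-a", "project-x"): {"ns-team-a"},
}

def can_run(team, project, namespace):
    """True only if this team's project was granted the namespace."""
    return namespace in GRANTS.get((team, project), set())
```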

AWS Resource Management and Permission Requests

We manage AWS resource permissions and access control through the Auth system and other tools, making resource allocation more flexible and secure:

Service Account Management and Permission Binding

To better manage service accounts and their permissions, we have implemented the following features:

Service Account Binding and Management

Simplified Operations and Resource Synchronization

Although I’ve covered a lot, the actual process for users is quite simple, as the entire application process is typically a one-time task. To further enhance the user experience of Apache DolphinScheduler in the AWS environment, we have implemented several measures to simplify the operation process and enhance resource synchronization functionality.

Summary

Future Outlook

Several areas in the current design could be optimized to improve job submission and ease of operations:

Through these improvements, we aim to help users deploy and manage their jobs more effectively with Apache DolphinScheduler, whether in Webex DC or on EKS, while enhancing the efficiency and security of resource management.