Introduction

I’m a software development engineer at Cisco. Our team has been using Apache DolphinScheduler to build our own big data scheduling platform for nearly three years. Starting from version 2.0.3, we’ve grown alongside the community; what I’m sharing today is based on secondary development on version 3.1.1, adding features not included in the community release.

Today I will share how we used Apache DolphinScheduler to build our big data platform and to submit and deploy jobs to AWS, the challenges we encountered along the way, and how we solved them.

Architecture Design and Adjustments

Initially, all of our services were deployed on Kubernetes (K8s), including API, Alert, as well as Zookeeper (ZK), Master, and Worker components.

Big Data Processing Jobs

We performed secondary development for Spark, ETL, and Flink tasks:

Supporting Jobs on AWS

As the business expanded and data policy requirements grew, we faced the challenge of running data tasks in multiple regions. This required an architecture that supports multi-cluster deployment. Here are the details of our solution and implementation.

Our current architecture is built around a centralized control plane: a single Apache DolphinScheduler service that manages multiple clusters. These clusters are deployed across different geographies, such as the EU and the US, to comply with local data policies and isolation requirements.

Architecture Adjustments

To meet this requirement, we made the following modifications:

This design enables a flexible response to diverse business needs and technical challenges while ensuring data isolation and policy compliance.

Next, I’ll discuss the technical implementation and resource dependencies when Apache DolphinScheduler runs jobs in the Cisco Webex DC.

Resource Dependencies and Storage

Since all our jobs run on Kubernetes (K8s), the following are critical to us:

Docker Images

Resource Files and Dependencies

Secure Access and Permission Management

For accessing S3 buckets, we needed to configure and manage AWS credentials:
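
To make the access-key-based setup concrete, here is a minimal sketch, using the AWS SDK for Java v2, of reading a resource file from S3 with static IAM credentials. The bucket, key, and environment-variable wiring are placeholders for illustration, not our actual configuration.

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.nio.file.Paths;

public class S3ResourceFetcher {

    public static void main(String[] args) {
        // IAM access key pair issued to the scheduler's IAM account.
        // In practice these come from configuration, never from source code.
        AwsBasicCredentials credentials = AwsBasicCredentials.create(
                System.getenv("AWS_ACCESS_KEY_ID"),
                System.getenv("AWS_SECRET_ACCESS_KEY"));

        try (S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(StaticCredentialsProvider.create(credentials))
                .build()) {

            // Download a resource file (e.g. a job jar or config) before the task starts.
            s3.getObject(GetObjectRequest.builder()
                            .bucket("example-dolphinscheduler-resources")   // placeholder bucket
                            .key("jobs/spark/wordcount.jar")                // placeholder key
                            .build(),
                    Paths.get("/tmp/wordcount.jar"));
        }
    }
}
```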

IAM Account Configuration

AWS IAM Access Key Expiration and Mitigation

While accessing AWS resources with IAM accounts, we ran into access key expiration issues. Here's how we addressed them:

Access Key Expiry Challenge

In response, we set up automatic periodic task restarts and monitoring alerts: if an AWS access key shows problems before it expires, our team is notified so it can be handled in time.
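
For the monitoring side, here is a hedged sketch of the idea: list the access keys of the scheduler's IAM user, compute their age, and alert before the rotation deadline. The user name, age threshold, and alerting hook below are placeholders.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.iam.IamClient;
import software.amazon.awssdk.services.iam.model.AccessKeyMetadata;
import software.amazon.awssdk.services.iam.model.ListAccessKeysRequest;

import java.time.Duration;
import java.time.Instant;

public class AccessKeyAgeMonitor {

    // Placeholder threshold; use whatever your rotation policy mandates.
    private static final Duration MAX_KEY_AGE = Duration.ofDays(80);

    public static void main(String[] args) {
        // IAM is a global service, so the client is built against AWS_GLOBAL.
        try (IamClient iam = IamClient.builder().region(Region.AWS_GLOBAL).build()) {
            ListAccessKeysRequest request = ListAccessKeysRequest.builder()
                    .userName("dolphinscheduler-service")   // placeholder IAM user
                    .build();

            for (AccessKeyMetadata key : iam.listAccessKeys(request).accessKeyMetadata()) {
                Duration age = Duration.between(key.createDate(), Instant.now());
                if (age.compareTo(MAX_KEY_AGE) > 0) {
                    // Hook this into your alerting channel (email, Webex message, etc.).
                    System.out.printf("Access key %s is %d days old - rotate it soon%n",
                            key.accessKeyId(), age.toDays());
                }
            }
        }
    }
}
```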

Supporting AWS EKS

As the business expanded to AWS EKS, we made several adjustments to our architecture and security model.

For example, Docker images previously stored in Cisco’s private Docker repo now need to be pushed to AWS ECR.
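
A hedged sketch of the AWS side of that push flow: before an image can be pushed, a short-lived registry credential has to be obtained from ECR. The snippet below fetches one with the AWS SDK for Java v2; in practice this step is typically handled by CLI or CI tooling, and the region here is a placeholder.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.ecr.EcrClient;
import software.amazon.awssdk.services.ecr.model.AuthorizationData;
import software.amazon.awssdk.services.ecr.model.GetAuthorizationTokenRequest;

import java.util.Base64;

public class EcrLoginTokenFetcher {

    public static void main(String[] args) {
        try (EcrClient ecr = EcrClient.builder().region(Region.US_EAST_1).build()) {
            // ECR returns a short-lived token encoded as base64("AWS:<password>").
            AuthorizationData auth = ecr
                    .getAuthorizationToken(GetAuthorizationTokenRequest.builder().build())
                    .authorizationData().get(0);

            String decoded = new String(Base64.getDecoder().decode(auth.authorizationToken()));
            String password = decoded.substring(decoded.indexOf(':') + 1);

            // `password` is what gets piped to: docker login --username AWS --password-stdin <registry>
            System.out.println("Registry: " + auth.proxyEndpoint());
            System.out.println("Token expires at: " + auth.expiresAt());
            System.out.println("Password length: " + password.length());
        }
    }
}
```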

Support for Multiple S3 Buckets

Due to the distributed AWS clusters and the need for business data isolation, we needed to support multiple S3 buckets:
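
One way to model multi-bucket support is a per-cluster mapping from logical cluster to bucket and region, with the resource layer picking a region-appropriate S3 client instead of assuming one global bucket. The sketch below is illustrative only; the cluster names, buckets, and regions are invented.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MultiBucketS3Clients {

    // Placeholder mapping: logical cluster -> (bucket, region). In a real setup this
    // would come from cluster configuration, not a hard-coded map.
    private static final Map<String, BucketConfig> BUCKETS = Map.of(
            "eks-us", new BucketConfig("example-ds-resources-us", Region.US_EAST_1),
            "eks-eu", new BucketConfig("example-ds-resources-eu", Region.EU_CENTRAL_1));

    private final Map<Region, S3Client> clients = new ConcurrentHashMap<>();

    /** Returns an S3 client bound to the region of the bucket that backs the given cluster. */
    public S3Client clientFor(String cluster) {
        BucketConfig cfg = BUCKETS.get(cluster);
        return clients.computeIfAbsent(cfg.region(),
                region -> S3Client.builder().region(region).build());
    }

    public String bucketFor(String cluster) {
        return BUCKETS.get(cluster).bucket();
    }

    private record BucketConfig(String bucket, Region region) { }
}
```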

Secrets Management Tool Migration

To enhance security, we migrated from Cisco’s internal Vault to AWS Secrets Manager (ASM):

We adopted an IAM Role + Service Account (IRSA) model to improve Pod security:
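
Under IRSA, EKS mounts a web identity token into the Pod and the SDK exchanges it for short-lived credentials, so no static access keys remain in the Pod. A minimal sketch of an ASM call through that credentials provider (the region and secret name are placeholders):

```java
import software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

public class IrsaSecretsExample {

    public static void main(String[] args) {
        // Inside an EKS Pod running under an IRSA-enabled service account,
        // AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN are injected automatically;
        // the provider below exchanges the mounted token for short-lived credentials.
        try (SecretsManagerClient asm = SecretsManagerClient.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(WebIdentityTokenFileCredentialsProvider.create())
                .build()) {

            // Placeholder secret name; a real task would resolve this from its configuration.
            String secret = asm.getSecretValue(GetSecretValueRequest.builder()
                            .secretId("example/webex-dc/db-password")
                            .build())
                    .secretString();

            System.out.println("Fetched secret of length " + secret.length());
        }
    }
}
```

Because the credentials come from the mounted token, rotation is handled by EKS and STS, which is what removes the long-lived access key expiration problem described above.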

These adjustments not only improved scalability and flexibility but also strengthened our overall security posture and eliminated the access key expiration problem.

Optimizing Resource Management and Storage Flow

To simplify deployment, we plan to push Docker images directly to ECR rather than via intermediate transfers:

Implementation Changes

AWS Resource Management and Access Isolation

Integrating AWS Secrets Manager (ASM)

We extended DolphinScheduler to support AWS Secrets Manager, allowing users to pick secrets based on cluster type:
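
As an illustration of what such a picker can query behind the scenes, the sketch below lists secrets filtered by a tag that identifies the cluster. The tag key and the assumption that secrets are tagged per cluster are made up for the example, not necessarily the production schema.

```java
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.Filter;
import software.amazon.awssdk.services.secretsmanager.model.FilterNameStringType;
import software.amazon.awssdk.services.secretsmanager.model.ListSecretsRequest;
import software.amazon.awssdk.services.secretsmanager.model.SecretListEntry;

import java.util.List;

public class ClusterSecretPicker {

    /** Lists secret names visible for a given cluster type, e.g. "eks-us" or "webex-dc". */
    public static List<String> secretsForCluster(SecretsManagerClient asm, String clusterType) {
        ListSecretsRequest request = ListSecretsRequest.builder()
                .filters(
                        // Assumption: secrets carry a "cluster" tag whose value is the cluster type.
                        Filter.builder().key(FilterNameStringType.TAG_KEY).values("cluster").build(),
                        Filter.builder().key(FilterNameStringType.TAG_VALUE).values(clusterType).build())
                .build();

        return asm.listSecretsPaginator(request).secretList().stream()
                .map(SecretListEntry::name)
                .toList();
    }
}
```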

ASM Integration Features

Dynamic Resource Configuration & Init Containers

To flexibly manage and initialize AWS resources, we deployed an Init Container:
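
To show the shape of this, here is a hedged sketch, built with the official Kubernetes Java client models, of a task Pod whose init container stages AWS-side resources into a shared volume before the main task container starts. Image names, paths, and the service account are placeholders, and the actual create call through CoreV1Api is omitted because its signature varies across client versions.

```java
import io.kubernetes.client.openapi.models.V1Container;
import io.kubernetes.client.openapi.models.V1EmptyDirVolumeSource;
import io.kubernetes.client.openapi.models.V1ObjectMeta;
import io.kubernetes.client.openapi.models.V1Pod;
import io.kubernetes.client.openapi.models.V1PodSpec;
import io.kubernetes.client.openapi.models.V1Volume;
import io.kubernetes.client.openapi.models.V1VolumeMount;

import java.util.List;

public class TaskPodWithInitContainer {

    public static V1Pod buildTaskPod() {
        V1VolumeMount resources = new V1VolumeMount().name("resources").mountPath("/opt/resources");

        // Init container: downloads job artifacts from S3 into the shared volume.
        V1Container init = new V1Container()
                .name("init-aws-resources")
                .image("example.ecr.aws/ds/aws-resource-init:latest")   // placeholder image
                .command(List.of("sh", "-c",
                        "aws s3 cp s3://example-ds-resources-us/jobs/wordcount.jar /opt/resources/"))
                .volumeMounts(List.of(resources));

        // Main container: runs the actual task against the files staged by the init container.
        V1Container main = new V1Container()
                .name("task")
                .image("example.ecr.aws/ds/spark-runner:latest")        // placeholder image
                .volumeMounts(List.of(resources));

        return new V1Pod()
                .apiVersion("v1")
                .kind("Pod")
                .metadata(new V1ObjectMeta().name("ds-task-example").namespace("team-a"))
                .spec(new V1PodSpec()
                        .serviceAccountName("team-a-irsa")               // IRSA-enabled service account
                        .restartPolicy("Never")
                        .initContainers(List.of(init))
                        .containers(List.of(main))
                        .volumes(List.of(new V1Volume()
                                .name("resources")
                                .emptyDir(new V1EmptyDirVolumeSource()))));
    }
}
```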

Using Terraform for Resource Provisioning

We automated AWS resource setup using Terraform, simplifying resource allocation and permission configuration:

Access Isolation and Security

We enforced fine-grained permission and resource isolation across business units:

Implementation Details

Cluster Support and Permission Control Enhancements

Extension of Cluster Types

We added a cluster type field so that we can support different kinds of K8s clusters, not only Webex DC and AWS EKS but also high-security clusters:
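
A simplified sketch of what the cluster type extension can look like in code; the enum values and field names are illustrative rather than the actual DolphinScheduler schema.

```java
/** Illustrative cluster categories; the real set is driven by the deployment environments. */
enum ClusterType {
    WEBEX_DC,        // on-prem Kubernetes in the Cisco Webex data centers
    AWS_EKS,         // EKS clusters in AWS regions
    HIGH_SECURITY    // hardened clusters with stricter isolation requirements
}

/** Minimal cluster descriptor carrying the extra type field used for routing and permission checks. */
class ClusterConfig {
    private final String name;
    private final ClusterType type;
    private final String kubeConfigSecret;   // where the cluster credentials live (e.g. an ASM secret)

    ClusterConfig(String name, ClusterType type, String kubeConfigSecret) {
        this.name = name;
        this.type = type;
        this.kubeConfigSecret = kubeConfigSecret;
    }

    String name() { return name; }
    ClusterType type() { return type; }
    String kubeConfigSecret() { return kubeConfigSecret; }
}
```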

Cluster Type Management

Enhanced Permission Control System (Auth)

We developed an Auth system for fine-grained permission control across projects, resources, and namespaces:

Permission Management Features

For example, Team A can only run jobs in namespace A and cannot run jobs in namespace B.
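
Here is a hedged sketch of the kind of check the Auth system performs before dispatching a job; the types, method names, and mapping data are invented for illustration.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative namespace-level permission check, not the actual Auth service API. */
public class NamespaceAuthorizer {

    // team -> namespaces the team may submit jobs to (placeholder data).
    private final Map<String, Set<String>> allowedNamespaces = Map.of(
            "team-a", Set.of("team-a"),
            "team-b", Set.of("team-b"));

    public void checkCanSubmit(String team, String cluster, String namespace) {
        Set<String> allowed = allowedNamespaces.getOrDefault(team, Set.of());
        if (!allowed.contains(namespace)) {
            throw new SecurityException(String.format(
                    "%s is not allowed to run jobs in namespace '%s' on cluster '%s'",
                    team, namespace, cluster));
        }
    }
}
```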

AWS Resource Access and Permission Requests

Through the Auth system and associated tools, we manage AWS resource access and permission requests securely:

Service Account Management & Permission Binding

To improve service account governance and access binding, we implemented:
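
For illustration, binding a namespace-scoped service account to an IAM role under IRSA comes down to an annotation on the service account. The sketch below uses the Kubernetes Java client models; the role ARN, names, and namespace are placeholders.

```java
import io.kubernetes.client.openapi.models.V1ObjectMeta;
import io.kubernetes.client.openapi.models.V1ServiceAccount;

import java.util.Map;

public class IrsaServiceAccountBuilder {

    public static V1ServiceAccount buildForTeam(String team, String roleArn) {
        return new V1ServiceAccount()
                .apiVersion("v1")
                .kind("ServiceAccount")
                .metadata(new V1ObjectMeta()
                        .name(team + "-irsa")
                        .namespace(team)
                        // IRSA: EKS injects web identity credentials for this role into Pods
                        // that run under this service account.
                        .annotations(Map.of("eks.amazonaws.com/role-arn", roleArn)));
    }

    public static void main(String[] args) {
        V1ServiceAccount sa = buildForTeam("team-a",
                "arn:aws:iam::123456789012:role/example-team-a-ds-role");   // placeholder ARN
        System.out.println(sa.getMetadata().getAnnotations());
    }
}
```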

Service Account Binding Features

Simplified Operations and Resource Synchronization

Although the above sounds extensive, the operations users actually perform are straightforward and only need to be done once. To further improve the experience of running DolphinScheduler in AWS:

Here’s a summary:

Simplified User UI

In DolphinScheduler, users can easily configure a job's target cluster and namespace:

Choosing Cluster and Namespace

Service Account & Resource Selection

Future Outlook

Looking at our current design, there are still areas to optimize in order to improve the job submission flow and day-to-day operations:

With these enhancements, we aim to help users deploy and manage their jobs more effectively on DolphinScheduler—whether in Webex DC or on EKS—while improving resource management efficiency and security.