As Zoom’s business expanded and its data scenarios grew more complex, the company’s scheduling needs also evolved, from traditional batch processing to unified management of streaming jobs. To address this, Zoom selected Apache DolphinScheduler as its core scheduling framework and built a unified scheduling platform that supports both batch and streaming tasks. The platform has been deeply customized and optimized on modern infrastructure, including Kubernetes and multi-cloud deployments. In this article, we’ll dive into the system’s architectural evolution, the key challenges Zoom encountered, how they were solved, and the team’s future plans, all based on real-world production experience.
Background & Challenges: Expanding from Batch to Streaming
In its early stages, Zoom’s data platform focused primarily on Spark SQL batch processing, with tasks scheduled using DolphinScheduler's standard plugins on AWS EMR.
However, new business demands led to a surge in real-time processing needs, such as:
- Real-time metrics computation using Flink SQL
- Spark Structured Streaming for processing logs and event data
- Long-running streaming tasks requiring state tracking and fault recovery
This posed a new challenge for DolphinScheduler: How can streaming tasks be “scheduled” and “managed” just like batch tasks?
Limitations of the Initial Architecture
The Original Approach
In the early integration of streaming jobs, Zoom used DolphinScheduler's Shell task plugin to call the AWS EMR API and launch streaming tasks (e.g., Spark/Flink).
This implementation was simple but quickly revealed several issues:
- No state control: After submission, the task exited immediately without tracking status—causing duplicate submissions or false failures.
- No task instances or logs: Troubleshooting was difficult due to missing logs and observability.
- Fragmented logic: Streaming and batch jobs used different logic paths, making unified maintenance hard.
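For illustration, the fire-and-forget pattern boiled down to something like the boto3 sketch below (the cluster ID, JAR path, and class name are placeholders, not Zoom's actual job). The call returns as soon as EMR accepts the step, so a Shell task wrapping it exits "successfully" with no view of the job's real fate:

```python
# Illustrative only: the original fire-and-forget pattern, sketched with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
    Steps=[{
        "Name": "streaming-job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", "com.example.StreamingJob",
                     "s3://bucket/jobs/streaming-job.jar"],  # placeholder paths
        },
    }],
)

# The API returns once the step is *accepted*, not when the job is healthy.
# The Shell task exits here, so DolphinScheduler marks it done and never sees
# later failures of the long-running job -- the "no state control" problem.
print("submitted step:", response["StepIds"][0])
```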
These issues highlighted the urgent need for a unified batch-stream scheduling architecture.
System Evolution: Introducing a State Machine for Streaming Jobs
To enable stateful scheduling of streaming jobs, Zoom designed a two-stage task model for streaming workloads based on DolphinScheduler's task state machine capability:
1. Submit Task – Submission Phase
- Runs on the Dolphin Worker
- Submits Flink/Spark Streaming jobs to Yarn or Kubernetes
- Considered successful once the Yarn Application enters the Running state
- Fails immediately if the submission fails
2. Track Status Task – Status Tracking Phase
- Runs on the Dolphin Master
- Periodically checks the running status of Yarn/Kubernetes jobs
- Implemented as an independent task, similar to DolphinScheduler’s built-in Dependent task
- Continuously updates job status to DolphinScheduler’s metadata center
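A minimal sketch of the two phases against the YARN ResourceManager REST API (the Kubernetes path is analogous). The helper names and RM_URL are hypothetical, not DolphinScheduler's plugin API; the sketch assumes the application id is known once the job has been handed to YARN:

```python
# Minimal sketch of the two-phase model against the YARN RM REST API.
# RM_URL and all function names are hypothetical stand-ins.
import time
import requests

RM_URL = "http://yarn-rm:8088"  # placeholder ResourceManager address

def yarn_app_state(app_id: str) -> str:
    """Fetch the current YARN application state (e.g. RUNNING, FAILED)."""
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()["app"]["state"]

def submit_task(app_id: str, timeout_s: int = 300) -> None:
    """Phase 1 (runs on the Worker): succeed once the app reaches RUNNING."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = yarn_app_state(app_id)
        if state == "RUNNING":
            return                      # Submit Task is considered successful
        if state in ("FAILED", "KILLED", "FINISHED"):
            raise RuntimeError(f"submission failed, app state: {state}")
        time.sleep(5)
    raise TimeoutError("app never reached RUNNING")

def track_status_task(app_id: str, report) -> None:
    """Phase 2 (runs on the Master): poll and push state to the metadata center."""
    while True:
        state = yarn_app_state(app_id)
        report(app_id, state)           # e.g. update DolphinScheduler metadata
        if state in ("FINISHED", "FAILED", "KILLED"):
            return
        time.sleep(30)
```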
This two-task model effectively addresses several key issues:
- Prevents duplicate submissions
- Brings streaming jobs into a unified state and logging system
- Ensures architectural consistency with batch jobs for easier maintenance and scaling
High Availability: Handling Master/Worker Failures
In large-scale production, system stability is critical. Zoom implemented robust fault-tolerance for DolphinScheduler Master and Worker nodes.
1. Worker Failure Recovery
- If the Submit Task is running and the Worker crashes:
  - The original task instance is logically deleted
  - A new task instance is created and assigned to a healthy Worker
  - The previously submitted Yarn Application is not forcibly killed
- If the Track Status Task is running:
  - No re-scheduling is needed
  - Since the task runs on the Master, the Worker failure does not impact status tracking
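A rough sketch of that failover path, assuming a DB-API cursor on the scheduler's MySQL metadata store. The table and column names loosely mirror DolphinScheduler's t_ds_task_instance schema, but treat the SQL as illustrative, not the actual Master code:

```python
# Illustrative failover handling for a crashed Worker.
def handle_worker_failure(cur, task_instance_id: int, healthy_worker: str) -> int:
    """Logically delete the orphaned instance and re-dispatch a fresh one."""
    # Soft-delete: keep the row (and its logs) but mark it unavailable.
    cur.execute(
        "UPDATE t_ds_task_instance SET flag = 0 WHERE id = %s",
        (task_instance_id,),
    )
    # Create a replacement instance bound to a healthy Worker. The already
    # submitted YARN application is deliberately left running: the new
    # Submit Task detects it (e.g. via an application tag, see the next
    # section) instead of launching a second copy.
    cur.execute(
        "INSERT INTO t_ds_task_instance (name, host, flag) "
        "SELECT name, %s, 1 FROM t_ds_task_instance WHERE id = %s",
        (healthy_worker, task_instance_id),
    )
    return cur.lastrowid
```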
2. Master Failure Recovery
- Uses ZooKeeper + MySQL for fault tolerance
- Multiple Master nodes are deployed with a distributed lock for leader election
- When a Master node fails:
  - The active node is switched automatically
  - All status-tracking tasks are reloaded and resumed
- Idempotent checks and logical deletions are key to preventing task duplication
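To make the idempotency point concrete: before any (re)submission, the scheduler can ask YARN whether an application carrying this task instance's identity is already alive. The tag convention and helper names below are hypothetical:

```python
# Sketch of an idempotency guard built on YARN application tags.
import requests

RM_URL = "http://yarn-rm:8088"  # placeholder ResourceManager address

def find_live_app(task_tag: str):
    """Return the id of a live YARN application carrying task_tag, if any."""
    resp = requests.get(
        f"{RM_URL}/ws/v1/cluster/apps",
        params={"applicationTags": task_tag,
                "states": "NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING"},
        timeout=10,
    )
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app") or []
    return apps[0]["id"] if apps else None

def submit_if_absent(task_tag: str, do_submit):
    """Idempotent submit: reuse the live app instead of launching a duplicate."""
    existing = find_live_app(task_tag)
    if existing is not None:
        return existing          # recovery path: just track the surviving app
    return do_submit(task_tag)   # first submission: tag the app on submit
```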
In summary, this architecture achieves:
- Advantage 1:
  - Leverages DolphinScheduler's workflow and task state machine features
  - Prevents duplicate job submissions
- Advantage 2:
  - Easier debugging and issue resolution
  - Streaming jobs now have task instances and logs like batch jobs
  - Supports log search and fault diagnostics
- Advantage 3:
  - Unified architecture for streaming and batch jobs
  - Improved maintainability and consistency across systems
Unified Spark and Flink Scheduling on Kubernetes
Zoom has migrated both batch and streaming jobs to Kubernetes, using Spark Operator and Flink Operator for cloud-native task orchestration.
Architecture Overview
- Spark/Flink jobs are submitted as `SparkApplication` or `FlinkDeployment` custom resources (CRs, defined by the Operators' CRDs)
- DolphinScheduler creates and manages these CRs
- Task status is synced via the Operator and Kubernetes API Server
- Dolphin Master and Worker pods continuously track pod status using the state machine and reflect it in the scheduling system
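A minimal sketch of this flow using the official Kubernetes Python client. The group/version follow the Spark Operator's v1beta2 API; the image, namespace, and paths are placeholders:

```python
# Sketch: submitting a Spark job as a SparkApplication custom resource.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "etl-streaming", "namespace": "spark-jobs"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "example.com/spark:3.5.0",        # placeholder image
        "mainClass": "com.example.StreamingJob",   # placeholder class
        "mainApplicationFile": "local:///opt/app/job.jar",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"instances": 2, "cores": 2, "memory": "4g"},
    },
}

# DolphinScheduler's role: create the CR; the Operator reconciles it.
api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io", version="v1beta2",
    namespace="spark-jobs", plural="sparkapplications", body=spark_app,
)

# The state machine then polls the CR status via the API Server:
status = api.get_namespaced_custom_object_status(
    group="sparkoperator.k8s.io", version="v1beta2",
    namespace="spark-jobs", plural="sparkapplications", name="etl-streaming",
)
print(status.get("status", {}).get("applicationState", {}).get("state"))
```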
Multi-Cloud Cluster Scheduling
- Supports scheduling across multiple cloud Kubernetes clusters (e.g., Cloud X / Cloud Y)
- Scheduling logic and resource management are fully decoupled across clusters
- Enables cross-cloud, unified management of batch and stream tasks
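Cross-cluster decoupling can be as simple as binding one API client per kubeconfig context, as in this sketch (the context names are placeholders):

```python
# Sketch of cross-cluster routing: one kubeconfig context per cloud,
# one API client per cluster.
from kubernetes import client, config

CLUSTERS = {
    "cloud-x": "cloud-x-context",   # placeholder kubeconfig context names
    "cloud-y": "cloud-y-context",
}

def custom_objects_api(cluster: str) -> client.CustomObjectsApi:
    """Build a client bound to the target cluster's kubeconfig context."""
    api_client = config.new_client_from_config(context=CLUSTERS[cluster])
    return client.CustomObjectsApi(api_client)

# Scheduling logic stays cluster-agnostic: pick a cluster, then submit the
# same CR manifest through that cluster's client.
api = custom_objects_api("cloud-x")
```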
Online Issues and Mitigation Strategies
Issue 1: Task Duplication Due to Master Crash
- Cause:
  - DolphinScheduler’s distributed locks are non-blocking, which created race conditions during Master failover
- Fixes:
  - Add a lock-acquisition timeout
  - Enforce idempotent control for Submit Tasks to avoid duplicate submissions
  - Validate task status before restoring tasks from MySQL
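A rough illustration of the first two fixes, shown here with kazoo against ZooKeeper. DolphinScheduler's actual registry and failover code are Java-side, so treat this purely as a sketch of the pattern:

```python
# Illustrative fix for the non-blocking lock race: bounded, blocking
# acquisition plus an idempotency re-check inside the critical section.
from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout

zk = KazooClient(hosts="zk:2181")  # placeholder ZooKeeper address
zk.start()

lock = zk.Lock("/dolphinscheduler/lock/failover", "master-1")
try:
    lock.acquire(timeout=10)        # bounded wait instead of racing
except LockTimeout:
    # Another Master holds the lock; skip rather than double-process.
    pass
else:
    try:
        # Re-validate state before acting, so a replayed failover becomes a
        # no-op (idempotent control), e.g. submit_if_absent() from above.
        pass
    finally:
        lock.release()
```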
Issue 2: Workflow Stuck in the READY_STOP State
- Cause:
  - The Dolphin API lacked optimistic locking when stopping workflows
  - Race conditions during multi-threaded state updates left workflows stuck
- Improvements:
  - Add optimistic locks at the API layer
  - Refactor long-transaction logic
  - Add multiple layers of state verification when the Master updates task status
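At its core, an optimistic lock here is a compare-and-swap UPDATE. The sketch below is illustrative: the table name loosely mirrors DolphinScheduler's t_ds_process_instance schema, where the real state column is an integer code rather than a string:

```python
# Sketch of the optimistic-lock fix: the UPDATE only succeeds if the workflow
# is still in the state the caller read, so concurrent writers cannot clobber
# each other.
def stop_workflow(cur, instance_id: int, expected_state: str) -> bool:
    cur.execute(
        "UPDATE t_ds_process_instance "
        "SET state = 'READY_STOP' "
        "WHERE id = %s AND state = %s",   # CAS: guard on the state we read
        (instance_id, expected_state),
    )
    # rowcount == 0 means another thread changed the state first; the caller
    # re-reads and retries (or gives up) instead of leaving a stuck workflow.
    return cur.rowcount == 1
```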
Future Plans
Zoom plans to further optimize DolphinScheduler to meet increasingly complex production demands. The main areas of focus include:
1. Asynchronous Task Mechanism
- Decouple submission and status tracking logic
- Allow Worker nodes to execute tasks asynchronously, so long-running tasks no longer tie up Worker resources
- Lays the foundation for elastic scheduling and advanced dependency handling
2. Upgraded Unified Batch-Stream Scheduling Platform
- Workflow templates will support mixed task types
- Fully unified logs, states, and monitoring
- Enhanced cloud-native capabilities to build a distributed computing hub for large-scale production scheduling
Final Thoughts
Zoom’s in-depth practice with DolphinScheduler proves the platform’s scalability, stability, and architectural flexibility as an enterprise-grade scheduler. Especially in unified batch-stream scheduling, cloud-native deployment on Kubernetes, and multi-cluster fault tolerance, Zoom’s architecture offers valuable lessons for the community and other enterprise users.
📢 We warmly welcome more developers to join the Apache DolphinScheduler community—share your insights and experiences, and help us build the next-generation open-source scheduler together!