As Zoom’s business expanded and its data scenarios grew more complex, the company’s scheduling needs also evolved, from traditional batch processing to unified management of streaming jobs. To address this, Zoom selected Apache DolphinScheduler as its core scheduling framework and built a unified scheduling platform that supports both batch and streaming tasks. The platform has been deeply customized and optimized on modern infrastructure, including Kubernetes and multi-cloud deployments. In this article, we’ll dive into the system’s architectural evolution, the key challenges Zoom encountered, how they were solved, and the team’s future plans, all based on real-world production experience.
Background & Challenges: Expanding from Batch to Streaming
In its early stages, Zoom’s data platform focused primarily on Spark SQL batch processing, with tasks scheduled using DolphinScheduler's standard plugins on AWS EMR.
However, new business demands led to a surge in real-time processing needs, such as:
- Real-time metrics computation using Flink SQL
- Spark Structured Streaming for processing logs and event data
- Long-running streaming tasks requiring state tracking and fault recovery
This posed a new challenge for DolphinScheduler: How can streaming tasks be “scheduled” and “managed” just like batch tasks?
Limitations of the Initial Architecture
The Original Approach
In the early integration of streaming jobs, Zoom used DolphinScheduler's Shell task plugin to call the AWS EMR API and launch streaming tasks (e.g., Spark/Flink).
This implementation was simple but quickly revealed several issues:
- No state control: After submission, the task exited immediately without tracking status—causing duplicate submissions or false failures.
- No task instances or logs: Troubleshooting was difficult due to missing logs and observability.
- Fragmented logic: Streaming and batch jobs used different logic paths, making unified maintenance hard.
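For illustration, the fire-and-forget pattern boiled down to something like the boto3 sketch below (the cluster ID, JAR path, and class name are placeholders, not Zoom's actual job). The call returns as soon as EMR accepts the step, so a Shell task wrapping it exits "successfully" with no view of the job's real fate:

```python
# Illustrative only: the original fire-and-forget pattern, sketched with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
    Steps=[{
        "Name": "streaming-job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", "com.example.StreamingJob",
                     "s3://bucket/jobs/streaming-job.jar"],  # placeholder paths
        },
    }],
)

# The API returns once the step is *accepted*, not when the job is healthy.
# The Shell task exits here, so DolphinScheduler marks it done and never sees
# later failures of the long-running job -- the "no state control" problem.
print("submitted step:", response["StepIds"][0])
```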
These issues highlighted the urgent need for a unified batch-stream scheduling architecture.
System Evolution: Introducing a State Machine for Streaming Jobs
To enable stateful scheduling of streaming jobs, Zoom designed a two-stage task model for streaming workloads based on DolphinScheduler's task state machine capability:
1. Submit Task – Submission Phase
- Runs on the Dolphin Worker
- Submits Flink/Spark Streaming jobs to Yarn or Kubernetes
- Considered successful once the Yarn Application enters the Running state
- Fails immediately if the submission fails
2. Track Status Task – Status Tracking Phase
- Runs on the Dolphin Master
- Periodically checks the running status of Yarn/Kubernetes jobs
- Implemented as an independent task, similar to DolphinScheduler’s built-in Dependent task
- Continuously updates job status to DolphinScheduler’s metadata center
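A minimal sketch of the two phases against the YARN ResourceManager REST API (the Kubernetes path is analogous). The helper names and RM_URL are hypothetical, not DolphinScheduler's plugin API; the sketch assumes the application id is known once the job has been handed to YARN:

```python
# Minimal sketch of the two-phase model against the YARN RM REST API.
# RM_URL and all function names are hypothetical stand-ins.
import time
import requests

RM_URL = "http://yarn-rm:8088"  # placeholder ResourceManager address

def yarn_app_state(app_id: str) -> str:
    """Fetch the current YARN application state (e.g. RUNNING, FAILED)."""
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()["app"]["state"]

def submit_task(app_id: str, timeout_s: int = 300) -> None:
    """Phase 1 (runs on the Worker): succeed once the app reaches RUNNING."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = yarn_app_state(app_id)
        if state == "RUNNING":
            return                      # Submit Task is considered successful
        if state in ("FAILED", "KILLED", "FINISHED"):
            raise RuntimeError(f"submission failed, app state: {state}")
        time.sleep(5)
    raise TimeoutError("app never reached RUNNING")

def track_status_task(app_id: str, report) -> None:
    """Phase 2 (runs on the Master): poll and push state to the metadata center."""
    while True:
        state = yarn_app_state(app_id)
        report(app_id, state)           # e.g. update DolphinScheduler metadata
        if state in ("FINISHED", "FAILED", "KILLED"):
            return
        time.sleep(30)
```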
This two-task model effectively addresses several key issues:
- Prevents duplicate submissions
- Brings streaming jobs into a unified state and logging system
- Ensures architectural consistency with batch jobs for easier maintenance and scaling
High Availability: Handling Master/Worker Failures
In large-scale production, system stability is critical. Zoom implemented robust fault-tolerance for DolphinScheduler Master and Worker nodes.
1. Worker Failure Recovery
- If the Submit Task is running and the Worker crashes:
  - The original task instance is logically deleted
  - A new task instance is created and assigned to a healthy Worker
  - The previously submitted Yarn Application is not forcibly killed
- If the Track Status Task is running:
  - No re-scheduling is needed
  - Since the task runs on the Master, the Worker failure does not impact status tracking
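A rough sketch of that failover path, assuming a DB-API cursor on the scheduler's MySQL metadata store. The table and column names loosely mirror DolphinScheduler's t_ds_task_instance schema, but treat the SQL as illustrative, not the actual Master code:

```python
# Illustrative failover handling for a crashed Worker.
def handle_worker_failure(cur, task_instance_id: int, healthy_worker: str) -> int:
    """Logically delete the orphaned instance and re-dispatch a fresh one."""
    # Soft-delete: keep the row (and its logs) but mark it unavailable.
    cur.execute(
        "UPDATE t_ds_task_instance SET flag = 0 WHERE id = %s",
        (task_instance_id,),
    )
    # Create a replacement instance bound to a healthy Worker. The already
    # submitted YARN application is deliberately left running: the new
    # Submit Task detects it (e.g. via an application tag, see the next
    # section) instead of launching a second copy.
    cur.execute(
        "INSERT INTO t_ds_task_instance (name, host, flag) "
        "SELECT name, %s, 1 FROM t_ds_task_instance WHERE id = %s",
        (healthy_worker, task_instance_id),
    )
    return cur.lastrowid
```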
2. Master Failure Recovery
- Uses ZooKeeper + MySQL for fault tolerance
- Multiple Master nodes are deployed with a distributed lock for leader election
- When a Master node fails:
  - The active node is switched automatically
  - All status-tracking tasks are reloaded and resumed
- Idempotent checks and logical deletions are key to preventing task duplication
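To make the idempotency point concrete: before any (re)submission, the scheduler can ask YARN whether an application carrying this task instance's identity is already alive. The tag convention and helper names below are hypothetical:

```python
# Sketch of an idempotency guard built on YARN application tags.
import requests

RM_URL = "http://yarn-rm:8088"  # placeholder ResourceManager address

def find_live_app(task_tag: str):
    """Return the id of a live YARN application carrying task_tag, if any."""
    resp = requests.get(
        f"{RM_URL}/ws/v1/cluster/apps",
        params={"applicationTags": task_tag,
                "states": "NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING"},
        timeout=10,
    )
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app") or []
    return apps[0]["id"] if apps else None

def submit_if_absent(task_tag: str, do_submit):
    """Idempotent submit: reuse the live app instead of launching a duplicate."""
    existing = find_live_app(task_tag)
    if existing is not None:
        return existing          # recovery path: just track the surviving app
    return do_submit(task_tag)   # first submission: tag the app on submit
```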
In summary, this architecture achieves:
- Advantage 1:
  - Leverages DolphinScheduler's workflow and task state machine features
  - Prevents duplicate job submissions
- Advantage 2:
  - Easier debugging and issue resolution
  - Streaming jobs now have task instances and logs like batch jobs
  - Supports log search and fault diagnostics
- Advantage 3:
  - Unified architecture for streaming and batch jobs
  - Improved maintainability and consistency across systems
Unified Spark and Flink Scheduling on Kubernetes
Zoom has migrated both batch and streaming jobs to Kubernetes, using Spark Operator and Flink Operator for cloud-native task orchestration.
Architecture Overview
- Spark/Flink jobs are submitted as `SparkApplication` or `FlinkDeployment` custom resources (CRs, defined by the Operators' CRDs)
- DolphinScheduler creates and manages these CRs
- Task status is synced via the Operator and Kubernetes API Server
- Dolphin Master and Worker pods continuously track pod status using the state machine and reflect it in the scheduling system
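A minimal sketch of this flow using the official Kubernetes Python client. The group/version follow the Spark Operator's v1beta2 API; the image, namespace, and paths are placeholders:

```python
# Sketch: submitting a Spark job as a SparkApplication custom resource.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "etl-streaming", "namespace": "spark-jobs"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "example.com/spark:3.5.0",        # placeholder image
        "mainClass": "com.example.StreamingJob",   # placeholder class
        "mainApplicationFile": "local:///opt/app/job.jar",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"instances": 2, "cores": 2, "memory": "4g"},
    },
}

# DolphinScheduler's role: create the CR; the Operator reconciles it.
api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io", version="v1beta2",
    namespace="spark-jobs", plural="sparkapplications", body=spark_app,
)

# The state machine then polls the CR status via the API Server:
status = api.get_namespaced_custom_object_status(
    group="sparkoperator.k8s.io", version="v1beta2",
    namespace="spark-jobs", plural="sparkapplications", name="etl-streaming",
)
print(status.get("status", {}).get("applicationState", {}).get("state"))
```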
Multi-Cloud Cluster Scheduling
- Supports scheduling across multiple cloud Kubernetes clusters (e.g., Cloud X / Cloud Y)
- Scheduling logic and resource management are fully decoupled across clusters
- Enables cross-cloud, unified management of batch and stream tasks
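Cross-cluster decoupling can be as simple as binding one API client per kubeconfig context, as in this sketch (the context names are placeholders):

```python
# Sketch of cross-cluster routing: one kubeconfig context per cloud,
# one API client per cluster.
from kubernetes import client, config

CLUSTERS = {
    "cloud-x": "cloud-x-context",   # placeholder kubeconfig context names
    "cloud-y": "cloud-y-context",
}

def custom_objects_api(cluster: str) -> client.CustomObjectsApi:
    """Build a client bound to the target cluster's kubeconfig context."""
    api_client = config.new_client_from_config(context=CLUSTERS[cluster])
    return client.CustomObjectsApi(api_client)

# Scheduling logic stays cluster-agnostic: pick a cluster, then submit the
# same CR manifest through that cluster's client.
api = custom_objects_api("cloud-x")
```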
Online Issues and Mitigation Strategies
Issue 1: Task Duplication Due to Master Crash
- Cause:
  - DolphinScheduler’s distributed locks are non-blocking, which created race conditions during Master failover
- Fixes:
  - Add a lock-acquisition timeout
  - Enforce idempotent control for Submit Tasks to avoid duplicate submissions
  - Validate task status before restoring tasks from MySQL
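A rough illustration of the first two fixes, shown here with kazoo against ZooKeeper. DolphinScheduler's actual registry and failover code are Java-side, so treat this purely as a sketch of the pattern:

```python
# Illustrative fix for the non-blocking lock race: bounded, blocking
# acquisition plus an idempotency re-check inside the critical section.
from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout

zk = KazooClient(hosts="zk:2181")  # placeholder ZooKeeper address
zk.start()

lock = zk.Lock("/dolphinscheduler/lock/failover", "master-1")
try:
    lock.acquire(timeout=10)        # bounded wait instead of racing
except LockTimeout:
    # Another Master holds the lock; skip rather than double-process.
    pass
else:
    try:
        # Re-validate state before acting, so a replayed failover becomes a
        # no-op (idempotent control), e.g. submit_if_absent() from above.
        pass
    finally:
        lock.release()
```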
Issue 2: Workflow Stuck in the READY_STOP State
- Cause:
  - The Dolphin API lacked optimistic locking when stopping workflows
  - Race conditions during multi-threaded state updates left workflows stuck
- Improvements:
  - Add optimistic locks at the API layer
  - Refactor long-transaction logic
  - Add multiple layers of state verification when the Master updates task status
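At its core, an optimistic lock here is a compare-and-swap UPDATE. The sketch below is illustrative: the table name loosely mirrors DolphinScheduler's t_ds_process_instance schema, where the real state column is an integer code rather than a string:

```python
# Sketch of the optimistic-lock fix: the UPDATE only succeeds if the workflow
# is still in the state the caller read, so concurrent writers cannot clobber
# each other.
def stop_workflow(cur, instance_id: int, expected_state: str) -> bool:
    cur.execute(
        "UPDATE t_ds_process_instance "
        "SET state = 'READY_STOP' "
        "WHERE id = %s AND state = %s",   # CAS: guard on the state we read
        (instance_id, expected_state),
    )
    # rowcount == 0 means another thread changed the state first; the caller
    # re-reads and retries (or gives up) instead of leaving a stuck workflow.
    return cur.rowcount == 1
```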
Future Plans
Zoom plans to further optimize DolphinScheduler to meet increasingly complex production demands. The main areas of focus include:
1. Asynchronous Task Mechanism
- Decouple submission and status tracking logic
- Allow Worker nodes to execute tasks asynchronously, so long-running tasks no longer tie up Worker resources
- Lays the foundation for elastic scheduling and advanced dependency handling
2. Upgraded Unified Batch-Stream Scheduling Platform
- Workflow templates will support mixed task types
- Fully unified logs, states, and monitoring
- Enhanced cloud-native capabilities to build a distributed computing hub for large-scale production scheduling
Final Thoughts
Zoom’s in-depth practice with DolphinScheduler proves the platform’s scalability, stability, and architectural flexibility as an enterprise-grade scheduler. Especially in unified batch-stream scheduling, cloud-native deployment on Kubernetes, and multi-cluster fault tolerance, Zoom’s architecture offers valuable lessons for the community and other enterprise users.
📢 We warmly welcome more developers to join the Apache DolphinScheduler community—share your insights and experiences, and help us build the next-generation open-source scheduler together!