As Zoom’s business expanded and its data scenarios grew more complex, the company’s scheduling needs evolved from traditional batch processing to unified management of streaming jobs. To address this, Zoom selected Apache DolphinScheduler as its core scheduling framework and built a unified scheduling platform that supports both batch and stream tasks, deeply customized and optimized on modern infrastructure such as Kubernetes and multi-cloud deployments. In this article, we’ll dive into the system’s architectural evolution, the key challenges Zoom encountered and how they were solved, and the team’s future plans, all based on real-world production experience.

Background & Challenges: Expanding from Batch to Streaming

In its early stages, Zoom’s data platform focused primarily on Spark SQL batch processing, with tasks scheduled using DolphinScheduler's standard plugins on AWS EMR.

However, new business demands drove a surge in real-time processing requirements.

This posed a new challenge for DolphinScheduler: How can streaming tasks be “scheduled” and “managed” just like batch tasks?

Limitations of the Initial Architecture

The Original Approach

In the early integration of streaming jobs, Zoom used DolphinScheduler's Shell task plugin to call the AWS EMR API and launch streaming tasks (e.g., Spark/Flink).
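
To make that pattern concrete, here is a minimal Java sketch of the fire-and-forget submission using the AWS SDK for EMR (the production version went through a shell script; the cluster ID, jar path, and main class below are placeholders):

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class FireAndForgetSubmit {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // command-runner.jar lets an EMR step invoke spark-submit directly.
        HadoopJarStepConfig sparkSubmit = new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit",
                          "--deploy-mode", "cluster",
                          "--class", "com.example.StreamingJob",    // placeholder
                          "s3://my-bucket/jobs/streaming-job.jar"); // placeholder

        StepConfig step = new StepConfig()
                .withName("streaming-job")
                .withActionOnFailure(ActionOnFailure.CONTINUE)
                .withHadoopJarStep(sparkSubmit);

        AddJobFlowStepsResult result = emr.addJobFlowSteps(
                new AddJobFlowStepsRequest()
                        .withJobFlowId("j-XXXXXXXXXXXXX") // placeholder cluster ID
                        .withSteps(step));

        // The call returns as soon as EMR accepts the step; nothing here
        // tracks whether the long-running streaming job is actually healthy.
        System.out.println("Submitted step: " + result.getStepIds().get(0));
    }
}
```

The flaw is visible in the last lines: `addJobFlowSteps` returns as soon as the step is accepted, so from the scheduler's point of view the "task" succeeds immediately, whatever happens to the streaming job afterwards.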

This implementation was simple but quickly revealed several issues:

  1. No state control: After submission, the task exited immediately without tracking status—causing duplicate submissions or false failures.
  2. No task instances or logs: Troubleshooting was difficult due to missing logs and observability.
  3. Fragmented logic: Streaming and batch jobs used different logic paths, making unified maintenance hard.

These issues highlighted the urgent need for a unified batch-stream scheduling architecture.

System Evolution: Introducing a State Machine for Streaming Jobs

To enable stateful scheduling of streaming jobs, Zoom designed a two-stage task model on top of DolphinScheduler's task state machine capability:

1. Submit Task – Submission Phase

The Submit task hands the streaming job to the compute engine, records the returned application handle (for example, an EMR step ID or a YARN/Kubernetes application ID), and then finishes quickly.

2. Track Status Task – Status Tracking Phase

The Track Status task runs for the lifetime of the stream: it polls the engine with the persisted handle and maps the job's status back onto DolphinScheduler task states, giving the streaming job a real task instance with logs.

This two-task model closes the gaps listed earlier: submissions become stateful instead of fire-and-forget, every streaming job has a task instance and logs for troubleshooting, and streaming jobs flow through the same scheduling path as batch jobs.
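
As a rough illustration of the split (class and method names below are hypothetical, not DolphinScheduler APIs):

```java
import java.time.Duration;

/** Minimal sketch of the two-stage model; all names are illustrative. */
interface StreamingEngineClient {
    String submit(String jobSpec);       // returns an application handle
    JobState poll(String applicationId); // current state of the running job
}

enum JobState { RUNNING, SUCCEEDED, FAILED }

/** Phase 1: submit the job, persist the handle, and exit quickly. */
class SubmitTask {
    String run(StreamingEngineClient engine, TaskContext ctx) {
        // Idempotency guard: if a previous attempt already submitted the
        // job, reuse its handle instead of launching a duplicate.
        String appId = ctx.loadPersistedAppId();
        if (appId == null) {
            appId = engine.submit(ctx.jobSpec());
            ctx.persistAppId(appId); // survives Master/Worker failover
        }
        return appId;
    }
}

/** Phase 2: a long-lived task instance that mirrors the job's state. */
class TrackStatusTask {
    void run(StreamingEngineClient engine, TaskContext ctx) throws InterruptedException {
        String appId = ctx.loadPersistedAppId();
        while (true) {
            JobState state = engine.poll(appId);
            ctx.reportState(state); // visible in the task instance and its logs
            if (state != JobState.RUNNING) break;
            Thread.sleep(Duration.ofSeconds(30).toMillis());
        }
    }
}

interface TaskContext {
    String jobSpec();
    String loadPersistedAppId();
    void persistAppId(String appId);
    void reportState(JobState state);
}
```

Because the application handle is persisted, either task can be retried or re-dispatched after a failure without launching a second copy of the streaming job.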

High Availability: Handling Master/Worker Failures

In large-scale production, system stability is critical. Zoom implemented robust fault tolerance for DolphinScheduler Master and Worker nodes.

1. Worker Failure Recovery

When a Worker goes down, the Master detects the lost heartbeat through the ZooKeeper-based registry and fails the affected task instances over to healthy Workers. For streaming jobs this is safe: a re-dispatched Track Status task reads the persisted application handle and resumes polling, so the streaming job itself is never resubmitted.

2. Master Failure Recovery

When a Master goes down, a surviving Master acquires the failover lock, takes over the dead Master's workflow instances, and resumes them from the state recorded in the database.

In summary, this architecture keeps scheduling state durable and recovery idempotent: a Master or Worker crash interrupts status tracking only briefly, while the streaming jobs themselves keep running untouched.

Cloud-Native Deployment: Moving to Kubernetes

Building on this foundation, Zoom has migrated both batch and streaming jobs to Kubernetes, using the Spark Operator and the Flink Operator for cloud-native task orchestration.
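
To show what operator-based orchestration looks like, here is an illustrative submission (not Zoom's actual code) of a SparkApplication custom resource using the fabric8 Kubernetes client; the namespace, image, and application jar are placeholders, while the CRD coordinates are those of the open-source Spark Operator:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.api.model.GenericKubernetesResourceBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.ResourceDefinitionContext;

import java.util.Map;

public class SparkOperatorSubmit {
    public static void main(String[] args) {
        // CRD coordinates for the Spark Operator's SparkApplication resource.
        ResourceDefinitionContext sparkApp = new ResourceDefinitionContext.Builder()
                .withGroup("sparkoperator.k8s.io")
                .withVersion("v1beta2")
                .withKind("SparkApplication")
                .withPlural("sparkapplications")
                .withNamespaced(true)
                .build();

        GenericKubernetesResource cr = new GenericKubernetesResourceBuilder()
                .withApiVersion("sparkoperator.k8s.io/v1beta2")
                .withKind("SparkApplication")
                .withNewMetadata().withName("streaming-job").withNamespace("spark-jobs").endMetadata()
                .addToAdditionalProperties("spec", Map.of(
                        "type", "Scala",
                        "mode", "cluster",
                        "image", "my-registry/spark:3.5.0",                // placeholder
                        "mainClass", "com.example.StreamingJob",           // placeholder
                        "mainApplicationFile", "local:///opt/app/job.jar", // placeholder
                        "sparkVersion", "3.5.0",
                        "driver", Map.of("cores", 1, "memory", "2g", "serviceAccount", "spark"),
                        "executor", Map.of("cores", 2, "instances", 4, "memory", "4g")))
                .build();

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Creating the custom resource hands the job to the operator,
            // which owns the driver/executor pod lifecycle from here on.
            client.genericKubernetesResources(sparkApp)
                  .inNamespace("spark-jobs")
                  .resource(cr)
                  .create();
        }
    }
}
```

The scheduler's job shrinks to creating and watching this resource; pod lifecycle, restarts, and cleanup become the operator's responsibility.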

Architecture Overview

Multi-Cloud Cluster Scheduling

Production Issues and Mitigation Strategies

Issue 1: Task Duplication Due to Master Crash

DolphinScheduler’s distributed locks are non-blocking, which leaves a race window: if a Master crashes mid-dispatch, the Master that takes over its workflow instances may re-dispatch tasks whose original submission had already gone through, producing duplicate task runs.
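
One common mitigation for this class of race is to make dispatch an atomic conditional state transition, so that only one Master can claim a given task. A minimal JDBC sketch, with illustrative table and column names rather than DolphinScheduler's actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DispatchGuard {
    /**
     * Atomically claim a task for dispatch. Returns true only for the one
     * Master whose UPDATE actually flips the state, so a takeover Master
     * re-processing the same workflow cannot dispatch the task twice.
     * Table and column names are illustrative, not DolphinScheduler's schema.
     */
    public boolean tryClaim(Connection conn, long taskInstanceId, String masterAddress)
            throws SQLException {
        String sql = "UPDATE task_instance "
                   + "SET state = 'DISPATCHED', claimed_by = ? "
                   + "WHERE id = ? AND state = 'SUBMITTED'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, masterAddress);
            ps.setLong(2, taskInstanceId);
            return ps.executeUpdate() == 1; // 0 rows => another Master won the claim
        }
    }
}
```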

Issue 2: Workflow Stuck in READY_STOP State

Future Plans

Zoom plans to further optimize DolphinScheduler to meet increasingly complex production demands. The main areas of focus include:

1. Asynchronous Task Mechanism
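
The idea behind an asynchronous task mechanism is to stop dedicating a blocked worker thread to each long-running Track Status task and instead multiplex many tracked jobs onto a small shared poller. A hypothetical sketch, reusing the StreamingEngineClient interface from the earlier example:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a small shared pool polls many streaming jobs. */
public class AsyncStatusPoller {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    private final Map<String, ScheduledFuture<?>> polls = new ConcurrentHashMap<>();

    /** Watch a job without tying up a worker thread for its whole lifetime. */
    public void watch(String appId, StreamingEngineClient engine, Runnable onFinish) {
        ScheduledFuture<?> future = scheduler.scheduleAtFixedRate(() -> {
            if (engine.poll(appId) != JobState.RUNNING) {
                ScheduledFuture<?> self = polls.remove(appId);
                if (self != null) self.cancel(false); // stop polling this job
                onFinish.run();                       // report completion to the Master
            }
        }, 30, 30, TimeUnit.SECONDS);
        polls.put(appId, future);
    }
}
```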

2. Upgraded Unified Batch-Stream Scheduling Platform

Final Thoughts

Zoom’s in-depth practice with DolphinScheduler demonstrates the platform’s scalability, stability, and architectural flexibility as an enterprise-grade scheduler. In unified batch-stream scheduling, cloud-native deployment on Kubernetes, and multi-cluster fault tolerance especially, Zoom’s architecture offers valuable lessons for the community and other enterprise users.

📢 We warmly welcome more developers to join the Apache DolphinScheduler community—share your insights and experiences, and help us build the next-generation open-source scheduler together!

GitHub: https://github.com/apache/dolphinscheduler