I. Why Did We Abandon Azkaban?

At the beginning, we chose LinkedIn’s open-source Azkaban for scheduling mainly because of two features we valued: first, the clean interface and simple operation; second, the use of “projects” to manage tasks, which felt very intuitive. At that time, the team was just starting to build the data platform, and this lightweight and clear tool perfectly matched our needs. There were also other reasons:

However, as the business scale expanded, Azkaban’s shortcomings gradually surfaced:

  1. Lack of an automatic task failure retry mechanism

Azkaban’s retry strategy is extremely primitive: you either rerun the task manually or rely on external scripts that poll its status and trigger a rerun. We once had a Hive task fail because of a temporary resource shortage, which blocked more than 20 downstream tasks and forced on-call engineers to intervene in the middle of the night.
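To make the "external script" workaround concrete, here is a minimal sketch of a submit, poll, and resubmit loop. It is not our production script and it does not call any real Azkaban API: `submit` and `get_status` are hypothetical callables you would back with your own client, and the status strings `RUNNING` and `SUCCEEDED` are assumptions.

```python
import time
from typing import Callable


def run_with_retry(
    submit: Callable[[], str],
    get_status: Callable[[str], str],
    max_attempts: int = 3,
    poll_seconds: float = 30.0,
    backoff_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Submit a job, poll until it finishes, and resubmit on failure.

    `submit` returns an execution id and `get_status` returns a status
    string for that id. Both are hypothetical stand-ins, not Azkaban APIs.
    """
    for attempt in range(1, max_attempts + 1):
        exec_id = submit()
        status = get_status(exec_id)
        while status == "RUNNING":
            sleep(poll_seconds)
            status = get_status(exec_id)
        if status == "SUCCEEDED":
            return True
        if attempt < max_attempts:
            # Wait longer after each failure so a temporary resource
            # shortage has time to clear before the next attempt.
            sleep(backoff_seconds * attempt)
    return False
```

Every team that needs retries ends up maintaining something like this outside the scheduler, which is exactly the operational burden we wanted to get rid of.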

  1. Coarse-grained permission control

Azkaban’s permission model only supports project-level read or write. It cannot achieve “User A can only modify Task X but not Task Y.” When multiple teams share the same scheduling platform, such permission chaos frequently leads to misoperations.

  1. No task version management

Every modification of a job file overwrites history with no rollback. We once spent two days investigating incorrect ETL results caused by a single parameter change, just because there was no version traceability.

  1. Poor extensibility

Azkaban’s plugin mechanism is honestly underwhelming. Integrating WeCom (Enterprise WeChat) alerts, syncing with our internal CMDB, or supporting Spark on K8s all required source code changes. Meanwhile, community updates are slow, and GitHub issues pile up and often go unanswered.

Reflection: Azkaban works fine for small teams with simple workloads. But once the data platform scales up and more teams join, you’ll quickly notice its architectural limitations, and the pain points will keep popping up.

II. Why Choose DolphinScheduler?

By the end of 2022, we began evaluating alternatives — comparing Airflow, XXL-JOB, DolphinScheduler, and other popular schedulers. We ultimately selected DolphinScheduler (hereafter DS), based mainly on:

  1. Rich built-in task types

DS ships with more than a dozen built-in task types, including Shell, SQL, Spark, Flink, DataX, and Python, and it supports custom plugins, so there is no more writing wrapper scripts for everything.

  1. Comprehensive failure-handling mechanism
  1. Fine-grained permission control

Permission management in DS is very detailed. Permissions can be set at the tenant, project, workflow, and even task level, which makes collaboration across multiple teams much safer.

  1. Visual DAG + version management

Drag-and-drop DAG editing supports dependencies, conditional branches, and subprocesses.

Every workflow release automatically saves a version, and you can roll back to any historical version.

  1. Active Chinese community

As an Apache top-level project, DS has a large user base in China, complete documentation, and a responsive community. Several of our production issues were answered in the community within 24 hours.

III. Real Migration Case: From Azkaban to DolphinScheduler

Background

Migration steps

  1. Task inventory and classification
  1. DS environment deployment and testing
  1. Task refactoring and validation
  1. Gray release switching

Pitfalls we encountered

In Azkaban, ${date} automatically injects the current date, whereas DS requires you to define global parameters explicitly or use built-in system parameters such as ${system.datetime}. We wrote a script to convert the parameter syntax automatically.
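The conversion itself can be a small regex pass. The sketch below is an illustration rather than our actual script: `PARAM_MAP` and `convert_params` are hypothetical names, and the only mapping included is the ${date} to ${system.datetime} pair mentioned above. Before relying on any mapping, check that the target built-in produces the date format each job expects.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mapping from Azkaban-style placeholders to DS parameters.
# Only the pair named in the text is listed; extend it for your own jobs.
PARAM_MAP: Dict[str, str] = {
    "date": "system.datetime",
}

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def convert_params(text: str, mapping: Dict[str, str] = PARAM_MAP) -> Tuple[str, List[str]]:
    """Rewrite known placeholders and list the ones that need manual review."""
    unknown: List[str] = []

    def replace(match):
        name = match.group(1).strip()
        if name in mapping:
            return "${" + mapping[name] + "}"
        if name not in mapping.values():
            # Neither convertible nor already converted: flag for review.
            unknown.append(name)
        return match.group(0)

    return _PLACEHOLDER.sub(replace, text), unknown
```

Returning the unrecognized placeholders alongside the converted text turned out to be the useful part in a migration like this: anything the script cannot map should be reviewed by a person rather than guessed at.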

Previously, everything ran in the same YARN queue. Long-running jobs hogged all the resources and left small ones queued indefinitely. Later, we allocated separate users and queues per business line, and the mutual interference finally stopped.

At first, every little task failure triggered an alert. With hundreds a day, it was overwhelming. Later, we tuned the strategy: core jobs alert immediately, while non-core ones are rolled into a daily summary instead. Much cleaner.
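The routing rule is simple enough to sketch in a few lines. This is an illustration of the idea, not DS configuration or our production code: `AlertRouter`, `Failure`, the set of core workflow names, and the `send` callback are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Failure:
    workflow: str
    task: str
    message: str


class AlertRouter:
    """Send core failures right away and batch the rest into a digest."""

    def __init__(self, core_workflows: Iterable[str], send: Callable[[str], None]):
        # Both arguments are assumptions: a collection of workflow names
        # treated as core, and any function that delivers a message.
        self.core_workflows = set(core_workflows)
        self.send = send
        self.pending: List[Failure] = []

    def on_failure(self, failure: Failure) -> None:
        if failure.workflow in self.core_workflows:
            self.send(f"[CORE] {failure.workflow}/{failure.task}: {failure.message}")
        else:
            self.pending.append(failure)

    def flush_daily_summary(self) -> None:
        """Call once a day, for example from a scheduled job."""
        if not self.pending:
            return
        by_workflow: Dict[str, int] = defaultdict(int)
        for failure in self.pending:
            by_workflow[failure.workflow] += 1
        lines = [f"{name}: {count} failed task(s)" for name, count in sorted(by_workflow.items())]
        self.send("Daily summary of non-core failures\n" + "\n".join(lines))
        self.pending.clear()
```

The important design choice is that the core list is explicit. If everything is marked as core, you are back to hundreds of alerts a day.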

IV. Practical Suggestions for Future Migrators

  1. Don’t blindly pursue “big and comprehensive”

If you only have a few dozen Shell tasks, Cron + simple monitoring might be more efficient. Scheduling systems incur operational costs — evaluate ROI first.
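For reference, "Cron + simple monitoring" can be as small as a wrapper that runs the job and reports only on failure. The sketch below assumes a hypothetical `notify` callback (mail, chat webhook, or anything else you already have); the function name `run_and_report` is also made up for illustration.

```python
import subprocess
from typing import Callable, List


def run_and_report(command: List[str], notify: Callable[[str], None]) -> int:
    """Run a command and call `notify` only when it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Keep only the tail of stderr so the message stays readable.
        tail = result.stderr.strip()[-500:]
        notify(f"{' '.join(command)} failed with exit code {result.returncode}: {tail}")
    return result.returncode
```

A crontab entry would then call a small script built around this function for each Shell task. If that covers your needs, a full scheduling platform may not pay for itself yet.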

  1. Take permissions and tenant design seriously

Plan tenant structure from day one (e.g., by business line), or chaos will follow later. Enable workflow approval for key task changes.

  1. Establish workflow health indicators

We use Prometheus + Grafana to monitor these indicators and detect risks early.

  1. Make good use of SubProcess

Complicated DAGs quickly become a messy tangle. Pack reusable logic — e.g., data quality checks, log archiving — into subprocesses. Easier reuse, easier maintenance.

  1. Backup and disaster recovery are mandatory

V. Quick Comparison Table: Azkaban vs DolphinScheduler

| Capability | Azkaban | DolphinScheduler |
| --- | --- | --- |
| Task Retry | ❌ (Manual Required) | ✅ (Configurable) |
| Fine-grained Permission | ❌ (Project-level Only) | ✅ (Task-level) |
| Version Control | ❌ | ✅ |
| Built-in Task Types | Limited (Mainly Shell) | Diverse (Including Spark/Flink) |
| Community Activity (2025) | Low | ✅ High (Apache Project) |
| Visual DAG | Weak (Only Dependency Graph) | ✅ |

Choosing a tool is not the finish line, but the starting point for continuous optimization.