sia.hackernoon.com

I. Why Did We Abandon Azkaban?

At the beginning, we chose LinkedIn’s open-source Azkaban for scheduling mainly because of two features we valued: first, the clean interface and simple operation; second, the use of “projects” to manage tasks, which felt very intuitive. At that time, the team was just starting to build the data platform, and this lightweight and clear tool perfectly matched our needs. There were also other reasons:

Active community (at the time)
Simple deployment with few dependencies (only MySQL + Web Server + Executor)
Supports job file–defined dependencies, suitable for DAG scenarios

However, as the business scale expanded, Azkaban’s shortcomings gradually surfaced:

Lack of an automatic task failure retry mechanism

Azkaban’s retry strategy is extremely primitive: either manually rerun the task or trigger it by polling status through external scripts. We once experienced a case where a Hive task failed due to a temporary resource shortage, causing more than 20 downstream tasks to be blocked, forcing on-call engineers to intervene in the middle of the night.

Coarse-grained permission control

Azkaban’s permission model only supports project-level read or write. It cannot achieve “User A can only modify Task X but not Task Y.” When multiple teams share the same scheduling platform, such permission chaos frequently leads to misoperations.

No task version management

Every modification of a job file overwrites history with no rollback. We once spent two days investigating incorrect ETL results caused by a single parameter change, just because there was no version traceability.

Poor extensibility

Azkaban’s plugin mechanism is honestly underwhelming. Integrating enterprise WeChat alerts, syncing with our internal CMDB, or supporting Spark on K8s all required source code changes. Meanwhile, community updates are slow, and GitHub issues pile up and often go unanswered.

Reflection: Azkaban works fine for small teams with simple workloads. But once the data platform scales up and more teams join, you’ll quickly notice its architectural limitations, and the pain points will keep popping up.

II. Why Choose DolphinScheduler?

By the end of 2022, we began evaluating alternatives — comparing Airflow, XXL-JOB, DolphinScheduler, and other popular schedulers. We ultimately selected DolphinScheduler (hereafter DS), based mainly on:

Rich built-in task types

DS ships with more than a dozen built-in task types, including Shell, SQL, Spark, Flink, DataX, and Python, and it supports custom plugins, so there is no more writing wrapper scripts for everything.

Comprehensive failure-handling mechanism

Supports task-level retry (configurable retry count and interval)
Supports failure alerts (email, DingTalk, enterprise WeChat)
Supports “skip after failure” or “terminate workflow on failure”
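
To make the retry semantics concrete (a retry count plus a fixed interval), here is a minimal Python sketch. It is not DolphinScheduler code, only an illustration of the behavior we configure in the UI. The names run_with_retry, max_retries, and interval_seconds are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(task: Callable[[], T],
                   max_retries: int = 3,
                   interval_seconds: float = 60.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `task` once, then retry up to `max_retries` times on failure.

    Illustrative only: the names and defaults here are assumptions,
    not DolphinScheduler's API.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the failure to the caller
            attempt += 1
            sleep(interval_seconds)  # wait before the next attempt
```

With this behavior, a Hive task that fails because of a temporary resource shortage gets another chance a minute later instead of blocking everything downstream until someone reruns it by hand.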

Fine-grained permission control

Permission management in DS is very detailed. Permissions can be set at the tenant, project, workflow, and even task level, which makes collaboration secure, especially across multiple teams.

Visual DAG + version management

Drag-and-drop DAG editing with dependencies, conditional branches, and subprocesses
Every workflow release automatically saves a version and supports rollback to any historical version

Active Chinese community

As an Apache top-level project, DS has a large user base in China, complete documentation, and quick responses. Several of our production issues were answered in the community within 24 hours.

III. Real Migration Case: From Azkaban to DolphinScheduler

Background

Original system: Azkaban 3.80, around 150 workflows, 800+ daily tasks
Goal: Smooth migration to DS 3.1.2 with no impact on business data output

Migration steps

Task inventory and classification

Perform a full inventory of existing Azkaban jobs. Classify by type (e.g., Shell scripts, Hive SQL, Spark jobs), then focus on identifying strong dependencies and mapping complete upstream-downstream relationships.
Mark strong dependency chains (e.g., A → B → C)
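
One way to find the chains that matter most is to compute, for each job, everything that sits downstream of it. The Python sketch below assumes the inventory has already been reduced to a mapping from job name to its upstream jobs. The function name and the sample inventory are hypothetical.

```python
from collections import defaultdict, deque


def downstream_of(job, upstream_map):
    """Return every job that directly or transitively depends on `job`."""
    # Invert "job -> its upstreams" into "job -> its direct downstreams".
    children = defaultdict(set)
    for name, upstreams in upstream_map.items():
        for up in upstreams:
            children[up].add(name)

    # Breadth-first walk over the downstream graph.
    seen = set()
    queue = deque([job])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical inventory: job name -> jobs it depends on.
inventory = {"A": [], "B": ["A"], "C": ["B"], "report": ["A"]}
print(sorted(downstream_of("A", inventory)))  # ['B', 'C', 'report']
```

Jobs with the largest downstream sets are the ones whose failure blocks the most work, so they are good candidates for the "strong dependency" label.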

DS environment deployment and testing

Deploy DS cluster (Master + Worker + API Server + Alert Server)
Create tenants, users, projects, and configure resource queues (YARN)

Task refactoring and validation

Convert Azkaban’s .job files into DS workflow definitions
Key conversions: parameter passing (Azkaban uses ${}; DS also uses ${} but syntax differs slightly); dependency logic (Azkaban uses dependencies; DS uses DAG edges)
Run full workflows in the test environment and verify data consistency
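
As a rough illustration of the dependency conversion, the sketch below reads Azkaban .job content (plain key=value lines) and turns each job's dependencies property into (upstream, downstream) edges. The DS workflow definition format itself is not shown here. The edge list is only an intermediate representation, and the function names are hypothetical.

```python
def parse_job(text):
    """Parse Azkaban .job content (key=value lines) into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def to_edges(jobs):
    """Map {job_name: job_file_text} to a list of (upstream, downstream) edges."""
    edges = []
    for name, text in jobs.items():
        deps = parse_job(text).get("dependencies", "")
        for dep in deps.split(","):
            if dep.strip():
                edges.append((dep.strip(), name))
    return edges


# Hypothetical job files for illustration.
jobs = {
    "load": "type=command\ncommand=sh load.sh\ndependencies=extract, clean\n",
    "extract": "type=command\ncommand=sh extract.sh\n",
}
print(to_edges(jobs))
```

Each edge then becomes one connection between two task nodes when the workflow is rebuilt in DS.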

Gray release switching

First, migrate non-core report jobs (e.g., operation daily reports)
Observe for one week, then gradually migrate core pipelines (e.g., user behavior ETL)
Eventually switch all over, keeping Azkaban in read-only mode for 1 month for traceability

Pitfalls we encountered

Pitfall 1: Inconsistent parameter passing

In Azkaban, ${date} automatically injects the current date, whereas DS requires explicitly defining global parameters or using built-in system parameters such as ${system.datetime}. We wrote a script to convert the parameter syntax automatically.
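
Our actual script is not reproduced here, but the core idea fits in a few lines of Python. In this sketch, PARAM_MAP is an assumed mapping that you would fill in for your own jobs. Placeholders that are not in the mapping are left untouched so they can be reviewed by hand.

```python
import re

# Assumed mapping from Azkaban parameter names to DS parameter names.
# Extend it per job; verify that each target's date format matches
# what the script expects.
PARAM_MAP = {"date": "system.datetime"}

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def convert_params(text, mapping=PARAM_MAP):
    """Rewrite ${name} placeholders found in `mapping`; leave the rest as-is."""
    def replace(match):
        name = match.group(1)
        return "${" + mapping.get(name, name) + "}"
    return _PLACEHOLDER.sub(replace, text)


print(convert_params("hive -f etl.sql -d dt=${date} -d db=${db}"))
```

Before switching a job over, check that the converted parameter produces the same date format the downstream SQL or shell script relies on.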

Pitfall 2: Resource isolation issues

Previously, everything ran in the same YARN queue. Long-running jobs hogged all the resources and left small jobs waiting in the queue indefinitely. Later, we allocated separate users and queues per business line, which ended the mutual interference.

Pitfall 3: Alert storms

At first, every minor task failure triggered an alert, and hundreds of alerts a day were overwhelming. Later, we tuned the strategy: core jobs alert immediately; non-core ones send daily summaries instead. Much cleaner.
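
The routing rule is simple enough to sketch in Python. This is an illustration of the strategy, not DS alert plugin code. AlertRouter, core_jobs, and the message formats are hypothetical, and send stands for whatever channel you use (email, DingTalk, enterprise WeChat).

```python
from collections import defaultdict


class AlertRouter:
    """Alert immediately for core jobs; batch non-core failures into a summary."""

    def __init__(self, core_jobs, send):
        self.core_jobs = set(core_jobs)
        self.send = send                  # callable that delivers one message
        self.pending = defaultdict(int)   # non-core job -> failure count

    def on_failure(self, job):
        if job in self.core_jobs:
            self.send(f"[CRITICAL] {job} failed")
        else:
            self.pending[job] += 1

    def flush_daily_summary(self):
        """Send one summary message, then reset the counters."""
        if not self.pending:
            return
        lines = [f"{job}: {count} failure(s)"
                 for job, count in sorted(self.pending.items())]
        self.send("[DAILY SUMMARY]\n" + "\n".join(lines))
        self.pending.clear()
```

Calling flush_daily_summary from a once-a-day scheduled job turns hundreds of individual notifications into a single message, while core pipeline failures still page someone right away.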

IV. Practical Suggestions for Future Migrators

Don’t blindly pursue “big and comprehensive”

If you only have a few dozen Shell tasks, Cron + simple monitoring might be more efficient. Scheduling systems incur operational costs — evaluate ROI first.

Take permissions and tenant design seriously

Plan tenant structure from day one (e.g., by business line), or chaos will follow later. Enable workflow approval for key task changes.

Establish workflow health indicators

Task failure rate
Average runtime fluctuation
Dependency blocking frequency

We use Prometheus + Grafana to monitor these and detect risks early.
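
For readers who want a starting point, here is one possible way to compute the three indicators from a list of run records before exporting them to a monitoring system. The record fields (status, seconds, blocked) and the status strings are assumptions for this sketch, not the DS metadata schema. Runtime fluctuation is defined here as the coefficient of variation, which is only one reasonable choice.

```python
from statistics import mean, pstdev


def failure_rate(runs):
    """Fraction of runs whose status is 'FAILURE'."""
    if not runs:
        return 0.0
    return sum(r["status"] == "FAILURE" for r in runs) / len(runs)


def runtime_fluctuation(runs):
    """Coefficient of variation (stddev / mean) of successful run durations."""
    durations = [r["seconds"] for r in runs if r["status"] == "SUCCESS"]
    if len(durations) < 2 or mean(durations) == 0:
        return 0.0
    return pstdev(durations) / mean(durations)


def blocking_frequency(runs):
    """Fraction of runs that had to wait on an unfinished upstream task."""
    if not runs:
        return 0.0
    return sum(bool(r.get("blocked")) for r in runs) / len(runs)


# Hypothetical run records for one workflow.
runs = [
    {"status": "SUCCESS", "seconds": 100, "blocked": False},
    {"status": "SUCCESS", "seconds": 300, "blocked": True},
    {"status": "FAILURE", "seconds": 10, "blocked": False},
    {"status": "SUCCESS", "seconds": 200, "blocked": False},
]
print(failure_rate(runs), blocking_frequency(runs))  # 0.25 0.25
```

Tracking these per workflow rather than globally makes it easier to see which pipeline is drifting.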

Make good use of SubProcess

Complicated DAGs quickly become a messy tangle. Pack reusable logic — e.g., data quality checks, log archiving — into subprocesses. Easier reuse, easier maintenance.

Backup and disaster recovery are mandatory

Regularly back up DS metadata database (MySQL/PostgreSQL)
Configure multi-Master HA
Enable cross-cluster DR for critical workflows (failover if the primary cluster goes down)

V. Quick Comparison Table: Azkaban vs DolphinScheduler

Capability	Azkaban	DolphinScheduler
Task Retry	❌ (Manual Required)	✅ (Configurable)
Fine-grained Permission	❌ (Project-level Only)	✅ (Task-level)
Version Control	❌	✅
Built-in Task Types	Limited (Mainly Shell)	Diverse (Including Spark/Flink)
Community Activity (2025)	Low	✅ High (Apache Project)
Visual DAG	Weak (Only Dependency Graph)	✅

Selecting a tool is not the finish line, but the starting point for continuous optimization.

Why We Migrated from Azkaban to DolphinScheduler