Keywords: Big Data, Data Scheduling, Workflow, Batch Stop

Introduction

After experimenting with Apache DolphinScheduler and using it in a real project, I ran into a problem: too many workflows got stuck in a “running” state without making progress. Manually stopping them was painfully slow and exhausting. Here’s how I solved it.

Background

The Symptoms

A large number of workflow instances appeared to be “running,” but were not executing anything. They occupied their task group slots and blocked other jobs from running. When too many accumulated, manually stopping them one by one became impractical.

Key Considerations

Before killing tasks in bulk, there were two important factors to consider; both are addressed by the solution below.

Solution

1. Using DolphinScheduler’s API

DolphinScheduler exposes REST APIs for creating, querying, and stopping workflows. By leveraging these APIs, we can automate batch termination instead of relying on manual clicks in the UI.
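
The two endpoints this solution relies on are the ones used in the script below (exact paths can vary slightly between DolphinScheduler versions, so check the API docs for your release):

GET  {BASE_URL}/projects/{projectCode}/process-instances?pageNo=1&pageSize=100&stateType=RUNNING_EXECUTION
POST {BASE_URL}/projects/{projectCode}/executors/execute?processInstanceId={id}&executeType=STOP

Both calls are authenticated with a token header; tokens are created in Security Center -> Token Management.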

2. Python Script Automation

To streamline the process, I wrote a simple Python script.

Script name: dolpschedule-kill.py

# -*- coding: utf-8 -*-
# Note: this environment only has Python 2.7, so the script targets Python 2 rather than Python 3.
import requests

# Base API endpoint
BASE_URL = "http://XXX.XXX.XXX.XXX:12345/dolphinscheduler"

# Project code (can be found via DB query or in Project Management -> Project List)
PROJECT_CODE = "12194663850176"

# Token (created in Security Center -> Token Management)
token = "6bff15e17667d95fdffceda08a19cc6c"

# 1. Fetch running workflows
def get_running_tasks(token, pageNo=1, pageSize=10):
    headers = {"token": token}
    task_list_url = "{0}/projects/{1}/process-instances?pageNo={2}&pageSize={3}&stateType=RUNNING_EXECUTION".format(
        BASE_URL, PROJECT_CODE, pageNo, pageSize)
    resp = requests.get(task_list_url, headers=headers)
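    # The paginated payload lives under data.totalList; collect the workflow instance IDs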
    return [item['id'] for item in resp.json()['data']['totalList']]

# 2. Stop workflows in bulk
def batch_stop_tasks(token, task_ids):
    headers = {"token": token}
    for task_id in task_ids:
        stop_url = "{0}/projects/{1}/executors/execute?processInstanceId={2}&executeType=STOP".format(
            BASE_URL, PROJECT_CODE, task_id)
        resp = requests.post(stop_url, headers=headers)
        print("Task {0} stopped: {1}".format(task_id, resp.status_code))

# Main flow
if __name__ == "__main__":
    # Kill up to 100 tasks per execution
    running_tasks_ids = get_running_tasks(token, pageNo=1, pageSize=100)
    print("Found {0} running tasks".format(len(running_tasks_ids)))
    batch_stop_tasks(token, running_tasks_ids)
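
Note that the script pulls at most one page of 100 running instances per run (pageSize=100). If more workflows are stuck, simply re-run it until it reports that zero running tasks were found.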

3. Running the Script

python dolpschedule-kill.py
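
If everything is configured correctly, the console output follows the script's print statements and looks roughly like this (the instance IDs here are only illustrative):

Found 3 running tasks
Task 3021 stopped: 200
Task 3025 stopped: 200
Task 3037 stopped: 200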

4. Results

Each stop request returned HTTP 200, confirming that the workflow instances were terminated successfully.

Final Outcome

With this script, I was able to batch-kill all of the stuck workflow instances.

That said, sometimes individual task instances (not workflows) remain stuck. These cannot be terminated via the API. In those cases, you’ll need to manually fix them in the backend database. For reference, check out my earlier article: 6 High-Frequency SQL Operation Tips for DolphinScheduler.