Problem Description

DolphinScheduler version 1.3.9

When querying task instance lists and other interfaces, responses are sometimes slow, taking over 20 seconds to return results.

Problem Diagnosis

(1) Enable MySQL Slow Query Log

Add the following configuration parameters to /etc/mysql.cnf:

slow_query_log = ON 
slow_query_log_file = /data/log/mysql/mysql-slow.log 
long_query_time = 2

(2) Slow Query Log Analysis

The following types of SQL queries frequently time out in the logs:

select * from t_ds_process_instance 
where 1=1 and state in (0, 1, 2, 4) and process_definition_id = 71 
and (schedule_time >= '2022-07-20 00:00:00' and schedule_time <= '2022-07-20 23:59:59.999' 
or start_time >= '2022-07-20 00:00:00' and start_time <= '2022-07-20 23:59:59.999') 
order by start_time desc limit 1;
select instance.*, process.name as process_instance_name from t_ds_task_instance instance 
left join t_ds_process_definition define on instance.process_definition_id = define.id 
left join t_ds_process_instance process on process.id = instance.process_instance_id 
where define.project_id = 6 order by instance.start_time desc LIMIT 0,10;

The main tables involved are t_ds_process_instance and t_ds_task_instance.

Using the explain analyze command, we found that the first type of SQL query consumes time due to using the start_time_index index, which has poor selectivity. A more selective index is needed to improve query performance:

-> Limit: 1 row(s) (cost=0.20 rows=0) (actual time=103.391..103.391 rows=1 loops=1) 
    -> Filter: ((t_ds_process_instance.process_definition_id = 18) and (((t_ds_process_instance.schedule_time >= TIMESTAMP'2022-07-18 00:00:00') and (t_ds_process_instance.schedule_time <= TIMESTAMP'2022-07-18 23:59:59.999')) or ((t_ds_process_instance.start_time >= TIMESTAMP'2022-07-18 00:00:00') and (t_ds_process_instance.start_time <= TIMESTAMP'2022-07-18 23:59:59.999')))) (cost=0.20 rows=0) (actual time=103.391..103.391 rows=1 loops=1) 
        -> Index scan on t_ds_process_instance using start_time_index (reverse) (cost=0.20 rows=2) (actual time=0.030..103.165 rows=1239 loops=1)

The second type of SQL query mainly consumes time during the left join phase, which requires updating the indexed fields:

-> Limit: 10 row(s) (actual time=3601.141..3601.147 rows=10 loops=1) 
    -> Sort row IDs: `instance`.start_time DESC, limit input to 10 row(s) per chunk (actual time=3601.140..3601.145 rows=10 loops=1) 
        -> Table scan on <temporary> (cost=0.01..3774.21 rows=301738) (actual time=0.002..190.179 rows=722743 loops=1) 
            -> Temporary table (cost=469419.96..473194.16 rows=301738) (actual time=3332.896..3551.716 rows=722743 loops=1) 
                -> Nested loop left join (cost=439246.15 rows=301738) (actual time=0.051..1431.254 rows=722743 loops=1) 
                    -> Nested loop inner join (cost=107334.40 rows=301738) (actual time=0.045..1239.699 rows=722743 loops=1) 
                        -> Filter: (define.id is not null) (cost=12.40 rows=121) (actual time=0.025..0.162 rows=121 loops=1) 
                            -> Index lookup on define using process_definition_index (project_id=6) (cost=12.40 rows=121) (actual time=0.025..0.110 rows=121 loops=1) 
                        -> Index lookup on instance using task_instance_index (process_definition_id=define.id) (cost=639.65 rows=2494) (actual time=0.006..9.972 rows=5973 loops=121) 
                    -> Single-row index lookup on process using PRIMARY (id = `instance`.process_instance_id) (cost=1.00 rows=1) (actual time=0.000..0.000 rows=1 loops=722743)

(3) Problem Identification

Optimization Plan

(1) Large Table Archiving

Create t_ds_task_instance_bak using the DDL of t_ds_task_instance.

Note: Remove foreign key constraints:

CONSTRAINT `foreign_key_instance_id` FOREIGN KEY (`process_instance_id`) REFERENCES `t_ds_process_instance` (`id`) ON DELETE CASCADE ON UPDATE RESTRICT
INSERT INTO t_ds_process_instance_copy 
SELECT * FROM t_ds_process_instance 
WHERE start_time < '2022-01-01 00:00:00';
INSERT INTO t_ds_task_instance_copy 
SELECT * FROM t_ds_task_instance 
WHERE start_time < '2022-01-01 00:00:00';
SELECT COUNT(*) FROM t_ds_process_instance_copy;

SELECT COUNT(*) FROM t_ds_process_instance 
WHERE start_time < '2022-01-01 00:00:00';

SELECT COUNT(*) FROM t_ds_task_instance_copy;

SELECT COUNT(*) FROM t_ds_task_instance 
WHERE start_time < '2022-01-01 00:00:00';
DELETE FROM t_ds_process_instance 
WHERE start_time < '2022-01-01 00:00:00';

DELETE FROM t_ds_task_instance 
WHERE start_time < '2022-01-01 00:00:00';

(2) Parameter Optimization

Modify the following parameters in the /etc/my.cnf MySQL configuration file and restart MySQL:

innodb_log_buffer_size = 64M

innodb_buffer_pool_size = 20G  // Adjust based on server memory. If MySQL is on a dedicated machine, set to 70% of physical memory.

innodb_log_file_size = 1G

innodb_thread_concurrency = 0

join_buffer_size = 64M

sort_buffer_size = 64M

innodb_read_io_threads = 16  // Adjust based on server CPU

innodb_write_io_threads = 16  // Adjust based on server CPU

(3) Index Optimization

CREATE INDEX state_index ON t_ds_process_instance(state, process_definition_id);

CREATE INDEX start_time_process_definition_index ON t_ds_task_instance(start_time, process_definition_id);

ALTER TABLE t_ds_task_instance DROP INDEX task_instance_index;