Problem Description
DolphinScheduler version 1.3.9
When querying task instance lists and other interfaces, responses are sometimes slow, taking over 20 seconds to return results.
Problem Diagnosis
(1) Enable MySQL Slow Query Log
Add the following configuration parameters to /etc/mysql.cnf
:
slow_query_log = ON
slow_query_log_file = /data/log/mysql/mysql-slow.log
long_query_time = 2
(2) Slow Query Log Analysis
The following types of SQL queries frequently time out in the logs:
select * from t_ds_process_instance
where 1=1 and state in (0, 1, 2, 4) and process_definition_id = 71
and (schedule_time >= '2022-07-20 00:00:00' and schedule_time <= '2022-07-20 23:59:59.999'
or start_time >= '2022-07-20 00:00:00' and start_time <= '2022-07-20 23:59:59.999')
order by start_time desc limit 1;
select instance.*, process.name as process_instance_name from t_ds_task_instance instance
left join t_ds_process_definition define on instance.process_definition_id = define.id
left join t_ds_process_instance process on process.id = instance.process_instance_id
where define.project_id = 6 order by instance.start_time desc LIMIT 0,10;
The main tables involved are t_ds_process_instance
and t_ds_task_instance
.
Using the explain analyze
command, we found that the first type of SQL query consumes time due to using the start_time_index
index, which has poor selectivity. A more selective index is needed to improve query performance:
-> Limit: 1 row(s) (cost=0.20 rows=0) (actual time=103.391..103.391 rows=1 loops=1)
-> Filter: ((t_ds_process_instance.process_definition_id = 18) and (((t_ds_process_instance.schedule_time >= TIMESTAMP'2022-07-18 00:00:00') and (t_ds_process_instance.schedule_time <= TIMESTAMP'2022-07-18 23:59:59.999')) or ((t_ds_process_instance.start_time >= TIMESTAMP'2022-07-18 00:00:00') and (t_ds_process_instance.start_time <= TIMESTAMP'2022-07-18 23:59:59.999')))) (cost=0.20 rows=0) (actual time=103.391..103.391 rows=1 loops=1)
-> Index scan on t_ds_process_instance using start_time_index (reverse) (cost=0.20 rows=2) (actual time=0.030..103.165 rows=1239 loops=1)
The second type of SQL query mainly consumes time during the left join phase, which requires updating the indexed fields:
-> Limit: 10 row(s) (actual time=3601.141..3601.147 rows=10 loops=1)
-> Sort row IDs: `instance`.start_time DESC, limit input to 10 row(s) per chunk (actual time=3601.140..3601.145 rows=10 loops=1)
-> Table scan on <temporary> (cost=0.01..3774.21 rows=301738) (actual time=0.002..190.179 rows=722743 loops=1)
-> Temporary table (cost=469419.96..473194.16 rows=301738) (actual time=3332.896..3551.716 rows=722743 loops=1)
-> Nested loop left join (cost=439246.15 rows=301738) (actual time=0.051..1431.254 rows=722743 loops=1)
-> Nested loop inner join (cost=107334.40 rows=301738) (actual time=0.045..1239.699 rows=722743 loops=1)
-> Filter: (define.id is not null) (cost=12.40 rows=121) (actual time=0.025..0.162 rows=121 loops=1)
-> Index lookup on define using process_definition_index (project_id=6) (cost=12.40 rows=121) (actual time=0.025..0.110 rows=121 loops=1)
-> Index lookup on instance using task_instance_index (process_definition_id=define.id) (cost=639.65 rows=2494) (actual time=0.006..9.972 rows=5973 loops=121)
-> Single-row index lookup on process using PRIMARY (id = `instance`.process_instance_id) (cost=1.00 rows=1) (actual time=0.000..0.000 rows=1 loops=722743)
(3) Problem Identification
-
MySQL Configuration Not Optimized The current MySQL setup uses default parameters. The following parameters significantly impact query performance:
innodb_buffer_pool_size
: The buffer pool used by InnoDB to cache index and raw data.innodb_thread_concurrency
: Set to0
to indicate no thread limit, allowing better use of CPU multi-core performance and improving concurrency.innodb_write_io_threads
&innodb_read_io_threads
: InnoDB uses background threads to process read/write I/O requests on data pages.
These parameters need tuning to improve MySQL performance.
-
Unoptimized DolphinScheduler Indexes The indexes involved in the slow queries need optimization.
-
Large Tables in DolphinScheduler The SQL queries involve large tables (
t_ds_task_instance
), with data size over 4GB and more than 2 million records. These should be split and archived.
Optimization Plan
(1) Large Table Archiving
- Create Backup Tables Create
t_ds_process_instance_bak
using the DDL oft_ds_process_instance
.
Create t_ds_task_instance_bak
using the DDL of t_ds_task_instance
.
Note: Remove foreign key constraints:
CONSTRAINT `foreign_key_instance_id` FOREIGN KEY (`process_instance_id`) REFERENCES `t_ds_process_instance` (`id`) ON DELETE CASCADE ON UPDATE RESTRICT
- Move Historical Data to Backup Tables Move data before the archive date (e.g.,
2022-01-01 00:00:00
) to the backup tables:
INSERT INTO t_ds_process_instance_copy
SELECT * FROM t_ds_process_instance
WHERE start_time < '2022-01-01 00:00:00';
INSERT INTO t_ds_task_instance_copy
SELECT * FROM t_ds_task_instance
WHERE start_time < '2022-01-01 00:00:00';
- Verify Backup Success
SELECT COUNT(*) FROM t_ds_process_instance_copy;
SELECT COUNT(*) FROM t_ds_process_instance
WHERE start_time < '2022-01-01 00:00:00';
SELECT COUNT(*) FROM t_ds_task_instance_copy;
SELECT COUNT(*) FROM t_ds_task_instance
WHERE start_time < '2022-01-01 00:00:00';
- Delete Historical Data
DELETE FROM t_ds_process_instance
WHERE start_time < '2022-01-01 00:00:00';
DELETE FROM t_ds_task_instance
WHERE start_time < '2022-01-01 00:00:00';
(2) Parameter Optimization
Modify the following parameters in the /etc/my.cnf
MySQL configuration file and restart MySQL:
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 20G // Adjust based on server memory. If MySQL is on a dedicated machine, set to 70% of physical memory.
innodb_log_file_size = 1G
innodb_thread_concurrency = 0
join_buffer_size = 64M
sort_buffer_size = 64M
innodb_read_io_threads = 16 // Adjust based on server CPU
innodb_write_io_threads = 16 // Adjust based on server CPU
(3) Index Optimization
CREATE INDEX state_index ON t_ds_process_instance(state, process_definition_id);
CREATE INDEX start_time_process_definition_index ON t_ds_task_instance(start_time, process_definition_id);
ALTER TABLE t_ds_task_instance DROP INDEX task_instance_index;