Preface: Migrating tens of thousands of data integration jobs (for example, DataX jobs) to Apache SeaTunnel is a tedious task.
To solve this problem, X2SeaTunnel was created. It is a generic configuration conversion tool for transforming configuration files from multiple data integration tools (such as DataX, Sqoop, etc.) into SeaTunnel format, helping users migrate smoothly to the SeaTunnel platform.
Meanwhile, this tool is also a meaningful exercise in AI Coding and Vibe Coding, so in this article I also share insights on using AI to complete product design, architecture, coding, and delivery in a short time.
Currently, X2SeaTunnel is still in its first release; we hope more people will join us to co-build it and share in the benefits.

Data integration scenario brief

For customer-facing scenarios, we built a data integration product based on open-source Apache SeaTunnel + FlinkCDC, using the Flink engine under the hood, to support massive data synchronization to lake-house platforms. Our data sources mainly include various databases, data warehouses, data lakes, as well as Kafka, HTTP and so on. Typical data targets are Hive, Doris, Iceberg, etc.

Thanks to the evolution of the open-source community, we were able to obtain core data integration capability at a relatively high ROI, so we focused R&D on improving integration reliability and usability.

We also gradually contribute bug fixes and scenario-driven features back to the open-source community. For example, Hive overwrite writes, Hive auto-create-table, and other commonly requested features, as well as Flink 1.20 support, X2SeaTunnel, and the SBS (Sampled Balanced Sharding) data sharding algorithm, have already been or are planned to be contributed to the Apache SeaTunnel community.

On the Flink engine layer, we face many scenario-specific issues. Recently, we also solved 2PC reliability problems in SeaTunnel's Flink Streaming mode, including data loss and resume-from-checkpoint issues.

| Feature | Description | Contribution plan |
| --- | --- | --- |
| Hive overwrite import | Required by many customers | Contributed |
| Hive automatic table creation | Quite convenient | Being revised and contributed |
| SeaTunnel on Flink Streaming mode 2PC reliability | SeaTunnel on Flink 1.15+ is unusable in Streaming mode; for example, Hive and Iceberg sinks lose data. The Flink translation module has been reworked at scale for Flink 1.20.0 | Flink 1.20 support is being revised and contributed; there are many related points, and the Flink connector is maintained jointly with the community |
| Sampled Balanced Sharding (SBS) | The first two sharding algorithms in SeaTunnel produce uneven shards on large, skewed datasets, and sharding itself is slow, leading to timeouts and performance degradation. We optimized the algorithm; there is a long story behind it | Can be contributed later if community partners need it |
| X2SeaTunnel | Historical migration from DataX and other tools to SeaTunnel consumes a lot of manpower | Contributed; co-building welcome |
| Assorted small features | JDBC and Iceberg time zone/format issues, Doris small write support, etc. | Contributed |
| ... | ... | ... |

X2SeaTunnel: design, development, and delivery

Let’s get into the main topic — I take this opportunity to summarize the design, development, and delivery of X2SeaTunnel.

Scenario and requirements analysis for X2SeaTunnel

In the AI era — especially entering the Agentic era — code becomes cheap, so thinking about whether a feature is worth doing becomes more important. During X2SeaTunnel’s development, I spent significant energy on thinking about these questions.

Is this a real scenario demand?

As the mindmap above shows, X2SeaTunnel’s scenario comes from migrations and upgrades of data platforms. When migrating and upgrading a data platform to a lake-house unified platform, there are many steps and many details.

Among them, upgrading data integration jobs is particularly painful: many customers built data integration platforms years ago on open-source components such as DataX and Sqoop, and when they migrate to a lake-house platform, the thousands of existing integration jobs often become a project "roadblock". Unlike SQL, for which many open conversion tools exist, DataX and Sqoop jobs differ from company to company, so migrating them to a lake-house platform is labor-intensive. Given this scenario, a tool that upgrades integration jobs would be very valuable.

Who are the target users for this requirement?

Our current target customers are developers or delivery engineers. So there is no need to design a complex UI — a usable CLI is most appropriate.

Can this requirement be standardized?

This requirement has considerable community demand; some people have implemented related tools but didn’t open-source them because standardization is hard. Although you can quickly customize for each customer, it’s difficult to be universally applicable. For example, DataX’s MySQL source writing to Hive sink has many different scenarios and coding patterns; the conversion rules for different situations are hard to reuse.

Therefore, we should not pursue a perfect one-shot conversion. Just like we shouldn’t expect AI to write perfectly correct code on the first try, we design for a “human + tool” hybrid process that supports secondary modifications. A template system is important.

Is this suitable for open-source co-construction?

Since Apache SeaTunnel has many sources and sinks, and the needs of companies vary, one company cannot cover all needs. If we can co-develop under shared conventions, X2SeaTunnel will become more useful through community contributions.

Is this suitable for AI co-development?

X2SeaTunnel is relatively decoupled from production systems, carries little risk, and can be validated quickly, which makes it well suited to AI Coding. So AI participated heavily from architecture design through coding and delivery, and most of the code was AI-generated. (A note on timing: the implementation was done in June 2025. AI evolves monthly; by October 2025, Agent modes can already cover lower-level and more complex requirements.)

After discussing with AI and thinking through the above, I decided to seriously implement X2SeaTunnel using AI Coding and open-source it.

Product design for X2SeaTunnel

Even a small tool needs product thinking. By defining boundaries and simplifying flows, we can balance ROI and standardization.

The core concepts of our tool are:

X2SeaTunnel architecture design

Overall flow

As shown above, the overall logic includes the following steps:

  1. Script invocation and tool trigger: run sh bin/x2seatunnel.sh --config conversion.yaml to launch the X2SeaTunnel jar. The tool starts the conversion process from conversion.yaml (optional) or from CLI parameters.
  2. Jar core initialization: at startup, the jar infers which SeaTunnel connector type the source config (DataX, Sqoop, etc.) should map to, based on the source config and parameters, laying the groundwork for field matching and file conversion.
  3. Rule matching and field filling: traverse the connectors and, using the mapping rules library, extract and fill the corresponding fields from DataX's JSON files. The field and connector matching status is recorded to show what was adapted during conversion.
  4. Conversion output:
     4.1 Config file conversion: fill the templates and generate SeaTunnel-compatible HOCON/JSON files in the target directory.
     4.2 Conversion report output: traverse the source configs to produce a conversion report recording details and matching results, for manual inspection and verification of conversion quality.
  5. Rules iteration: based on real conversion scenarios, continuously improve the mapping rules library to cover more conversion needs and improve X2SeaTunnel's adaptability. Once the rules engine matures, supporting a new source type only requires extending the mapping-rule library (a sketch of such a rule entry follows this list), and with well-summarized prompts, AI models can generate mapping rules quickly.

The entire flow is rule-driven with human verification, helping migrate data sync jobs to Apache SeaTunnel and supporting feature delivery and iteration.
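To make the rules stage concrete, here is a minimal sketch of what a mapping-rule entry could look like: it routes DataX reader/writer plugin names to SeaTunnel connector templates. The structure and key names below are my own illustration in YAML, not the shipped template-mapping.yaml schema; see the templates/ directory in the release package for the real format.

# Illustrative only: hypothetical rule entries mapping DataX plugin names
# to SeaTunnel connector templates. Key names and paths are assumptions.
datax:
  sources:
    mysqlreader: datax/sources/jdbc.conf       # DataX MySQL reader -> SeaTunnel Jdbc source template
    hdfsreader: datax/sources/hdfs-file.conf   # DataX HDFS reader  -> SeaTunnel HdfsFile source template
  sinks:
    hdfswriter: datax/sinks/hdfs-file.conf     # DataX HDFS writer  -> SeaTunnel HdfsFile sink template
    doriswriter: datax/sinks/doris.conf        # DataX Doris writer -> SeaTunnel Doris sink template

Supporting a new source or target type then means adding one more entry plus its template, which is what keeps the extension cost low.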

Key design questions and discussions

During design, I discussed many questions with AI and the community; here are some highlights:

1. Implement in Python or Java?

I initially considered Python because of faster development and less code. But after communicating with the community and considering future use as an SDK, we implemented it in Java, which is easier to distribute as a jar.

2. Can AI replace X2SeaTunnel? Or simply use AI to do conversions directly?

For example, you could hand the source DataX JSON to an LLM and let it do the conversion directly. I believe that even as AI gets stronger, the tool still has value because:

That said, AI is very valuable: X2SeaTunnel was designed and developed with AI. In the future, it can use AI + prompts to quickly generate templates tailored to scenarios.

3. Pull-based conversion as the core implementation idea

This is an implementation detail; possible approaches include:

  1. Object mapping route: Strongly typed, convert via an object model — code-driven.
  2. Declarative mapping (push style): Traverse source and push mappings to the target — config-driven.
  3. Pull-based logic: Traverse target requirements and pull corresponding fields from source — template-driven.

| Feature | Object mapping route | Declarative mapping (push mode) | Pull-based logic (pull mode) |
| --- | --- | --- | --- |
| Basic principle | DataX JSON → DataX object → SeaTunnel object → SeaTunnel JSON | DataX JSON → traverse source keys → map to target keys → SeaTunnel JSON | DataX JSON → traverse required target keys → pull from source → SeaTunnel JSON |
| Type safety | ✅ Strong typing, compile-time checks | ❌ Weak typing, runtime checks | ❌ Weak typing, runtime checks |
| Extension difficulty | ❌ High (an object model per tool; code becomes extremely bloated) | ✅ Low (only add mapping configuration) | ✅ Low (only add templates, but the core framework needs good abstractions) |
| Complex conversions | ✅ Java code handles complex logic | ❌ Hard to handle complex logic | ⚠️ Handled by converters or additional rules |
| Configuration integrity | ⚠️ Depends on the implementation | ❌ May miss target configuration items | ✅ Naturally ensures target configuration integrity |
| Error detection | ✅ At compile time | ❌ Only at runtime | ✅ Mandatory fields checked up front |
| Mapping direction | Source → target (indirect) | Source → target (direct) | Target → source (reverse) |

As an object-oriented programmer, my first instinct was to convert DataX JSON into an intermediate object and then map to the SeaTunnel object model (similar to converting SQL via AST). But that seemed overly complex and unnecessary.

The other consideration was push versus pull mapping: both use a mapping engine, but they work in opposite directions.

I finally chose pull-based mapping as the core, supplemented by some object mapping to handle complex logic. This ensures the completeness of the target config while keeping extensibility and maintainability. If the source is missing fields, the conversion report shows it.
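To illustrate the pull direction, here is a minimal sketch of a pull-style source template: the template enumerates the SeaTunnel target keys and pulls each value out of the DataX JSON by path. It is written in YAML-like pseudo-template form for readability; the placeholder paths follow the standard DataX mysqlreader layout but are illustrative assumptions, not the exact shipped template (the real templates generate SeaTunnel HOCON).

# Illustrative pull-style template for a JDBC source (key names and
# placeholder paths are assumptions, not the shipped template).
# Each SeaTunnel key "pulls" its value from the DataX reader JSON.
source:
  Jdbc:
    url: "{{ datax.job.content[0].reader.parameter.connection[0].jdbcUrl[0] }}"
    driver: "com.mysql.cj.jdbc.Driver"   # constant supplied by the template rather than pulled
    user: "{{ datax.job.content[0].reader.parameter.username }}"
    password: "{{ datax.job.content[0].reader.parameter.password }}"
    query: "{{ datax.job.content[0].reader.parameter.connection[0].querySql[0] }}"

If a pulled path does not exist in the source job, the field shows up as unmatched in the conversion report rather than being silently dropped.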

4. How to satisfy different users’ custom conversion needs?

Use a template system + custom configuration + conversion report to cover diverse needs. In practice, customers can quickly implement customized conversions.

Initially we used a syntax compatible with SeaTunnel's HOCON; later, for greater expressiveness, we switched to a Jinja2-style template syntax (see the documentation for details).

Quick usage & demo for X2SeaTunnel

Documentation: https://github.com/apache/seatunnel-tools/blob/main/X2SeaTunnel/README_zh.md

Follow the official doc step-by-step to get started — sample cases show core usage and are easy to pick up.

Use the Release Package

# Download and unzip the release package
unzip x2seatunnel-*.zip
cd x2seatunnel-*/

Basic Usage

# Standard conversion: Use the default template system with built-in common Sources and Sinks
./bin/x2seatunnel.sh -s examples/source/datax-mysql2hdfs.json -t examples/target/mysql2hdfs-result.conf -r examples/report/mysql

# Custom task: Implement customized conversion requirements through custom templates
# Scenario: MySQL → Hive (DataX has no HiveWriter)
# DataX configuration: MySQL → HDFS Custom task: Convert to MySQL → Hive
./bin/x2seatunnel.sh -s examples/source/datax-mysql2hdfs2hive.json -t examples/target/mysql2hive-result.conf -r examples/report

# YAML configuration method (equivalent to the above command-line parameters)
./bin/x2seatunnel.sh -c examples/yaml/datax-mysql2hdfs2hive.yaml

# Batch conversion mode: Process by directory
./bin/x2seatunnel.sh -d examples/source -o examples/target2 -R examples/report2

# Batch mode supports wildcard filtering
./bin/x2seatunnel.sh -d examples/source -o examples/target3 -R examples/report3 --pattern "*-full.json" --verbose

# View help
./bin/x2seatunnel.sh --help
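For reference, the -c/--config YAML used above bundles the same settings as the command-line flags. The sketch below shows roughly what such a file could contain; the key names mirror the CLI options (source, target, report, template) but are my own illustration, so check examples/yaml/ in the release package for the actual schema.

# Illustrative conversion YAML (key names mirror the CLI options and are
# assumptions; see examples/yaml/ for the real schema).
source: examples/source/datax-mysql2hdfs2hive.json   # DataX job to convert
source-type: datax                                   # currently only datax
target: examples/target/mysql2hive-result.conf       # generated SeaTunnel config
template: templates/datax/custom/mysql2hive.conf     # optional custom template (hypothetical path)
report: examples/report/mysql2hive-report.md         # conversion report output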

The features and directory structure below are straightforward. Note that much of the documentation was AI-authored.

Functional Features

Directory Structure

x2seatunnel/
├── bin/                     # Executable files
│   └── x2seatunnel.sh       # Startup script
├── lib/                     # JAR package files
│   └── x2seatunnel-*.jar    # Core JAR package
├── config/                  # Configuration files
│   └── log4j2.xml           # Log configuration
├── templates/               # Template files
│   ├── template-mapping.yaml # Template mapping configuration
│   ├── report-template.md   # Report template
│   └── datax/               # DataX-related templates
│       ├── custom/          # Custom templates
│       ├── env/             # Environment configuration templates
│       ├── sources/         # Data source templates
│       └── sinks/           # Data target templates
├── examples/                # Examples and tests
│   ├── source/              # Example source files
│   ├── target/              # Generated target files
│   └── report/              # Generated reports
├── logs/                    # Log files
├── LICENSE                  # License
└── README.md                # Usage instructions

Usage Instructions

Basic Syntax

x2seatunnel [OPTIONS]

Command-Line Parameters

| Option | Long option | Description | Required |
| --- | --- | --- | --- |
| -s | --source | Path to the source configuration file | Yes |
| -t | --target | Path to the target configuration file | Yes |
| -st | --source-type | Source configuration type (datax; default: datax) | No |
| -T | --template | Path to a custom template file | No |
| -r | --report | Path to the conversion report file | No |
| -c | --config | Path to a YAML configuration file containing settings such as source, target, report, and template | No |
| -d | --directory | Source directory for batch conversion | No |
| -o | --output-dir | Output directory for batch conversion | No |
| -p | --pattern | File wildcard pattern (comma-separated, e.g. json,xml) | No |
| -R | --report-dir | Report output directory in batch mode; single-file reports and the summary.md summary are written here | No |
| -v | --version | Show version information | No |
| -h | --help | Show help information | No |
|  | --verbose | Enable verbose log output | No |

Here I want to highlight the template system: X2SeaTunnel uses a configuration-driven, DSL-based template system to adapt quickly to different sources and targets. Its core advantages:

Conversion report

After conversion, view the generated Markdown report containing:

For batch conversion, a summary folder with batch reports is generated, containing:

AI Coding practice and reflections

AI4Me: my LLM exploration journey

I have always been passionate about exploring AI and chasing technological waves.

At one point, when DeepSeek V3 and R1 had just been released and hadn't yet gone mainstream, I fell into a fervor, trying to use AI for everything: product, architecture, prototyping, even fortune-telling.

When DeepSeek went mainstream, I felt lost. Information explosion and noisy input made me lose focus and feel overwhelmed.

Later, I learned to do subtraction.

I stopped chasing AI for its own sake and returned to the core: let AI solve my present problems and live in the moment.

This is the correct distance between humans and AI.

X2SeaTunnel Vibe Coding insights

From Vibe Coding to Spec Coding

Anthropic released a guide on Agent context engineering.

The industry has since distilled the lessons of past mistakes into concepts such as Context Engineering and Spec Coding.

Just as writing a good Spark job requires understanding Spark's principles, using AI and Agents well requires understanding their underlying mechanisms. Recommended reading:

When to keep it minimal and when to write Specs?

https://github.com/github/spec-kit

My previous approach aligned with Spec Coding but was rough. Next I will systematically adopt Spec Coding for complex projects.

Spec Coding checklist:

  1. Align requirements: discuss requirements thoroughly with AI, invite AI to be a critic/reviewer to expose gaps.
  2. Produce design: ask AI to output a design/spec (goals, constraints, interfaces, data flow, dependencies, tests & acceptance) that is reviewable and executable.
  3. Iterative implementation: decompose the spec → implement → get fast feedback; human and AI collaborate in small increments. Use Git branching for control to ensure auditability and rollback.

Key Agent capabilities from AI coding tools

Quoted: Software 3.0 — Andrej Karpathy

Recommended: Andrej Karpathy's Software 3.0 (July 2025) is insightful on AI Agents; today's mainstream Agent frameworks and methodologies still fall within his framing.

Agent projects have been most successfully landed in AI Coding. To understand Agents, start with AI Coding tools.

GitHub Copilot team’s Agent product design highlights (Agent capability core):

  1. Context management: the user and the Agent maintain context for multi-turn tasks so the model stays focused.
  2. Multi-call & orchestration (Agentic Orchestration): allows AI to plan task chains and call multiple tools/functions autonomously.
  3. Efficient human-AI interface: GUI raises collaboration efficiency — GUI acts as a “human GPU”.
  4. Generate-verify loop (Gen-Verify Loop): human+AI build a continuous loop — AI generates, humans verify & correct; feedback improves outcomes.

Fully automated “AI autopilot” is far away. For practitioners, that’s fine — human+AI collaboration already releases huge productivity gains.

AI & data platform convergence and opportunity areas

This section may seem audacious, but I’ll share some thoughts that I follow.

AI×data-platform convergence can be categorized into two directions:

1. Data4AI: providing solid data foundations for AI

Data4AI deals with how to make data better support AI. Multi-modal data management and lake-house architectures are key. They provide a unified, efficient, governable foundation for data prep, training, inference, and model ops.

Thus, traditional data platforms are evolving toward AI-oriented capabilities: formats like Lance, FileSet management, the Ray compute framework, and Python-native interfaces all help AI "eat better and run steadier".

2. AI4Data: AI-enhancing data platforms

AI4Data asks how AI can enhance platform efficiency and reliability.

Two sub-areas:

All rely on the Agent methodology: AI acts as an autonomous entity with perception, planning, execution, and reflection; humans provide direction, constraints, and value judgment.

From demo to real-world adoption there is still a "last mile" that humans must fill: understanding the business, defining goals, and controlling boundaries.

The figure above shows a DevAgent prototype I sketched in 2023, inspired by a paper — a “special forces squad” of humans + AI that can collaborate, execute automatically, and continuously learn in complex environments.

Now, that idea is gradually becoming reality. Humans + AI will be standard collaborators.

AI changes daily — what can we do?

AI and Agent progress quickly — we should fully embrace and deeply experience them.

From a team and individual perspective:

As AI amplifies individual efficiency, organizational communication cost may exceed execution value. This expands individual responsibilities while shrinking organizational boundaries. A trend: more “super individuals” emerge.

Not focusing on risks, I find this era interesting. The evolution of AI reminds me of martial arts cultivation: one person, one sword, one path. An individual + AI combination is like forging a personal magic weapon. A good AI tool is a “Heaven Reliant Sword”; a good Spec is like a martial classic. Through practice and iteration, you can accomplish what was once impossible — become stronger and contribute widely.

From an industry perspective:

I fight in the data trenches daily. I ask: Can AI Agents help overcome ToB digital transformation bottlenecks? Can they reduce the landing cost of data platforms?

I think yes.

Quote adapted from Xiaomi startup thinking:

Discover customers’ real pain points and abstract & standardize them; use open source and AI’s momentum to resolve contradictions between customer needs and technical delivery; with reliable, usable, and fairly priced products/services, democratize advanced technology, reduce societal costs, and contribute to national digital transformation.

Start with reliability.

The future is already here — it’s just unevenly distributed

Finally, quoting William Gibson: “The future is already here — it’s just not evenly distributed.” Technological iteration comes with uneven pace and resource allocation, but we can proactively embrace the wave, follow the times, and make tech serve people and social progress.

Above is my sharing — comments and corrections are welcome. I look forward to moving forward together.