Preface: Migrating tens of thousands of data integration jobs (for example, DataX jobs) to Apache SeaTunnel is a tedious task.
To solve this problem, X2SeaTunnel was created. It is a generic configuration conversion tool for transforming configuration files from multiple data integration tools (such as DataX, Sqoop, etc.) into SeaTunnel format, helping users migrate smoothly to the SeaTunnel platform.
Meanwhile, this tool is also a meaningful exercise in AI Coding and Vibe Coding, so in this article I also share insights on using AI to complete product design, architecture, coding, and delivery in a short time.
Currently, X2SeaTunnel is still in its first release; we hope more people will join us to co-build it and share in the benefits.

Data integration scenario brief

For customer-facing scenarios, we built a data integration product based on open-source Apache SeaTunnel + FlinkCDC, using the Flink engine under the hood, to support massive data synchronization to lake-house platforms. Our data sources mainly include various databases, data warehouses, data lakes, as well as Kafka, HTTP and so on. Typical data targets are Hive, Doris, Iceberg, etc.

Thanks to the evolution of the open-source community, we were able to obtain core data integration capability at a relatively high ROI, so we focused R&D on improving integration reliability and usability.

We also gradually contribute bug fixes and scenario-driven features back to the open-source community. For example, Hive overwrite writes, Hive auto-create-table, and other commonly requested features, as well as Flink 1.20 support, X2SeaTunnel, and the SBS (Sampled Balanced Sharding) data sharding algorithm, have already been or are planned to be contributed to the Apache SeaTunnel community.

On the Flink engine layer, we face many scenario-specific issues. Recently, we also solved 2PC reliability problems in SeaTunnel's Flink Streaming mode, including data loss and resume-from-checkpoint issues.

| Feature | Description | Contribution plan |
| --- | --- | --- |
| Hive overwrite import | Required by many customers | Contributed |
| Hive automatic table creation | Quite convenient | Being revised and contributed |
| SeaTunnel on Flink Streaming mode 2PC reliability | SeaTunnel on Flink 1.15+ is unusable in Streaming mode; for example, Hive and Iceberg sinks lose data. The Flink translation module has been reworked at scale for Flink 1.20.0 | Flink 1.20 support is being revised and contributed; there are many related points, and the Flink connector is maintained jointly with the community |
| Sampled Balanced Sharding (SBS) | The first two sharding algorithms in SeaTunnel produce uneven shards on large, skewed datasets, and sharding itself is slow, leading to timeouts and performance degradation. We optimized the algorithm; there is a long story behind it | Can be contributed later if community partners need it |
| X2SeaTunnel | Historical migration from DataX and other tools to SeaTunnel consumes a lot of manpower | Contributed; co-building welcome |
| Assorted small features | JDBC and Iceberg time zone/format issues, Doris small write support, etc. | Contributed |
| ... | ... | ... |

X2SeaTunnel: design, development, and delivery

Let’s get into the main topic — I take this opportunity to summarize the design, development, and delivery of X2SeaTunnel.

Scenario and requirements analysis for X2SeaTunnel

In the AI era — especially entering the Agentic era — code becomes cheap, so thinking about whether a feature is worth doing becomes more important. During X2SeaTunnel’s development, I spent significant energy on thinking about these questions.

Is this a real scenario demand?

As the mindmap above shows, X2SeaTunnel’s scenario comes from migrations and upgrades of data platforms. When migrating and upgrading a data platform to a lake-house unified platform, there are many steps and many details.

Among them, upgrading data integration jobs is particularly painful: many customers built data integration platforms years ago on open-source components such as DataX and Sqoop, and when they migrate to a lake-house platform, the thousands of existing integration jobs often become a project "roadblock". Unlike SQL, for which many open conversion tools exist, DataX and Sqoop jobs differ from company to company, so migrating them to a lake-house platform is labor-intensive. Given this scenario, a tool that upgrades integration jobs would be very valuable.

Who are the target users for this requirement?

Our current target customers are developers or delivery engineers. So there is no need to design a complex UI — a usable CLI is most appropriate.

Can this requirement be standardized?

This requirement has considerable community demand; some people have implemented related tools but didn’t open-source them because standardization is hard. Although you can quickly customize for each customer, it’s difficult to be universally applicable. For example, DataX’s MySQL source writing to Hive sink has many different scenarios and coding patterns; the conversion rules for different situations are hard to reuse.

Therefore, we should not pursue a perfect one-shot conversion. Just like we shouldn’t expect AI to write perfectly correct code on the first try, we design for a “human + tool” hybrid process that supports secondary modifications. A template system is important.

Is this suitable for open-source co-construction?

Since Apache SeaTunnel has many sources and sinks, and the needs of companies vary, one company cannot cover all needs. If we can co-develop under shared conventions, X2SeaTunnel will become more useful through community contributions.

Is this suitable for AI co-development?

X2SeaTunnel is relatively decoupled from production systems, carries little risk, and can be validated quickly, which makes it well suited to AI Coding. So AI participated heavily from architecture design through coding and delivery, and most of the code was AI-generated. (A note on timing: the implementation was done in June 2025. AI evolves monthly; by October 2025, Agent modes can already cover lower-level and more complex requirements.)

After discussing with AI and thinking through the above, I decided to seriously implement X2SeaTunnel using AI Coding and open-source it.

Product design for X2SeaTunnel

Even a small tool needs product thinking. By defining boundaries and simplifying flows, we can balance ROI and standardization.

The core concepts of our tool are:

X2SeaTunnel architecture design

Overall flow

As shown above, the overall logic includes the following steps:

  1. Script invocation and tool trigger: run sh bin/x2seatunnel.sh --config conversion.yaml to launch the X2SeaTunnel jar. The tool starts the conversion process from conversion.yaml (optional) or from CLI parameters.
  2. Jar core initialization: at startup, the jar infers which SeaTunnel connector type the source config (DataX, Sqoop, etc.) should map to, based on the source config and parameters, laying the groundwork for field matching and file conversion.
  3. Rule matching and field filling: traverse the connectors and, using the mapping rules library, extract and fill the corresponding fields from DataX's JSON files. The field and connector matching status is recorded to show what was adapted during conversion.
  4. Conversion output:
     4.1 Config file conversion: fill the templates and generate SeaTunnel-compatible HOCON/JSON files in the target directory.
     4.2 Conversion report output: traverse the source configs to produce a conversion report recording details and matching results, for manual inspection and verification of conversion quality.
  5. Rules iteration: based on real conversion scenarios, continuously improve the mapping rules library to cover more conversion needs and improve X2SeaTunnel's adaptability. Once the rules engine matures, supporting a new source type only requires extending the mapping-rule library (a sketch of such a rule entry follows this list), and with well-summarized prompts, AI models can generate mapping rules quickly.

The entire flow is rule-driven with human verification, helping migrate data sync jobs to Apache SeaTunnel and supporting feature delivery and iteration.
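To make the rules stage concrete, here is a minimal sketch of what a mapping-rule entry could look like: it routes DataX reader/writer plugin names to SeaTunnel connector templates. The structure and key names below are my own illustration in YAML, not the shipped template-mapping.yaml schema; see the templates/ directory in the release package for the real format.

# Illustrative only: hypothetical rule entries mapping DataX plugin names
# to SeaTunnel connector templates. Key names and paths are assumptions.
datax:
  sources:
    mysqlreader: datax/sources/jdbc.conf       # DataX MySQL reader -> SeaTunnel Jdbc source template
    hdfsreader: datax/sources/hdfs-file.conf   # DataX HDFS reader  -> SeaTunnel HdfsFile source template
  sinks:
    hdfswriter: datax/sinks/hdfs-file.conf     # DataX HDFS writer  -> SeaTunnel HdfsFile sink template
    doriswriter: datax/sinks/doris.conf        # DataX Doris writer -> SeaTunnel Doris sink template

Supporting a new source or target type then means adding one more entry plus its template, which is what keeps the extension cost low.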

Key design questions and discussions

During design, I discussed many questions with AI and the community; here are some highlights:

1. Implement in Python or Java?

I initially considered Python because of faster development and less code. But after communicating with the community and considering future use as an SDK, we implemented it in Java, which is easier to distribute as a jar.

2. Can AI replace X2SeaTunnel? Or simply use AI to do conversions directly?

For example, you could hand the source DataX JSON to an LLM and let it do the conversion directly. I believe that even as AI gets stronger, the tool still has value because:

That said, AI is very valuable: X2SeaTunnel was designed and developed with AI. In the future, it can use AI + prompts to quickly generate templates tailored to scenarios.

3. Pull-based conversion as the core implementation idea

This is an implementation detail; possible approaches include:

  1. Object mapping route: Strongly typed, convert via an object model — code-driven.
  2. Declarative mapping (push style): Traverse source and push mappings to the target — config-driven.
  3. Pull-based logic: Traverse target requirements and pull corresponding fields from source — template-driven.

| Feature | Object mapping route | Declarative mapping (push mode) | Pull-based logic (pull mode) |
| --- | --- | --- | --- |
| Basic principle | DataX JSON → DataX object → SeaTunnel object → SeaTunnel JSON | DataX JSON → traverse source keys → map to target keys → SeaTunnel JSON | DataX JSON → traverse required target keys → pull from source → SeaTunnel JSON |
| Type safety | ✅ Strong typing, compile-time checks | ❌ Weak typing, runtime checks | ❌ Weak typing, runtime checks |
| Extension difficulty | ❌ High (an object model per tool; code becomes extremely bloated) | ✅ Low (only add mapping configuration) | ✅ Low (only add templates, but the core framework needs good abstractions) |
| Complex conversions | ✅ Java code handles complex logic | ❌ Hard to handle complex logic | ⚠️ Handled by converters or additional rules |
| Configuration integrity | ⚠️ Depends on the implementation | ❌ May miss target configuration items | ✅ Naturally ensures target configuration integrity |
| Error detection | ✅ At compile time | ❌ Only at runtime | ✅ Mandatory fields checked up front |
| Mapping direction | Source → target (indirect) | Source → target (direct) | Target → source (reverse) |

As an object-oriented programmer, my first instinct was to convert DataX JSON into an intermediate object and then map to the SeaTunnel object model (similar to converting SQL via AST). But that seemed overly complex and unnecessary.

The other consideration was push versus pull mapping: both use a mapping engine, but they work in opposite directions.

I finally chose pull-based mapping as the core, supplemented by some object mapping to handle complex logic. This ensures the completeness of the target config while keeping extensibility and maintainability. If the source is missing fields, the conversion report shows it.
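To illustrate the pull direction, here is a minimal sketch of a pull-style source template: the template enumerates the SeaTunnel target keys and pulls each value out of the DataX JSON by path. It is written in YAML-like pseudo-template form for readability; the placeholder paths follow the standard DataX mysqlreader layout but are illustrative assumptions, not the exact shipped template (the real templates generate SeaTunnel HOCON).

# Illustrative pull-style template for a JDBC source (key names and
# placeholder paths are assumptions, not the shipped template).
# Each SeaTunnel key "pulls" its value from the DataX reader JSON.
source:
  Jdbc:
    url: "{{ datax.job.content[0].reader.parameter.connection[0].jdbcUrl[0] }}"
    driver: "com.mysql.cj.jdbc.Driver"   # constant supplied by the template rather than pulled
    user: "{{ datax.job.content[0].reader.parameter.username }}"
    password: "{{ datax.job.content[0].reader.parameter.password }}"
    query: "{{ datax.job.content[0].reader.parameter.connection[0].querySql[0] }}"

If a pulled path does not exist in the source job, the field shows up as unmatched in the conversion report rather than being silently dropped.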

4. How to satisfy different users’ custom conversion needs?

Use a template system + custom configuration + conversion report to cover diverse needs. In practice, customers can quickly implement customized conversions.

Initially we used a syntax compatible with SeaTunnel's HOCON; later, for greater expressiveness, we switched to a Jinja2-style template syntax (see the documentation for details).

Quick usage & demo for X2SeaTunnel

Documentation: https://github.com/apache/seatunnel-tools/blob/main/X2SeaTunnel/README_zh.md

Follow the official doc step-by-step to get started — sample cases show core usage and are easy to pick up.

Use the Release Package

# Download and unzip the release package
unzip x2seatunnel-*.zip
cd x2seatunnel-*/

Basic Usage

# Standard conversion: Use the default template system with built-in common Sources and Sinks
./bin/x2seatunnel.sh -s examples/source/datax-mysql2hdfs.json -t examples/target/mysql2hdfs-result.conf -r examples/report/mysql

# Custom task: Implement customized conversion requirements through custom templates
# Scenario: MySQL → Hive (DataX has no HiveWriter)
# DataX configuration: MySQL → HDFS Custom task: Convert to MySQL → Hive
./bin/x2seatunnel.sh -s examples/source/datax-mysql2hdfs2hive.json -t examples/target/mysql2hive-result.conf -r examples/report

# YAML configuration method (equivalent to the above command-line parameters)
./bin/x2seatunnel.sh -c examples/yaml/datax-mysql2hdfs2hive.yaml

# Batch conversion mode: Process by directory
./bin/x2seatunnel.sh -d examples/source -o examples/target2 -R examples/report2

# Batch mode supports wildcard filtering
./bin/x2seatunnel.sh -d examples/source -o examples/target3 -R examples/report3 --pattern "*-full.json" --verbose

# View help
./bin/x2seatunnel.sh --help
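For reference, the -c/--config YAML used above bundles the same settings as the command-line flags. The sketch below shows roughly what such a file could contain; the key names mirror the CLI options (source, target, report, template) but are my own illustration, so check examples/yaml/ in the release package for the actual schema.

# Illustrative conversion YAML (key names mirror the CLI options and are
# assumptions; see examples/yaml/ for the real schema).
source: examples/source/datax-mysql2hdfs2hive.json   # DataX job to convert
source-type: datax                                   # currently only datax
target: examples/target/mysql2hive-result.conf       # generated SeaTunnel config
template: templates/datax/custom/mysql2hive.conf     # optional custom template (hypothetical path)
report: examples/report/mysql2hive-report.md         # conversion report output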

The features and directory structure below are straightforward. Note that much of the documentation was AI-authored.

Functional Features

Directory Structure

x2seatunnel/
├── bin/                     # Executable files
│   └── x2seatunnel.sh       # Startup script
├── lib/                     # JAR package files
│   └── x2seatunnel-*.jar    # Core JAR package
├── config/                  # Configuration files
│   └── log4j2.xml           # Log configuration
├── templates/               # Template files
│   ├── template-mapping.yaml # Template mapping configuration
│   ├── report-template.md   # Report template
│   └── datax/               # DataX-related templates
│       ├── custom/          # Custom templates
│       ├── env/             # Environment configuration templates
│       ├── sources/         # Data source templates
│       └── sinks/           # Data target templates
├── examples/                # Examples and tests
│   ├── source/              # Example source files
│   ├── target/              # Generated target files
│   └── report/              # Generated reports
├── logs/                    # Log files
├── LICENSE                  # License
└── README.md                # Usage instructions

Usage Instructions

Basic Syntax

x2seatunnel [OPTIONS]

Command-Line Parameters

| Option | Long option | Description | Required |
| --- | --- | --- | --- |
| -s | --source | Path to the source configuration file | Yes |
| -t | --target | Path to the target configuration file | Yes |
| -st | --source-type | Source configuration type (datax; default: datax) | No |
| -T | --template | Path to a custom template file | No |
| -r | --report | Path to the conversion report file | No |
| -c | --config | Path to a YAML configuration file containing settings such as source, target, report, and template | No |
| -d | --directory | Source directory for batch conversion | No |
| -o | --output-dir | Output directory for batch conversion | No |
| -p | --pattern | File wildcard pattern (comma-separated, e.g. json,xml) | No |
| -R | --report-dir | Report output directory in batch mode; single-file reports and the summary.md summary are written here | No |
| -v | --version | Show version information | No |
| -h | --help | Show help information | No |
|  | --verbose | Enable verbose log output | No |

Here I want to highlight the template system: X2SeaTunnel uses a configuration-driven, DSL-based template system to adapt quickly to different sources and targets. Its core advantages:

Conversion report

After conversion, view the generated Markdown report containing:

For batch conversion, a summary folder with batch reports is generated, containing:

AI Coding practice and reflections

AI4Me: my LLM exploration journey

I have always been passionate about exploring AI and chasing technological waves.

At one point, when DeepSeek V3 and R1 had just been released and hadn't yet gone mainstream, I fell into a fervor, trying to use AI for everything: product, architecture, prototyping, even fortune-telling.

When DeepSeek went mainstream, I felt lost. Information explosion and noisy input made me lose focus and feel overwhelmed.

Later, I learned to do subtraction.

I stopped chasing AI for its own sake and returned to the core: let AI solve my present problems and live in the moment.

This is the correct distance between humans and AI.

X2SeaTunnel Vibe Coding insights

From Vibe Coding to Spec Coding

Anthropic released a guide on Agent context engineering.

The industry has since distilled the lessons of past mistakes into concepts such as Context Engineering and Spec Coding.

Just as writing a good Spark job requires understanding Spark's principles, using AI and Agents well requires understanding their underlying mechanisms. Recommended reading:

When to keep it minimal and when to write Specs?

https://github.com/github/spec-kit

My previous approach aligned with Spec Coding but was rough. Next I will systematically adopt Spec Coding for complex projects.

Spec Coding checklist:

  1. Align requirements: discuss requirements thoroughly with AI, invite AI to be a critic/reviewer to expose gaps.
  2. Produce design: ask AI to output a design/spec (goals, constraints, interfaces, data flow, dependencies, tests & acceptance) that is reviewable and executable.
  3. Iterative implementation: decompose the spec → implement → get fast feedback; human and AI collaborate in small increments. Use Git branching for control to ensure auditability and rollback.

Key Agent capabilities from AI coding tools

Quoted: Software 3.0 — Andrej Karpathy

Recommended: Andrej Karpathy's Software 3.0 (July 2025) is insightful on AI Agents; today's mainstream Agent frameworks and methodologies still fall within his framing.

Agent projects have been most successfully landed in AI Coding. To understand Agents, start with AI Coding tools.

GitHub Copilot team’s Agent product design highlights (Agent capability core):

  1. Context management: the user and the Agent maintain context for multi-turn tasks so the model stays focused.
  2. Multi-call & orchestration (Agentic Orchestration): allows AI to plan task chains and call multiple tools/functions autonomously.
  3. Efficient human-AI interface: GUI raises collaboration efficiency — GUI acts as a “human GPU”.
  4. Generate-verify loop (Gen-Verify Loop): human+AI build a continuous loop — AI generates, humans verify & correct; feedback improves outcomes.

Fully automated “AI autopilot” is far away. For practitioners, that’s fine — human+AI collaboration already releases huge productivity gains.

AI & data platform convergence and opportunity areas

This section may seem audacious, but I’ll share some thoughts that I follow.

AI×data-platform convergence can be categorized into two directions:

1. Data4AI: providing solid data foundations for AI

Data4AI deals with how to make data better support AI. Multi-modal data management and lake-house architectures are key. They provide a unified, efficient, governable foundation for data prep, training, inference, and model ops.

Thus, traditional data platforms are evolving toward AI-oriented capabilities: formats like Lance, FileSet management, the Ray compute framework, and Python-native interfaces all help AI "eat better and run steadier".

2. AI4Data: AI-enhancing data platforms

AI4Data asks how AI can enhance platform efficiency and reliability.

Two sub-areas:

All rely on the Agent methodology: AI acts as an autonomous entity with perception, planning, execution, and reflection; humans provide direction, constraints, and value judgment.

From demo to real-world adoption there is still a "last mile" that humans must fill: understanding the business, defining goals, and controlling boundaries.

The figure above shows a DevAgent prototype I sketched in 2023, inspired by a paper — a “special forces squad” of humans + AI that can collaborate, execute automatically, and continuously learn in complex environments.

Now, that idea is gradually becoming reality. Humans + AI will be standard collaborators.

AI changes daily — what can we do?

AI and Agent progress quickly — we should fully embrace and deeply experience them.

From a team and individual perspective:

As AI amplifies individual efficiency, organizational communication cost may exceed execution value. This expands individual responsibilities while shrinking organizational boundaries. A trend: more “super individuals” emerge.

Not focusing on risks, I find this era interesting. The evolution of AI reminds me of martial arts cultivation: one person, one sword, one path. An individual + AI combination is like forging a personal magic weapon. A good AI tool is a “Heaven Reliant Sword”; a good Spec is like a martial classic. Through practice and iteration, you can accomplish what was once impossible — become stronger and contribute widely.

From an industry perspective:

I fight in the data trenches daily. I ask: Can AI Agents help overcome ToB digital transformation bottlenecks? Can they reduce the landing cost of data platforms?

I think yes.

Quote adapted from Xiaomi startup thinking:

Discover customers’ real pain points and abstract & standardize them; use open source and AI’s momentum to resolve contradictions between customer needs and technical delivery; with reliable, usable, and fairly priced products/services, democratize advanced technology, reduce societal costs, and contribute to national digital transformation.

Start with reliability.

The future is already here — it’s just unevenly distributed

Finally, quoting William Gibson: “The future is already here — it’s just not evenly distributed.” Technological iteration comes with uneven pace and resource allocation, but we can proactively embrace the wave, follow the times, and make tech serve people and social progress.

Above is my sharing — comments and corrections are welcome. I look forward to moving forward together.