Dmall is a global provider of intelligent retail solutions, supporting the digital transformation of over 430 clients. With rapid business expansion, real-time data synchronization, resource efficiency, and development flexibility have become the three key challenges we must overcome.

Four Stages of Dmall's Data Platform Evolution

Dmall's data platform has undergone four major transformations, always focusing on "faster, more efficient, and more stable."

In building the data platform, we initially used AWS EMR to quickly establish cloud-based big data capabilities, then moved to self-built Hadoop clusters in our IDC, combining open-source cores with self-developed integration, scheduling, and development components to turn heavy assets into reusable lightweight services. As the business demanded lower costs and higher elasticity, the team rebuilt the foundation on storage-compute separation and containerization and introduced Apache SeaTunnel for real-time data lake integration. With Apache Iceberg and Paimon as unified storage formats, we then formed a new lakehouse architecture, providing a stable, low-cost data foundation for AI and completing the transition from cloud adoption to cloud creation, and from offline to real-time.

Storage-Compute Separation Architecture

The storage-compute separation architecture of Dmall UniData (our data IDE) uses Kubernetes as the elastic foundation, with Spark, Flink, and StarRocks scaling on demand. Iceberg + JuiceFS unifies lake storage, Hive Metastore manages cross-cloud metadata, and Ranger provides fine-grained access control. This architecture is vendor-neutral and fully controllable across the entire tech stack.
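As a rough illustration of how these pieces fit together, here is a minimal Spark session sketch, assuming an Iceberg catalog named `lake` backed by Hive Metastore, with table data stored on JuiceFS. The metastore address, JuiceFS metadata URL, database/table names, and paths are placeholders, not Dmall's actual settings.

```java
// Illustrative sketch only: wiring a Spark job (e.g. running on Kubernetes) to an
// Iceberg catalog backed by Hive Metastore, with the warehouse on JuiceFS.
// Addresses, catalog name, and paths below are hypothetical.
import org.apache.spark.sql.SparkSession;

public class LakeSessionExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-on-juicefs-demo")
                // Iceberg SQL extensions and a Hive-Metastore-backed catalog
                .config("spark.sql.extensions",
                        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.lake.type", "hive")
                .config("spark.sql.catalog.lake.uri", "thrift://hive-metastore:9083")
                // Warehouse lives on JuiceFS, accessed through its Hadoop SDK
                .config("spark.sql.catalog.lake.warehouse", "jfs://datalake/warehouse")
                .config("spark.hadoop.fs.jfs.impl", "io.juicefs.JuiceFileSystem")
                .config("spark.hadoop.juicefs.meta", "redis://juicefs-meta:6379/1")
                .getOrCreate();

        // Compute scales independently of storage: any Spark executor scheduled by
        // Kubernetes sees the same catalog and the same lake files.
        spark.sql("SELECT count(*) FROM lake.ods.orders").show();
    }
}
```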

The business benefits are clear: TCO reduced by 40-75%, resource scaling in seconds, and the same IDE framework covering integration, scheduling, modeling, querying, and data services, so delivery is faster, uses fewer resources, and stays secure across multiple clouds.

I. Pain Points of the Old Architecture

Before introducing Apache SeaTunnel, Dmall's data platform offered self-service data synchronization for more than a dozen storage systems such as MySQL, Hive, and ES. It relied on self-developed Spark-based solutions, customized per data source and connected on demand, but it supported only batch processing.

For data import, Dmall's data platform unified ODS data into the data lake, using Apache Iceberg as the lakehouse format, with data available to downstream consumers at hourly latency, ensuring high data reuse and quality.

Previously, we relied on self-developed Spark-based synchronization tools, which were stable but suffered from "slow startup, high resource usage, and poor extensibility."

“It's not that Spark is bad, but it’s too heavy.”

Against the backdrop of cost reduction and efficiency improvement, we re-evaluated the original data integration architecture. While Spark's batch jobs were mature, they were overkill for small and medium-sized data synchronization tasks. Slow startup, high resource consumption, and long development cycles became bottlenecks for the team's efficiency. More importantly, as real-time business demands grew, Spark's batch processing model was becoming unsustainable.

| Dimension | Old Spark Solution | Business Impact |
| --- | --- | --- |
| High resource usage | 2C8G at startup, ~40s of idle time | Unfriendly to small and medium-scale data synchronization |
| High development cost | No abstracted Source/Sink; every pipeline was full-stack development | Higher development and maintenance costs, lower delivery efficiency |
| No real-time sync | Growing real-time incremental synchronization needs | Still relied on developers hand-writing Java/Flink jobs |
| Limited data sources | More private cloud deployments and increasingly diverse data sources | Hard to develop new data sources quickly enough to meet business needs |

That was until we encountered Apache SeaTunnel, and everything started to change.

II. Why SeaTunnel?

“We’re not choosing a tool; we’re choosing the foundation for the next five years of data integration.”

Facing diverse data sources, real-time needs, and resource optimization pressures, we needed a “batch-stream unified, lightweight, efficient, and easily scalable” integration platform. SeaTunnel, with its open-source nature, multi-engine support, rich connectors, and active community, became our final choice. It not only solved Spark’s “heavy” issue but also laid the foundation for lakehouse integration and real-time analytics in the future.

  1. Engine Neutrality: Built-in Zeta, compatible with Spark/Flink, automatically switching based on data volume.
  2. 200+ connectors: Plugin-based; new data sources require only JSON configuration, no Java code.
  3. Batch and stream unified: One configuration supports full, incremental, and CDC.
  4. Active community: GitHub 8.8k stars, 30+ PR merges weekly, with 5 patches we contributed merged within 7 days.

III. New Platform Architecture: Making SeaTunnel "Enterprise-Grade"

“Open-source doesn’t just mean using it as-is, but standing on the shoulders of giants to continue building.”

While SeaTunnel is powerful, truly applying it in enterprise-level scenarios required an "outer shell": unified management, scheduling, permissions, rate limiting, monitoring, and so on. We built a visual, configurable, and extensible data integration platform around SeaTunnel, turning it from an open-source tool into the "core engine" of Dmall's data platform.

3.1 Global Architecture

Using Apache SeaTunnel as the foundation, the platform exposes a unified REST API that external systems such as the Web UI, Merchant Exchange, and MCP services can call with one click. A built-in connector template center lets new storage systems be published in minutes by filling in parameters, with no coding required. The scheduling layer supports mainstream orchestrators such as Apache DolphinScheduler and Airflow. The engine layer intelligently routes jobs to Zeta, Flink, or Spark based on data volume, so small jobs run as lightweight fast tasks and large jobs run with distributed parallelism. The environment is fully cloud-native, supporting K8s, Yarn, and Standalone modes, which makes private cloud delivery easy and ensures "template-as-a-service, engine-switchable, deployment-unbound."
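The engine-routing rule lends itself to a small illustration. The sketch below is hypothetical: the `EngineRouter` class, the row-count threshold, and the streaming shortcut are our own illustrative assumptions, not the platform's actual code. It only shows the shape of the decision described above.

```java
// Hypothetical engine-routing sketch: streaming/CDC jobs go to a streaming engine,
// small batches to the lightweight Zeta engine, very large batches to Spark.
public class EngineRouter {

    enum EngineType { ZETA, FLINK, SPARK }

    /** Pick an execution engine from a simple job profile. */
    static EngineType route(boolean streaming, long estimatedRows) {
        if (streaming) {
            return EngineType.FLINK;            // long-running incremental/CDC sync
        }
        if (estimatedRows <= 50_000_000L) {
            return EngineType.ZETA;             // fast startup, low per-job overhead
        }
        return EngineType.SPARK;                // large batch, distributed parallelism
    }

    public static void main(String[] args) {
        System.out.println(route(false, 1_000_000L));   // ZETA
        System.out.println(route(true, 0L));            // FLINK
        System.out.println(route(false, 500_000_000L)); // SPARK
    }
}
```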

3.2 Data Integration Features

3.3 Integration Features

IV. Secondary Development: Let SeaTunnel Speak "Dmall Dialect"

“No matter how excellent the open-source project, it still can't understand your business 'dialect'.”

SeaTunnel’s plugin mechanism is flexible, but it still requires us to “modify the code” to meet Dmall's custom requirements such as DDH message formats, sharding and merging tables, and dynamic partitioning. Fortunately, SeaTunnel's modular design makes secondary development efficient and controllable. Below are some key modules we've modified, each directly addressing a business pain point.

4.1 Custom DDH-Format CDC

Dmall developed DDH in-house to collect MySQL binlogs and push them to Kafka as Protobuf messages, and we implemented custom CDC support in SeaTunnel for this format.
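The snippet below is only an illustrative sketch of the mapping involved, not Dmall's actual code: `DdhRecord` stands in for the decoded Protobuf message, and the field and operation names are assumptions. It shows the core idea of turning one binlog event into the row-level change events a CDC pipeline expects.

```java
// Hypothetical sketch: map one DDH binlog event to row-level change events.
// DdhRecord is a stand-in for the real Protobuf payload (e.g. parsed via parseFrom(bytes)).
import java.util.List;
import java.util.Map;

public class DdhRecordMapper {

    // Stand-in for the decoded Protobuf message published by DDH (fields are assumed).
    record DdhRecord(String database, String table, String op,
                     Map<String, Object> before, Map<String, Object> after) {}

    enum RowKind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    record RowChange(RowKind kind, Map<String, Object> fields) {}

    /** Map one DDH binlog event to the change events a CDC sink expects. */
    static List<RowChange> toRowChanges(DdhRecord rec) {
        return switch (rec.op()) {
            case "INSERT" -> List.of(new RowChange(RowKind.INSERT, rec.after()));
            case "UPDATE" -> List.of(new RowChange(RowKind.UPDATE_BEFORE, rec.before()),
                                     new RowChange(RowKind.UPDATE_AFTER, rec.after()));
            case "DELETE" -> List.of(new RowChange(RowKind.DELETE, rec.before()));
            default -> List.of(); // DDL and heartbeat events are ignored in this sketch
        };
    }
}
```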

4.2 Router Transform: Multi-Table Merging and Dynamic Partitioning
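As a hypothetical sketch of what such a transform has to do, consider the two steps described earlier: collapse sharded source tables into one merged target table, and derive a partition value from event time. The table-name pattern, regex, and date format below are illustrative assumptions, not the actual implementation.

```java
// Hypothetical Router logic: order_00 ... order_63 all route to "order",
// and the dynamic partition value is derived from the event date.
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.regex.Pattern;

public class RouterTransformSketch {

    private static final Pattern SHARD_SUFFIX = Pattern.compile("_(\\d{2,4})$");
    private static final DateTimeFormatter DT = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** order_17 -> order, user_003 -> user: merge all shards into one target table. */
    static String targetTable(String sourceTable) {
        return SHARD_SUFFIX.matcher(sourceTable).replaceAll("");
    }

    /** Dynamic partition: use the event date, fall back to the processing date. */
    static String partitionValue(LocalDate eventDate) {
        return (eventDate != null ? eventDate : LocalDate.now()).format(DT);
    }

    public static void main(String[] args) {
        System.out.println(targetTable("order_17"));                   // order
        System.out.println(partitionValue(LocalDate.of(2024, 5, 1)));  // 20240501
    }
}
```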

4.3 Hive-Sink Support for Overwrite

The community version only supports append. Based on PR #7843, we modified SeaTunnel to support overwrite writes.
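The sketch below is not the actual PR implementation; it only illustrates the overwrite semantics under assumed paths and helpers: clear the target partition directory before publishing the newly staged files, so the write replaces existing data instead of appending to it.

```java
// Hypothetical overwrite-commit sketch using the Hadoop FileSystem API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteCommitSketch {

    /** Delete everything under the target partition, then move staged files in. */
    static void commitWithOverwrite(Configuration conf, Path stagingDir, Path partitionDir)
            throws IOException {
        FileSystem fs = partitionDir.getFileSystem(conf);
        if (fs.exists(partitionDir)) {
            fs.delete(partitionDir, true);   // overwrite: drop the old partition data
        }
        fs.mkdirs(partitionDir.getParent());
        fs.rename(stagingDir, partitionDir); // publish the newly written files
    }
}
```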

This improvement has been contributed back to the community and is expected to be released in version 2.3.14.

4.4 Other Patches

V. Pitfalls: Our Real-World Challenges

“Every pitfall is a necessary step toward stability.”

No matter how mature an open-source project is, pitfalls are inevitable when deploying it in real business scenarios. During our use of SeaTunnel, we ran into version conflicts, asynchronous schema changes, and consumption lag. Below are some typical "pits" we encountered and how we resolved them.

| Problem | Phenomenon | Root Cause | Solution |
| --- | --- | --- | --- |
| S3 access failure | Spark 3.3.4 conflicts with SeaTunnel's default Hadoop 3.1.4 | Two versions of aws-sdk on the classpath | Exclude Spark's hadoop-client and use SeaTunnel's uber jar |
| StarRocks ALTER blocked | Writes fail with "column not found" | ALTER in StarRocks is asynchronous; clients keep writing and fail | Poll SHOW ALTER TABLE state in the sink and resume writing only after it reaches FINISHED |
| Slow Kafka consumption | Only 3k messages per second | Polling thread sleeps 100 ms on empty polls | Contributed PR #7821, adding a "no sleep on empty poll" mode and raising throughput to 120k/s |
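For the StarRocks case, the workaround boils down to polling the ALTER job state before resuming writes. A minimal JDBC sketch follows; the exact `SHOW ALTER TABLE COLUMN` syntax, result column names, and polling interval are assumptions to be adapted to your StarRocks version.

```java
// Sketch of the workaround above: after a schema change, poll StarRocks until the
// ALTER job reports FINISHED, then let the sink resume writing.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class StarRocksAlterWaiter {

    static void waitForAlterFinished(Connection conn, String table, long timeoutMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        // Assumed syntax: fetch the most recent ALTER job for this table.
        String sql = "SHOW ALTER TABLE COLUMN WHERE TableName = '" + table
                + "' ORDER BY CreateTime DESC LIMIT 1";
        while (System.currentTimeMillis() < deadline) {
            try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
                if (!rs.next() || "FINISHED".equalsIgnoreCase(rs.getString("State"))) {
                    return;                      // safe to resume writing
                }
                if ("CANCELLED".equalsIgnoreCase(rs.getString("State"))) {
                    throw new IllegalStateException("ALTER on " + table + " was cancelled");
                }
            }
            Thread.sleep(1_000);                 // back off before polling again
        }
        throw new IllegalStateException("Timed out waiting for ALTER on " + table);
    }
}
```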

VI. Summary of Benefits: Delivering in Three Months

“Technical value must ultimately be demonstrated with numbers.”

In less than three months of using Apache SeaTunnel, we completed the migration of three merchant production environments. Not only did it “run faster,” but it also “ran cheaper.”

With support for Oracle, cloud storage, Paimon, and StarRocks, we covered all source-side needs, and real-time synchronization is no longer dependent on hand-written Flink code. The template-based, "zero-code" connector integration reduced the development time from several weeks to just 3 days. Resource consumption dropped to only 1/3 of the original Spark solution, with the same data volume running lighter and faster.

With a new UI and on-demand data source permissions, merchant IT teams can now configure tasks and monitor data flows, reducing delivery costs and improving user experience—fulfilling the three key goals of cost reduction, flexibility, and stability.

VII. Next Steps: Lakehouse + AI Dual-Drive

“Data integration is not the end, but the beginning of intelligent analysis.”

Apache SeaTunnel helped us solve the problems of fast and cost-effective data transfer. Next, we need to solve the challenges of accurate and intelligent data transfer. As technologies like Paimon, StarRocks, and LLM mature, we are building a "real-time lakehouse + AI-driven" data platform, enabling data not only to be visible but also to be usable with precision.

In the future, Dmall will write “real-time” and “intelligent” into the next line of code for its data platform:

  1. Lakehouse Upgrade: Fully integrate Paimon + StarRocks, reducing ODS data lake latency from hours to minutes, providing merchants with near-real-time data.
  2. AI Ready: Use MCP services to call LLM to auto-generate synchronization configurations, and introduce vectorized execution engines to create pipelines directly consumable by AI training, enabling "zero-code, intelligent" data integration.
  3. Community Interaction: Track SeaTunnel's main version updates, introduce performance optimizations, and contribute internal improvements as PRs to the community, forming a closed loop of “use-improve-open-source” and continuously amplifying the technical dividend.

VIII. A Message to My Peers

“If you're also struggling with the 'heavy' and 'slow' data synchronization, give SeaTunnel a sprint’s worth of time.”

In just 3 months, we reduced data integration costs to 1/3, improved real-time performance from hourly to minute-level, and compressed development cycles from weeks to days.

SeaTunnel is not a silver bullet, but it is light, fast, and open enough. As long as you're willing to get hands-on, it can become the "new engine" for your data platform.