When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters, especially if the Oracle database uses an ASCII character set. This article walks through why this happens and how to fix it.
🧠 Root Cause
The issue stems from how SeaTunnel reads data from Oracle. If Oracle is using a character set like ASCII, and you're syncing to Doris (which expects proper UTF-8 or other compatible encodings), Chinese characters can become unreadable.
The key is to intercept and re-encode the data when it is read from the Oracle ResultSet.
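To make the failure concrete, here is a small standalone sketch (not SeaTunnel code). It assumes the Oracle column actually holds GBK-encoded bytes, which is the case the fix later in this article targets. The class and method names are made up for illustration, and decoding as US-ASCII is only a stand-in for whatever wrong decoding happens on the real read path.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledTextDemo {

    // GBK bytes for the two characters U+4E2D U+6587.
    static final byte[] GBK_BYTES = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Stand-in for the broken path: the bytes are decoded with the wrong charset.
    static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // The intended path: decode the raw bytes with the charset they were written in.
    static String decodeAsGbk(byte[] raw) {
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        System.out.println("as ASCII: " + decodeAsAscii(GBK_BYTES));
        System.out.println("as GBK:   " + decodeAsGbk(GBK_BYTES));
    }
}
```

The same four bytes produce readable text only when decoded as GBK, which is why the fix works on the raw bytes rather than on an already-decoded String.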
🔍 Understanding the SeaTunnel Reading Flow
Let’s look at the SeaTunnel internals that handle JDBC data ingestion:
1. JdbcSourceFactory
This class:
- Loads your source configurations.
- Constructs JdbcSourceConfig and JdbcDialect.
- Creates a JdbcSource instance.
2. JdbcSource
This:
- Initializes a SourceSplitEnumerator to split the tasks.
- Creates a JdbcSourceReader to execute them.
3. JdbcSourceReader
Responsible for:
- Building the JdbcInputFormat.
- Repeatedly calling the pollNext() method to fetch data.
4. pollNext() Method
This method:
- Calls open() in JdbcInputFormat to prepare the PreparedStatement and ResultSet.
- Then calls nextRecord() to process the ResultSet and convert it to a SeaTunnelRow.
5. nextRecord() and the Encoding Problem
In JdbcInputFormat:
- The nextRecord() method calls toInternal() in JdbcRowConverter.
- The default implementation uses JdbcFieldTypeUtils.getString(rs, resultSetIndex).
💥 Problem: If the database holds Chinese text in a column whose declared character set is ASCII, the string returned by this method is decoded with the wrong charset and comes back garbled.
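The call chain described in this section can be modeled with a toy version. This is a simplified sketch, not the real SeaTunnel classes: InputFormat, Reader, and RowConverter below are hypothetical stand-ins for JdbcInputFormat, JdbcSourceReader, and JdbcRowConverter, and an Iterator plays the role of the ResultSet.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

public class ReadFlowSketch {

    // Stand-in for JdbcRowConverter: turns one raw result row into an output row.
    interface RowConverter {
        Object[] toInternal(String[] rawRow);
    }

    // Stand-in for JdbcInputFormat: open() prepares a cursor, nextRecord() converts one row.
    static class InputFormat {
        private final RowConverter converter;
        private Iterator<String[]> cursor; // plays the role of the ResultSet

        InputFormat(RowConverter converter) {
            this.converter = converter;
        }

        void open(List<String[]> splitRows) {
            cursor = splitRows.iterator();
        }

        Object[] nextRecord() {
            return cursor.hasNext() ? converter.toInternal(cursor.next()) : null;
        }
    }

    // Stand-in for JdbcSourceReader: pollNext() drains one split into the output list.
    static class Reader {
        private final Queue<List<String[]>> splits = new ArrayDeque<>();
        private final InputFormat format;

        Reader(InputFormat format) {
            this.format = format;
        }

        void addSplit(List<String[]> split) {
            splits.add(split);
        }

        void pollNext(List<Object[]> output) {
            List<String[]> split = splits.poll();
            if (split == null) {
                return; // nothing left to read
            }
            format.open(split);
            Object[] row;
            while ((row = format.nextRecord()) != null) {
                output.add(row);
            }
        }
    }

    // Wires the pieces together with a converter that copies each row unchanged.
    static List<Object[]> run(List<String[]> rows) {
        Reader reader = new Reader(new InputFormat(raw -> raw.clone()));
        reader.addSplit(rows);
        List<Object[]> out = new ArrayList<>();
        reader.pollNext(out);
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(new String[] {"1", "a"}, new String[] {"2", "b"});
        for (Object[] row : run(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The point of the model is that every row passes through the converter's toInternal() call, so that is the single place where a charset fix needs to be applied.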
✅ Solution Strategy
We need to detect the source encoding and re-encode the data at the moment it's retrieved from the ResultSet.
Here’s how to do it:
🛠 Implementation Steps
Step 1: Add Charset Parameters
In JdbcInputFormat, add:
private final Map<String, String> params;
In the constructor:
public JdbcInputFormat(JdbcSourceConfig config, Map<TablePath, CatalogTable> tables) {
    this.jdbcDialect =
            JdbcDialectLoader.load(
                    config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
    this.chunkSplitter = ChunkSplitter.create(config);
    this.jdbcRowConverter = jdbcDialect.getRowConverter();
    this.tables = tables;
    this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
}
Step 2: Pass params to the Row Converter
In the nextRecord() method of JdbcInputFormat, update the method call to:
SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);
Step 3: Add Encoding Method
In AbstractJdbcRowConverter, define:
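// Decodes the raw column bytes with the given charset name (for example "GBK").
public static String convertCharset(byte[] value, String charSet) {
    // Null or empty input is returned as null.
    if (value == null || value.length == 0) {
        return null;
    }
    // Logs every value at INFO level; useful while debugging, noisy on large tables.
    log.info("Value bytes: {}", Arrays.toString(value));
    try {
        return new String(value, charSet);
    } catch (UnsupportedEncodingException e) {
        // Thrown when charSet is not a charset name the JVM supports.
        throw new RuntimeException(e);
    }
}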
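The helper can be tried outside SeaTunnel before you rebuild the connector. The sketch below is a standalone variant with the logging call removed so that it compiles with only the JDK. The class name CharsetHelperDemo is hypothetical, and the sample bytes are the GBK encoding of U+4E2D U+6587.

```java
import java.io.UnsupportedEncodingException;

public class CharsetHelperDemo {

    // Same logic as convertCharset above, minus the logger.
    public static String convertCharset(byte[] value, String charSet) {
        if (value == null || value.length == 0) {
            return null;
        }
        try {
            return new String(value, charSet);
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(convertCharset(gbk, "GBK"));         // decoded text
        System.out.println(convertCharset(new byte[0], "GBK")); // null for empty input
    }
}
```

Note that empty byte arrays come back as null rather than as an empty string, so downstream code should be prepared for null values in string columns.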
Step 4: Modify toInternal() for String Types
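In AbstractJdbcRowConverter, add a Map<String, String> params parameter to toInternal() (and to its declaration in the JdbcRowConverter interface, so that the call from Step 2 compiles), then update the STRING type handling like so: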
case STRING:
    if (params == null || params.isEmpty()) {
        fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
    } else {
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            fields[fieldIndex] =
                    convertCharset(
                            JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
        } else {
            fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
        }
    }
    break;
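The branching in this step can be exercised without a database. In the sketch below, readString is a hypothetical helper and the two suppliers stand in for JdbcFieldTypeUtils.getString(...) and JdbcFieldTypeUtils.getBytes(...). It is meant to mirror the decision logic of the STRING case, not to replace it.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

public class StringFieldSketch {

    // GBK bytes for U+4E2D U+6587, used as sample column content.
    static final byte[] GBK_BYTES = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Hypothetical helper mirroring the STRING case:
    // only sourceCharset=GBK (any letter case) switches to byte-level decoding.
    static String readString(
            Map<String, String> params, Supplier<String> asString, Supplier<byte[]> asBytes) {
        String sourceCharset = (params == null) ? null : params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            byte[] raw = asBytes.get();
            if (raw == null || raw.length == 0) {
                return null;
            }
            return new String(raw, Charset.forName(sourceCharset));
        }
        // No params, no sourceCharset, or any other charset: keep the default path.
        return asString.get();
    }

    public static void main(String[] args) {
        Map<String, String> gbk = Collections.singletonMap("sourceCharset", "GBK");
        System.out.println(readString(gbk, () -> "????", () -> GBK_BYTES));
        System.out.println(readString(null, () -> "plain", () -> null));
    }
}
```

Because the check is hard-coded to GBK, any other sourceCharset value silently falls back to getString(). If your data uses a different encoding, that condition is the place to extend.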
Step 5: Rebuild and Deploy
After making the above changes:
- Rebuild the connector-jdbc module.
- Replace the existing connector-jdbc-2.3.9.jar under SeaTunnel's connectors directory.
- Restart the SeaTunnel cluster.
🧾 Configuration Tips
- If your Oracle database does not have encoding issues, you don’t need to pass the sourceCharset property.
- If needed, pass it as a JDBC connection property of the source (Step 1 reads it from getProperties()), like this:
sourceCharset=GBK
- To inspect the log output from connector-jdbc, check the worker logs in the SeaTunnel logs directory.
✅ Summary
By adding a simple charset-switching mechanism and tweaking the JDBC source implementation, you can eliminate garbled characters when syncing Oracle data to Doris using SeaTunnel.
No more broken characters—your data pipeline just got smarter. 🚀