Sort Plugin

Overview

InLong Sort is an ETL service based on Apache Flink SQL. The expressive power of Flink SQL brings high scalability and flexibility: in principle, any semantics supported by Flink SQL are supported by InLong Sort. In scenarios where the built-in functions of Flink SQL do not meet the requirements, they can also be extended through UDFs in InLong Sort. At the same time, anyone who has used SQL, especially Flink SQL, will find it easy to get started.
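As a rough illustration of the kind of gap a UDF can fill, the sketch below shows only the eval logic of a hypothetical string-masking function. In a real Flink job the class would extend `org.apache.flink.table.functions.ScalarFunction` and be registered with the table environment; here it is plain Java so the logic can be shown in isolation:

```java
// Simplified sketch of UDF logic: a hypothetical masking function.
// In a real Flink job this class would extend
// org.apache.flink.table.functions.ScalarFunction; here it is plain Java
// so the eval logic can be shown in isolation.
class MaskUdf {
    // Keep the first `keep` characters and mask the rest with '*'.
    public String eval(String input, int keep) {
        if (input == null) {
            return null;
        }
        if (input.length() <= keep) {
            return input;
        }
        StringBuilder sb = new StringBuilder(input.substring(0, keep));
        for (int i = keep; i < input.length(); i++) {
            sb.append('*');
        }
        return sb.toString();
    }
}
```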

This article describes how to extend a new source (abstracted as an extract node in InLong) or a new sink (abstracted as a load node in InLong) in InLong Sort. The architecture of InLong Sort can be represented by the following UML object relation diagram:

(Figure: InLong Sort UML object relation diagram)

The concepts of each component are:

| Name | Description |
| --- | --- |
| Group | data flow group, including multiple data flows; one group represents one data access |
| Stream | data flow; a data flow has a specific flow direction |
| GroupInfo | encapsulation of data flows in Sort; one GroupInfo can contain multiple StreamInfo |
| StreamInfo | abstraction of a data flow in Sort, including various sources, transformations, destinations, etc. |
| Node | abstraction of data sources, data transformations and data destinations in data synchronization |
| ExtractNode | source-side abstraction for data synchronization |
| TransformNode | transformation abstraction for data synchronization |
| LoadNode | destination abstraction for data synchronization |
| NodeRelationShip | abstraction of the relationships between nodes in data synchronization |
| FieldRelationShip | abstraction of the relationships between upstream and downstream node fields in data synchronization |
| FieldInfo | node field |
| MetaFieldInfo | node meta field |
| Function | abstraction of a transformation function |
| FunctionParam | input parameter abstraction of a function |
| ConstantParam | constant parameter of a function |
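The containment relationships in the table can be sketched with simplified stand-in classes. These are illustrative only, not the real `org.apache.inlong.sort.protocol` types: a GroupInfo holds StreamInfo instances, and each StreamInfo holds the Nodes of one data flow.

```java
import java.util.List;

// Simplified stand-ins for the InLong Sort protocol hierarchy:
// a GroupInfo contains StreamInfos, and a StreamInfo contains Nodes
// (extract, transform, load). The real classes carry more state
// (fields, relations, properties); only the nesting is shown here.
class Hierarchy {
    interface Node { String getId(); }

    static class ExtractNode implements Node {
        private final String id;
        ExtractNode(String id) { this.id = id; }
        public String getId() { return id; }
    }

    static class LoadNode implements Node {
        private final String id;
        LoadNode(String id) { this.id = id; }
        public String getId() { return id; }
    }

    static class StreamInfo {
        final String streamId;
        final List<Node> nodes;
        StreamInfo(String streamId, List<Node> nodes) {
            this.streamId = streamId;
            this.nodes = nodes;
        }
    }

    static class GroupInfo {
        final String groupId;
        final List<StreamInfo> streams;
        GroupInfo(String groupId, List<StreamInfo> streams) {
            this.groupId = groupId;
            this.streams = streams;
        }
    }
}
```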

To extend the extract node or load node, you need to do the following:

  • Inherit the Node class (such as MyExtractNode) and build the specific extract or load logic;
  • In the specific Node class (such as MyExtractNode), specify the corresponding Flink connector;
  • Use the specific Node class (such as MyExtractNode) in the concrete ETL implementation logic.

In the second step, you can use an existing Flink connector or extend one yourself. For how to extend a Flink connector, please refer to the official Flink documentation on DataStream Connectors.
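The core of the second step is the `tableOptions()` override, which tells the Flink SQL planner which connector factory to instantiate. The pattern can be sketched with stand-in classes (illustrative only; the real base class is `org.apache.inlong.sort.protocol.node.ExtractNode`, and the `hosts` option is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch of the tableOptions() pattern: the base node contributes
// common options and each concrete node adds its own Flink connector id.
class TableOptionsSketch {
    static abstract class AbstractNode {
        // Common options shared by all nodes (simplified to an empty map).
        public Map<String, String> tableOptions() {
            return new HashMap<>();
        }
    }

    // Hypothetical extract node that binds itself to a Flink connector.
    static class MyExtractNode extends AbstractNode {
        @Override
        public Map<String, String> tableOptions() {
            Map<String, String> options = super.tableOptions();
            // The connector id must match a factory on the Flink classpath.
            options.put("connector", "mongodb-cdc");
            options.put("hosts", "localhost:27017"); // illustrative option only
            return options;
        }
    }
}
```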

Extend a new extract node

There are three steps to extend an ExtractNode:

Step 1: Inherit the ExtractNode class; the location of the class is:

```
inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/ExtractNode.java
```

Specify the connector in the implemented ExtractNode:

```java
// Inherit the ExtractNode class and implement a specific class, such as MongoExtractNode
@EqualsAndHashCode(callSuper = true)
@JsonTypeName("mongoExtract")
@Data
public class MongoExtractNode extends ExtractNode implements Serializable {

    @JsonInclude(Include.NON_NULL)
    @JsonProperty("primaryKey")
    private String primaryKey;
    ...

    @JsonCreator
    public MongoExtractNode(@JsonProperty("id") String id, ...) { ... }

    @Override
    public Map<String, String> tableOptions() {
        Map<String, String> options = super.tableOptions();
        // configure the specified connector; here it is mongodb-cdc
        options.put("connector", "mongodb-cdc");
        ...
        return options;
    }
}
```

Step 2: add the new ExtractNode to the JsonSubTypes of ExtractNode and Node:

```java
// add the field in JsonSubTypes of ExtractNode and Node
...
@JsonSubTypes({
        @JsonSubTypes.Type(value = MongoExtractNode.class, name = "mongoExtract")
})
...
public abstract class ExtractNode implements Node { ... }

...
@JsonSubTypes({
        @JsonSubTypes.Type(value = MongoExtractNode.class, name = "mongoExtract")
})
public interface Node { ... }
```

Step 3: Extend the Sort connector. Check whether the corresponding connector already exists in the inlong-sort/sort-connectors directory. If it does not, refer to the official Flink documentation on DataStream Connectors to extend one, directly reuse an existing Flink connector (such as inlong-sort/sort-connectors/mongodb-cdc), or implement the related connector yourself.
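As a rough illustration of what a custom source connector does at its core, the interface below is a simplified pure-Java stand-in: the real contract is Flink's `SourceFunction` (or the newer `Source` API) described in the DataStream Connectors documentation, which adds checkpointing and lifecycle concerns on top of this emit-until-cancelled shape.

```java
import java.util.List;

// Simplified stand-in for a streaming source: run() emits records to a
// collector until cancel() is called. Flink's real SourceFunction follows
// the same shape, plus checkpoint locking and lifecycle hooks.
class SourceSketch {
    interface Collector<T> { void collect(T record); }

    static class BoundedSource {
        private volatile boolean running = true;
        private final List<String> data;

        BoundedSource(List<String> data) { this.data = data; }

        // Emit each record unless the source has been cancelled.
        void run(Collector<String> out) {
            for (String record : data) {
                if (!running) {
                    break;
                }
                out.collect(record);
            }
        }

        void cancel() { running = false; }
    }
}
```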

Extend a new load node

There are three steps to extend a LoadNode:

Step 1: Inherit the LoadNode class; the location of the class is:

```
inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/LoadNode.java
```

Specify the connector in the implemented LoadNode:

```java
// Inherit the LoadNode class and implement a specific class, such as KafkaLoadNode
@EqualsAndHashCode(callSuper = true)
@JsonTypeName("kafkaLoad")
@Data
@NoArgsConstructor
public class KafkaLoadNode extends LoadNode implements Serializable {

    @Nonnull
    @JsonProperty("topic")
    private String topic;
    ...

    @JsonCreator
    public KafkaLoadNode(@Nonnull @JsonProperty("topic") String topic, ...) { ... }

    // configure and use different connectors according to different conditions
    @Override
    public Map<String, String> tableOptions() {
        ...
        if (format instanceof JsonFormat || format instanceof AvroFormat || format instanceof CsvFormat) {
            if (StringUtils.isEmpty(this.primaryKey)) {
                // kafka connector
                options.put("connector", "kafka");
                options.putAll(format.generateOptions(false));
            } else {
                // upsert-kafka connector
                options.put("connector", "upsert-kafka");
                options.putAll(format.generateOptions(true));
            }
        } else if (format instanceof CanalJsonFormat || format instanceof DebeziumJsonFormat) {
            // kafka-inlong connector
            options.put("connector", "kafka-inlong");
            options.putAll(format.generateOptions(false));
        } else {
            throw new IllegalArgumentException("kafka load Node format is IllegalArgument");
        }
        return options;
    }
}
```

Step 2: add the new LoadNode to the JsonSubTypes of LoadNode and Node:

```java
// add the field in JsonSubTypes of LoadNode and Node
...
@JsonSubTypes({
        @JsonSubTypes.Type(value = KafkaLoadNode.class, name = "kafkaLoad")
})
...
public abstract class LoadNode implements Node { ... }

...
@JsonSubTypes({
        @JsonSubTypes.Type(value = KafkaLoadNode.class, name = "kafkaLoad")
})
public interface Node { ... }
```

Step 3: Extend the Sort connector. Kafka's Sort connector is in inlong-sort/sort-connectors/kafka.
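Mirroring the source side, the essence of a sink connector can be sketched with a pure-Java stand-in: the real contract is Flink's `SinkFunction` (or the newer `Sink` API), which adds lifecycle and checkpointing hooks around the same record-by-record invoke shape.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a streaming sink: invoke() is called once per
// incoming record. Flink's real SinkFunction follows the same shape,
// with extra lifecycle and checkpointing hooks.
class SinkSketch {
    static class CollectingSink {
        private final List<String> buffer = new ArrayList<>();

        // Called by the framework for every incoming record.
        void invoke(String record) {
            buffer.add(record);
        }

        List<String> getBuffer() {
            return buffer;
        }
    }
}
```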

Bundle extract node and load node into InLong Sort

To integrate the extract and load nodes into the InLong Sort main process, you need to implement the semantics mentioned in the overview section: group, stream, node, etc. The entry class of InLong Sort is:

```
inlong-sort/sort-core/src/main/java/org/apache/inlong/sort/Entrance.java
```

How to integrate extract and load nodes into InLong Sort can be seen in the following unit test. First, build the corresponding ExtractNode and LoadNode, then build the NodeRelation, StreamInfo, and GroupInfo, and finally use FlinkSqlParser to parse and execute.

```java
public class MongoExtractToKafkaLoad extends AbstractTestBase {

    // create MongoExtractNode
    private MongoExtractNode buildMongoNode() {
        List<FieldInfo> fields = Arrays.asList(new FieldInfo("name", new StringFormatInfo()), ...);
        return new MongoExtractNode(..., fields, ...);
    }

    // create KafkaLoadNode
    private KafkaLoadNode buildAllMigrateKafkaNode() {
        List<FieldInfo> fields = Arrays.asList(new FieldInfo("name", new StringFormatInfo()), ...);
        List<FieldRelation> relations = Arrays.asList(new FieldRelation(new FieldInfo("name", new StringFormatInfo()), ...), ...);
        CsvFormat csvFormat = new CsvFormat();
        return new KafkaLoadNode(..., fields, relations, csvFormat, ...);
    }

    // create NodeRelation
    private NodeRelation buildNodeRelation(List<Node> inputs, List<Node> outputs) {
        List<String> inputIds = inputs.stream().map(Node::getId).collect(Collectors.toList());
        List<String> outputIds = outputs.stream().map(Node::getId).collect(Collectors.toList());
        return new NodeRelation(inputIds, outputIds);
    }

    // test the main flow: MongoDB to Kafka
    @Test
    public void testMongoDbToKafka() throws Exception {
        EnvironmentSettings settings = EnvironmentSettings. ... .build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        ...
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);
        Node inputNode = buildMongoNode();
        Node outputNode = buildAllMigrateKafkaNode();
        StreamInfo streamInfo = new StreamInfo("1", Arrays.asList(inputNode, outputNode), ...);
        GroupInfo groupInfo = new GroupInfo("1", Collections.singletonList(streamInfo));
        FlinkSqlParser parser = FlinkSqlParser.getInstance(tableEnv, groupInfo);
        ParseResult result = parser.parse();
        Assert.assertTrue(result.tryExecute());
    }
}
```