Overview

Multi-Catalog is designed to make it easier to connect to external data catalogs to enhance Doris’s data lake analysis and federated data query capabilities.

In older versions of Doris, user data is in a two-tiered structure: database and table. Thus, connections to external catalogs could only be done at the database or table level. For example, users could create a mapping to a table in an external catalog via create external table, or to a database via create external database . If there were large amounts of databases or tables in the external catalog, users would need to create mappings to them one by one, which could be a heavy workload.

With the advent of Multi-Catalog, Doris now has a new three-tiered metadata hierarchy (catalog -> database -> table), which means users can connect to external data at the catalog level. The currently supported external catalogs include:

  1. Apache Hive
  2. Apache Iceberg
  3. Apache Hudi
  4. Elasticsearch
  5. JDBC
  6. Apache Paimon(Incubating)

Multi-Catalog works as an additional and enhanced external table connection method. It helps users conduct multi-catalog federated queries quickly.

Basic Concepts

  1. Internal Catalog

    Existing databases and tables in Doris are all under the Internal Catalog, which is the default catalog in Doris and cannot be modified or deleted.

  2. External Catalog

    Users can create an External Catalog using the CREATE CATALOG command, and view the existing Catalogs via the SHOW CATALOGS command.

  3. Switch Catalog

    After login, you will enter the Internal Catalog by default. Then, you can view or switch to your target database via SHOW DATABASES and USE DB .

    Example of switching catalog:

    1. SWITCH internal;
    2. SWITCH hive_catalog;

    After switching catalog, you can view or switch to your target database in that catalog via SHOW DATABASES and USE DB . You can view and access data in External Catalogs the same way as doing that in the Internal Catalog.

    Doris only supports read-only access to data in External Catalogs currently.

  4. Delete Catalog

    Databases and tables in External Catalogs are for read only, but External Catalogs are deletable via the DROP CATALOG command. (The Internal Catalog cannot be deleted.)

    The deletion only means to remove the mapping in Doris to the corresponding catalog. It doesn’t change the external catalog itself by all means.

Examples

Connect to Hive

The followings are the instruction on how to connect to a Hive catalog using the Catalog feature.

For more information about Hive, please see Hive.

  1. Create Catalog

    1. CREATE CATALOG hive PROPERTIES (
    2. 'type'='hms',
    3. 'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
    4. );

    Syntax Help: CREATE CATALOG

  2. View Catalog

    View existing Catalogs via the SHOW CATALOGS command:

    1. mysql> SHOW CATALOGS;
    2. +-----------+-------------+----------+-----------+-------------------------+---------------------+------------------------+
    3. | CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment |
    4. +-----------+-------------+----------+-----------+-------------------------+---------------------+------------------------+
    5. | 10024 | hive | hms | yes | 2023-12-25 16:11:41.687 | 2023-12-25 20:43:18 | NULL |
    6. | 0 | internal | internal | | UNRECORDED | NULL | Doris internal catalog |
    7. +-----------+-------------+----------+-----------+-------------------------+---------------------+------------------------+

    Syntax Help: SHOW CATALOGS

    You can view the CREATE CATALOG statement via SHOW CREATE CATALOG.

    You can modify the Catalog PROPERTIES via ALTER CATALOG.

  3. Switch Catalog

    Switch to the Hive Catalog using the SWITCH command, and view the databases in it:

    1. mysql> SWITCH hive;
    2. Query OK, 0 rows affected (0.00 sec)
    3. mysql> SHOW DATABASES;
    4. +-----------+
    5. | Database |
    6. +-----------+
    7. | default |
    8. | random |
    9. | ssb100 |
    10. | tpch1 |
    11. | tpch100 |
    12. | tpch1_orc |
    13. +-----------+

    Syntax Help: SWITCH

  4. Use the Catalog

    After switching to the Hive Catalog, you can use the relevant features.

    For example, you can switch to Database tpch100, and view the tables in it:

    1. mysql> USE tpch100;
    2. Database changed
    3. mysql> SHOW TABLES;
    4. +-------------------+
    5. | Tables_in_tpch100 |
    6. +-------------------+
    7. | customer |
    8. | lineitem |
    9. | nation |
    10. | orders |
    11. | part |
    12. | partsupp |
    13. | region |
    14. | supplier |
    15. +-------------------+

    You can view the schema of Table lineitem:

    1. mysql> DESC lineitem;
    2. +-----------------+---------------+------+------+---------+-------+
    3. | Field | Type | Null | Key | Default | Extra |
    4. +-----------------+---------------+------+------+---------+-------+
    5. | l_shipdate | DATE | Yes | true | NULL | |
    6. | l_orderkey | BIGINT | Yes | true | NULL | |
    7. | l_linenumber | INT | Yes | true | NULL | |
    8. | l_partkey | INT | Yes | true | NULL | |
    9. | l_suppkey | INT | Yes | true | NULL | |
    10. | l_quantity | DECIMAL(15,2) | Yes | true | NULL | |
    11. | l_extendedprice | DECIMAL(15,2) | Yes | true | NULL | |
    12. | l_discount | DECIMAL(15,2) | Yes | true | NULL | |
    13. | l_tax | DECIMAL(15,2) | Yes | true | NULL | |
    14. | l_returnflag | TEXT | Yes | true | NULL | |
    15. | l_linestatus | TEXT | Yes | true | NULL | |
    16. | l_commitdate | DATE | Yes | true | NULL | |
    17. | l_receiptdate | DATE | Yes | true | NULL | |
    18. | l_shipinstruct | TEXT | Yes | true | NULL | |
    19. | l_shipmode | TEXT | Yes | true | NULL | |
    20. | l_comment | TEXT | Yes | true | NULL | |
    21. +-----------------+---------------+------+------+---------+-------+

    You can perform a query:

    1. mysql> SELECT l_shipdate, l_orderkey, l_partkey FROM lineitem limit 10;
    2. +------------+------------+-----------+
    3. | l_shipdate | l_orderkey | l_partkey |
    4. +------------+------------+-----------+
    5. | 1998-01-21 | 66374304 | 270146 |
    6. | 1997-11-17 | 66374304 | 340557 |
    7. | 1997-06-17 | 66374400 | 6839498 |
    8. | 1997-08-21 | 66374400 | 11436870 |
    9. | 1997-08-07 | 66374400 | 19473325 |
    10. | 1997-06-16 | 66374400 | 8157699 |
    11. | 1998-09-21 | 66374496 | 19892278 |
    12. | 1998-08-07 | 66374496 | 9509408 |
    13. | 1998-10-27 | 66374496 | 4608731 |
    14. | 1998-07-14 | 66374592 | 13555929 |
    15. +------------+------------+-----------+

    Or you can conduct a join query:

    1. mysql> SELECT l.l_shipdate FROM hive.tpch100.lineitem l WHERE l.l_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;
    2. +------------+
    3. | l_shipdate |
    4. +------------+
    5. | 1993-02-16 |
    6. | 1995-06-26 |
    7. | 1995-08-19 |
    8. | 1992-07-23 |
    9. | 1998-05-23 |
    10. | 1997-07-12 |
    11. | 1994-03-06 |
    12. | 1996-02-07 |
    13. | 1997-06-01 |
    14. | 1996-08-23 |
    15. +------------+

    The table is identified in the format of catalog.database.table . For example, internal.db1.part in the above snippet.

    If the target table is in the current Database of the current Catalog, catalog and database in the format can be omitted.

    You can use the INSERT INTO command to insert table data from the Hive Catalog into a table in the Internal Catalog. This is how you can import data from External Catalogs to the Internal Catalog:

    1. mysql> SWITCH internal;
    2. Query OK, 0 rows affected (0.00 sec)
    3. mysql> USE db1;
    4. Database changed
    5. mysql> INSERT INTO part SELECT * FROM hive.tpch100.part limit 1000;
    6. Query OK, 1000 rows affected (0.28 sec)
    7. {'label':'insert_212f67420c6444d5_9bfc184bf2e7edb8', 'status':'VISIBLE', 'txnId':'4'}

Column Type Mapping

After you create a Catalog, Doris will automatically synchronize the databases and tables from the corresponding external catalog to it. The following shows how Doris maps different types of catalogs and tables.

As for types that cannot be mapped to a Doris column type, such as UNION and INTERVAL , Doris will map them to an UNSUPPORTED type. Here are examples of queries in a table containing UNSUPPORTED types:

Suppose the table is of the following schema:

  1. k1 INT,
  2. k2 INT,
  3. k3 UNSUPPORTED,
  4. k4 INT
  1. select * from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
  2. select * except(k3) from table; // Query OK.
  3. select k1, k3 from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
  4. select k1, k4 from table; // Query OK.

You can find more details of the mapping of various data sources (Hive, Iceberg, Hudi, Elasticsearch, and JDBC) in the corresponding pages.

Privilege Management

When using Doris to access the data in the External Catalog, by default, it relies on Doris’s own permission access management function.

Along with the new Multi-Catalog feature, we also added privilege management at the Catalog level (See Privilege Management for details).

Users can also specify a custom authentication class through the access_controller.class attribute. As specified by:

"access_controller.class" = "org.apache.doris.catalog.authorizer.RangerHiveAccessControllerFactory"

Then you can use Apache Range to perform authentication management on Hive Catalog. For more information see: Hive Catalog

Database synchronizing management

Setting include_database_list and exclude_database_list in Catalog properties to specify databases to synchronize.

include_database_list: Only synchronize the specified databases. split with ,, default is to synchronize all databases. db name is case sensitive.

exclude_database_list: Specify databases that do not need to synchronize. split with ,, default is to synchronize all databases. db name is case sensitive.

When include_database_list and exclude_database_list specify overlapping databases, exclude_database_list would take effect with higher privilege over include_database_list.

To connect JDBC, these two properties should work with only_specified_database, see JDBC for more detail.

Metadata Refresh

By default, metadata changes in external catalogs, such as creating and dropping tables, adding and dropping columns, etc., will not be synchronized to Doris.

Users can refresh metadata in the following ways.

Manual refresh

Users need to manually refresh the metadata through the REFRESH command.

Regular refresh

When creating the catalog, specify the refresh time parameter metadata_refresh_interval_sec in the properties in seconds. If this parameter is set when creating the catalog, the FE master node will refresh the catalog regularly according to the parameter value. Currently three types of catalogs are supported:

  • hive: Hive MetaStore
  • es: Elasticsearch
  • jdbc: standard interface for database access (JDBC)
  1. -- Set the catalog refresh interval to 20 seconds
  2. CREATE CATALOG es PROPERTIES (
  3. "type"="es",
  4. "hosts"="http://127.0.0.1:9200",
  5. "metadata_refresh_interval_sec"="20"
  6. );

Auto Refresh

Auto-refresh currently only supports Hive Catalog.