Maintenance - Metadata Operations and Maintenance - 《Apache Doris v3.0 Documentation》

Important tips
Metadata catalog structure
Basic operations
Advanced Operations
Best Practices
Other common problems

danger

Avoid using metadata_failure_recovery unless absolutely necessary. Using it may cause metadata truncation, loss, and split-brains. Use cautiously to prevent irreversible data damage from improper operations.

This document focuses on how to manage Doris metadata in a real production environment. It includes the proposed deployment of FE nodes, some commonly used operational methods, and common error resolution methods.

For the time being, read the Doris metadata design document to understand how Doris metadata works.

Important tips

Current metadata design is not backward compatible. That is, if the new version has a new metadata structure change (you can see whether there is a new VERSION in the FeMetaVersion.java file in the FE code), it is usually impossible to roll back to the old version after upgrading to the new version. Therefore, before upgrading FE, be sure to test metadata compatibility according to the operations in the Upgrade Document.

Metadata catalog structure

Let’s assume that the path of meta_dir specified in fe.conf is path/to/doris-meta. In a normal Doris cluster, the directory structure of metadata should be as follows:

/path/to/doris-meta/
            |-- bdb/
            |   |-- 00000000.jdb
            |   |-- je.config.csv
            |   |-- je.info.0
            |   |-- je.info.0.lck
            |   |-- je.lck
            |   `-- je.stat.csv
            `-- image/
                |-- ROLE
                |-- VERSION
                `-- image.xxxx

bdb

We use bdbje as a distributed kV system to store metadata journal. This BDB directory is equivalent to the “data directory” of bdbje.

The .jdb suffix is the data file of bdbje. These data files will increase with the increasing number of metadata journals. When Doris regularly completes the image, the old log is deleted. So normally, the total size of these data files varies from several MB to several GB (depending on how Doris is used, such as import frequency). When the total size of the data file is larger than 10GB, you may need to wonder whether the image failed or the historical journals that failed to distribute the image could not be deleted.

je.info.0 is the running log of bdbje. The time in this log is UTC + 0 time zone. From this log, you can also see how some bdbje works.
image directory

The image directory is used to store metadata mirrors generated regularly by Doris. Usually, you will see a image.xxxxx mirror file. Where xxxxx is a number. This number indicates that the image contains all metadata journal before xxxx. And the generation time of this file (viewed through ls -al) is usually the generation time of the mirror.

You may also see a image.ckpt file. This is a metadata mirror being generated. The du -sh command should show that the file size is increasing, indicating that the mirror content is being written to the file. When the mirror is written, it automatically renames itself to a new image.xxxxx and replaces the old image file.

Only FE with a Master role will actively generate image files on a regular basis. After each generation, FE is pushed to other non-Master roles. When it is confirmed that all other FEs have received this image, Master FE deletes the metadata journal in bdbje. Therefore, if image generation fails or image push fails to other FEs, data in bdbje will accumulate.

ROLE file records the type of FE (FOLLOWER or OBSERVER), which is a text file.

VERSION file records the cluster ID of the Doris cluster and the token used to access authentication between nodes, which is also a text file.

ROLE file and VERSION file may only exist at the same time, or they may not exist at the same time (e.g. at the first startup).

Basic operations

Start single node FE

Single node FE is the most basic deployment mode. A complete Doris cluster requires at least one FE node. When there is only one FE node, the type of the node is Follower and the role is Master.

First start-up
1. Suppose the path of meta_dir specified in fe.conf is path/to/doris-meta.
2. Ensure that path/to/doris-meta already exists, that the permissions are correct and that the directory is empty.
3. Start directly through bash bin/start_fe.sh --daemon.
4. After booting, you should be able to see the following log in fe.log:
  - Palo FE starting…
  - image does not exist: /path/to/doris-meta/image/image.0
  - transfer from INIT to UNKNOWN
  - transfer from UNKNOWN to MASTER
  - the very first time to open bdb, dbname is 1
  - start fencing, epoch number is 1
  - finish replay in xxx msec
  - QE service start
  - thrift server started
  The above logs are not necessarily strictly in this order, but they are basically similar.
5. The first start-up of a single-node FE usually does not encounter problems. If you haven’t seen the above logs, generally speaking, you haven’t followed the document steps carefully, please read the relevant wiki carefully.
Restart
1. Stopped FE nodes can be restarted by using bash bin/start_fe.sh.
2. After restarting, you should be able to see the following log in fe.log:
  - Palo FE starting…
  - finished to get cluster id: xxxx, role: FOLLOWER and node name: xxxx
  - If no image has been generated before reboot, you will see:
  - image does not exist: /path/to/doris-meta/image/image.0
  - If an image is generated before the restart, you will see:
  - start load image from /path/to/doris-meta/image/image.xxx. is ckpt: false
  - finished load image in xxx ms
  - transfer from INIT to UNKNOWN
  - replayed journal id is xxxx, replay to journal id is yyyy
  - transfer from UNKNOWN to MASTER
  - finish replay in xxx msec
  - master finish replay journal, can write now.
  - begin to generate new image: image.xxxx
  - start save image to /path/to/doris-meta/image/image.ckpt. is ckpt: true
  - finished save image /path/to/doris-meta/image/image.ckpt in xxx ms. checksum is xxxx
  - push image.xxx to other nodes. totally xx nodes, push succeeded xx nodes
  - QE service start
  - thrift server started
  The above logs are not necessarily strictly in this order, but they are basically similar.
Common problems

For the deployment of single-node FE, start-stop usually does not encounter any problems. If you have any questions, please refer to the relevant Wiki and check your operation steps carefully.

Add FE

Adding FE processes is described in detail in the Elastic Expansion Documents and will not be repeated. Here are some points for attention, as well as common problems.

Notes
- Before adding a new FE, make sure that the current Master FE runs properly (connection is normal, JVM is normal, image generation is normal, bdbje data directory is too large, etc.)
- The first time you start a new FE, you must make sure that the --helper parameter is added to point to Master FE. There is no need to add --helper when restarting. (If --helper is specified, FE will directly ask the helper node for its role. If not, FE will try to obtain information from ROLE and VERSION files in the doris-meta/image/ directory.
- The first time you start a new FE, you must make sure that the meta_dir of the FE is created, has correct permissions and is empty.
- Starting a new FE and executing the ALTER SYSTEM ADD FOLLOWER/OBSERVER statement adds FE to metadata in a sequence that is not required. If a new FE is started first and no statement is executed, the current node is not added to the group. Please add it first. in the new FE log. When the statement is executed, it enters the normal process.
- Make sure that after the previous FE is added successfully, the next FE is added.
- Connect to MASTER FE and execute ALTER SYSTEM ADD FOLLOWER/OBSERVER stmt.
Common problems
1. this need is DETACHED
  
  When you first start a FE to be added, if the data in doris-meta/bdb on Master FE is large, you may see the words this node is DETACHED. in the FE log to be added. At this point, bdbje is copying data, and you can see that the bdb/ directory of FE to be added is growing. This process usually takes several minutes (depending on the amount of data in bdbje). Later, there may be some bdbje-related error stack information in fe. log. If QE service start and thrift server start are displayed in the final log, the start is usually successful. You can try to connect this FE via mysql-client. If these words do not appear, it may be the problem of bdbje replication log timeout. At this point, restarting the FE directly will usually solve the problem.
2. Failure to add due to various reasons
  - If OBSERVER is added, because OBSERVER-type FE does not participate in the majority of metadata writing, it can theoretically start and stop at will. Therefore, for the case of adding OBSERVER failure. The process of OBSERVER FE can be killed directly. After clearing the metadata directory of OBSERVER, add the process again.
  - If FOLLOWER is added, because FOLLOWER is mostly written by participating metadata. So it is possible that FOLLOWER has joined the bdbje electoral team. If there are only two FOLLOWER nodes (including MASTER), then stopping one FE may cause another FE to quit because it cannot write most of the time. At this point, we should first delete the newly added FOLLOWER node from the metadata through the ALTER SYSTEM DROP FOLLOWER command, then kill the FOLLOWER process, empty the metadata and re-add the process.

Delete FE

The corresponding type of FE can be deleted by the ALTER SYSTEM DROP FOLLOWER/OBSERVER command. The following points should be noted:

For OBSERVER type FE, direct DROP is enough, without risk.
For FOLLOWER type FE. First, you should make sure that you start deleting an odd number of FOLLOWERs (three or more).
1. If the FE of non-MASTER role is deleted, it is recommended to connect to MASTER FE, execute DROP command, and then kill the process.
2. If you want to delete MASTER FE, first confirm that there are odd FOLLOWER FE and it works properly. Then kill the MASTER FE process first. At this point, a FE will be elected MASTER. After confirming that the remaining FE is working properly, connect to the new MASTER FE and execute the DROP command to delete the old MASTER FE.

Advanced Operations

FE Metadata Recovery Mode

Improper use or incorrect operations of the metadata recovery mode can lead to irreversible data damage in the production environment. Therefore, documentation for operating the metadata recovery mode is no longer provided. If there is a genuine need, please contact the developers in the Doris community for assistance.

FE type change

If you need to change the existing FOLLOWER/OBSERVER type FE to OBSERVER/FOLLOWER type, please delete FE in the way described above, and then add the corresponding type FE.

FE Migration

If you need to migrate one FE from the current node to another, there are several scenarios.

FOLLOWER, or OBSERVER migration for non-MASTER nodes

After adding a new FOLLOWER / OBSERVER directly, delete the old FOLLOWER / OBSERVER.
Single-node MASTER migration

If you are a developer, you can perform operations using the metadata recovery mode. However, if you are a user, it is not recommended to use the metadata recovery mode It is suggested to transfer data by rebuilding the environment and using external tables.
A set of FOLLOWER migrates from one set of nodes to another set of new nodes

Deploy FE on the new node and add the new node first by adding FOLLOWER. The old nodes can be dropped by DROP one by one. In the process of DROP-by-DROP, MASTER automatically selects the new FOLLOWER node.

Replacement of FE port

FE currently has the following ports

Ed_log_port: bdbje’s communication port
http_port: http port, also used to push image
rpc_port: thrift server port of Frontend
query_port: Mysql connection port
arrow_flight_sql_port: Arrow Flight SQL connection port

edit_log_port

If this port needs to be replaced, if multiple fe nodes are deployed, you can delete the old node and add the new node by node management step. if it is a single node, you can migrate a single Master fe node refer to “Single-node MASTER migration” in the above
http_port

All FE http_ports must be consistent. So if you want to modify this port, all FEs need to be stop, then be modified and restarted at the same time.
rpc_port

After modifying the configuration, restart FE directly. Master FE informs BE of the new port through heartbeat. Only this port of Master FE will be used. However, it is still recommended that all FE ports be consistent.
query_port

After modifying the configuration, restart FE directly. This only affects mysql’s connection target.
arrow_flight_sql_port

After modifying the configuration, restart FE directly. This only affects arrow flight sql server connection target.

View data in BDBJE (only used by debug)

The metadata log of FE is stored in BDBJE in the form of Key-Value. In some abnormal situations, FE may not be started due to metadata errors. In this case, Doris provides a way to help users query the data stored in BDBJE to facilitate troubleshooting.

First, you need to add configuration in fe.conf: enable_bdbje_debug_mode=true, and then start FE through bash start_fe.sh --daemon.

At this time, FE will enter the debug mode, only start the http server and MySQL server, and open the BDBJE instance, but will not load any metadata and other subsequent startup processes.

This is, we can view the data stored in BDBJE by visiting the web page of FE, or after connecting to Doris through the MySQL client, through show proc "/bdbje";.

mysql> show proc "/bdbje";
+----------+---------------+---------+
| DbNames  | JournalNumber | Comment |
+----------+---------------+---------+
| 110589   | 4273          |         |
| epochDB  | 4             |         |
| metricDB | 430694        |         |
+----------+---------------+---------+

The first level directory will display all the database names in BDBJE and the number of entries in each database.

mysql> show proc "/bdbje/110589";
+-----------+
| JournalId |
+-----------+
| 1         |
| 2         |
...
| 114858    |
| 114859    |
| 114860    |
| 114861    |
+-----------+
4273 rows in set (0.06 sec)

Entering the second level, all the entry keys under the specified database will be listed.

mysql> show proc "/bdbje/110589/114861";
+-----------+--------------+---------------------------------------------+
| JournalId | OpType       | Data                                        |
+-----------+--------------+---------------------------------------------+
| 114861    | OP_HEARTBEAT | org.apache.doris.persist.HbPackage@6583d5fb |
+-----------+--------------+---------------------------------------------+
1 row in set (0.05 sec)

The third level can display the value information of the specified key.

Best Practices

The deployment recommendation of FE is described in the Installation and Deployment Document. Here are some supplements.

If you don’t know the operation logic of FE metadata very well, or you don’t have enough experience in the operation and maintenance of FE metadata, we strongly recommend that only one FOLLOWER-type FE be deployed as MASTER in practice, and the other FEs are OBSERVER, which can reduce many complex operation and maintenance problems. Don’t worry too much about the failure of MASTER single point to write metadata. First, if you configure it properly, FE as a java process is very difficult to hang up. Secondly, if the MASTER disk is damaged (the probability is very low), we can also use the metadata on OBSERVER to recover manually through metadata recovery mode.
The JVM of the FE process must ensure sufficient memory. We strongly recommend that FE’s JVM memory should be at least 10GB and 32GB to 64GB. And deploy monitoring to monitor JVM memory usage. Because if OOM occurs in FE, metadata writing may fail, resulting in some failures that cannot recover!
FE nodes should have enough disk space to prevent the excessive metadata from causing insufficient disk space. At the same time, FE logs also take up more than a dozen gigabytes of disk space.

Metadata Operations and Maintenance