Metadata Operations and Maintenance
This document focuses on how to manage Doris metadata in a real production environment. It includes the proposed deployment of FE nodes, some commonly used operational methods, and common error resolution methods.
For the time being, read the Doris metadata design document to understand how Doris metadata works.
Important tips
- Current metadata design is not backward compatible. That is, if the new version has a new metadata structure change (you can see whether there is a new VERSION in the `FeMetaVersion. java’file in the FE code), it is usually impossible to roll back to the old version after upgrading to the new version. Therefore, before upgrading FE, be sure to test metadata compatibility according to the operations in the Upgrade Document.
Metadata catalog structure
Let’s assume that the path of meta_dir
specified in fe.conf is path/to/palo-meta
. In a normal Doris cluster, the directory structure of metadata should be as follows:
/path/to/palo-meta/
|-- bdb/
| |-- 00000000.jdb
| |-- je.config.csv
| |-- je.info.0
| |-- je.info.0.lck
| |-- je.lck
| `-- je.stat.csv
`-- image/
|-- ROLE
|-- VERSION
`-- image.xxxx
bdb
We use [bdbje] (https://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html) as a distributed kV system to store metadata journal. This BDB directory is equivalent to the “data directory” of bdbje.
The
.jdb
suffix is the data file of bdbje. These data files will increase with the increasing number of metadata journals. When Doris regularly completes the image, the old log is deleted. So normally, the total size of these data files varies from several MB to several GB (depending on how Doris is used, such as import frequency). When the total size of the data file is larger than 10GB, you may need to wonder whether the image failed or the historical journals that failed to distribute the image could not be deleted.je.info.0
is the running log of bdbje. The time in this log is UTC + 0 time zone. We may fix this in a later version. From this log, you can also see how some bdbje works.image directory
The image directory is used to store metadata mirrors generated regularly by Doris. Usually, you will see a
image.xxxxx
mirror file. Wherexxxxx
is a number. This number indicates that the image contains all metadata journal beforexxxx
. And the generation time of this file (viewed throughls -al
) is usually the generation time of the mirror.You may also see a
image.ckpt
file. This is a metadata mirror being generated. Thedu -sh
command should show that the file size is increasing, indicating that the mirror content is being written to the file. When the mirror is written, it automatically renames itself to a newimage.xxxxx
and replaces the old image file.Only FE with a Master role will actively generate image files on a regular basis. After each generation, FE is pushed to other non-Master roles. When it is confirmed that all other FEs have received this image, Master FE deletes the metadata journal in bdbje. Therefore, if image generation fails or image push fails to other FEs, data in bdbje will accumulate.
ROLE
file records the type of FE (FOLLOWER or OBSERVER), which is a text file.VERSION
file records the cluster ID of the Doris cluster and the token used to access authentication between nodes, which is also a text file.ROLE
file andVERSION
file may only exist at the same time, or they may not exist at the same time (e.g. at the first startup).
Basic operations
Start single node FE
Single node FE is the most basic deployment mode. A complete Doris cluster requires at least one FE node. When there is only one FE node, the type of the node is Follower and the role is Master.
First start-up
Suppose the path of
meta_dir
specified in fe.conf ispath/to/palo-meta
.Ensure that
path/to/palo-meta
already exists, that the permissions are correct and that the directory is empty.Start directly through
sh bin/start_fe.sh
.After booting, you should be able to see the following log in fe.log:
- Palo FE starting…
- image does not exist: /path/to/palo-meta/image/image.0
- transfer from INIT to UNKNOWN
- transfer from UNKNOWN to MASTER
- the very first time to open bdb, dbname is 1
- start fencing, epoch number is 1
- finish replay in xxx msec
- QE service start
- thrift server started
The above logs are not necessarily strictly in this order, but they are basically similar.
The first start-up of a single-node FE usually does not encounter problems. If you haven’t seen the above logs, generally speaking, you haven’t followed the document steps carefully, please read the relevant wiki carefully.
Restart
Stopped FE nodes can be restarted by using
sh bin/start_fe.sh
.After restarting, you should be able to see the following log in fe.log:
Palo FE starting…
finished to get cluster id: xxxx, role: FOLLOWER and node name: xxxx
If no image has been generated before reboot, you will see:
image does not exist: /path/to/palo-meta/image/image.0
If an image is generated before the restart, you will see:
start load image from /path/to/palo-meta/image/image.xxx. is ckpt: false
finished load image in xxx ms
transfer from INIT to UNKNOWN
replayed journal id is xxxx, replay to journal id is yyyy
transfer from UNKNOWN to MASTER
finish replay in xxx msec
master finish replay journal, can write now.
begin to generate new image: image.xxxx
start save image to /path/to/palo-meta/image/image.ckpt. is ckpt: true
finished save image /path/to/palo-meta/image/image.ckpt in xxx ms. checksum is xxxx
push image.xxx to other nodes. totally xx nodes, push successed xx nodes
QE service start
thrift server started
The above logs are not necessarily strictly in this order, but they are basically similar.
Common problems
For the deployment of single-node FE, start-stop usually does not encounter any problems. If you have any questions, please refer to the relevant Wiki and check your operation steps carefully.
Add FE
Adding FE processes is described in detail in the [Deployment and Upgrade Documents] (https://github.com/apache/incubator-doris/wiki/Doris-Deploy-%26-Upgrade) and will not be repeated. Here are some points for attention, as well as common problems.
Notes
- Before adding a new FE, make sure that the current Master FE runs properly (connection is normal, JVM is normal, image generation is normal, bdbje data directory is too large, etc.)
- The first time you start a new FE, you must make sure that the
-helper
parameter is added to point to Master FE. There is no need to add-helper
when restarting. (If-helper
is specified, FE will directly ask the helper node for its role. If not, FE will try to obtain information fromROLE
andVERSION
files in thepalo-meta/image/
directory. - The first time you start a new FE, you must make sure that the
meta_dir
of the FE is created, has correct permissions and is empty. - Starting a new FE and executing the
ALTER SYSTEM ADD FOLLOWER/OBSERVER
statement adds FE to metadata in a sequence that is not required. If a new FE is started first and no statement is executed, thecurrent node is not added to the group. Please add it first.
in the new FE log. When the statement is executed, it enters the normal process. - Make sure that after the previous FE is added successfully, the next FE is added.
- 建议直接连接到 MASTER FE 执行
ALTER SYSTEM ADD FOLLOWER/OBSERVER
语句。
Common problems
this need is DETACHED
When you first start a FE to be added, if the data in palo-meta/bdb on Master FE is large, you may see the words
this node is DETACHED
. in the FE log to be added. At this point, bdbje is copying data, and you can see that thebdb/
directory of FE to be added is growing. This process usually takes several minutes (depending on the amount of data in bdbje). Later, there may be some bdbje-related error stack information in fe. log. IfQE service start
andthrift server start
are displayed in the final log, the start is usually successful. You can try to connect this FE via mysql-client. If these words do not appear, it may be the problem of bdbje replication log timeout. At this point, restarting the FE directly will usually solve the problem.Failure to add due to various reasons
If OBSERVER is added, because OBSERVER-type FE does not participate in the majority of metadata writing, it can theoretically start and stop at will. Therefore, for the case of adding OBSERVER failure. The process of OBSERVER FE can be killed directly. After clearing the metadata directory of OBSERVER, add the process again.
If FOLLOWER is added, because FOLLOWER is mostly written by participating metadata. So it is possible that FOLLOWER has joined the bdbje electoral team. If there are only two FOLLOWER nodes (including MASTER), then stopping one FE may cause another FE to quit because it cannot write most of the time. At this point, we should first delete the newly added FOLLOWER node from the metadata through the
ALTER SYSTEM DROP FOLLOWER
command, then kill the FOLLOWER process, empty the metadata and re-add the process.
Delete FE
The corresponding type of FE can be deleted by the ALTER SYSTEM DROP FOLLOWER/OBSERVER
command. The following points should be noted:
For OBSERVER type FE, direct DROP is enough, without risk.
For FOLLOWER type FE. First, you should make sure that you start deleting an odd number of FOLLOWERs (three or more).
- If the FE of non-MASTER role is deleted, it is recommended to connect to MASTER FE, execute DROP command, and then kill the process.
- If you want to delete MASTER FE, first confirm that there are odd FOLLOWER FE and it works properly. Then kill the MASTER FE process first. At this point, a FE will be elected MASTER. After confirming that the remaining FE is working properly, connect to the new MASTER FE and execute the DROP command to delete the old MASTER FE.
Advanced Operations
Failure recovery
FE may fail to start bdbje and synchronize between FEs for some reasons. Phenomena include the inability to write metadata, the absence of MASTER, and so on. At this point, we need to manually restore the FE. The general principle of manual recovery of FE is to start a new MASTER through metadata in the current meta_dir
, and then add other FEs one by one. Please follow the following steps strictly:
First, stop all FE processes and all business access. Make sure that during metadata recovery, external access will not lead to other unexpected problems.
Identify which FE node’s metadata is up-to-date:
- First of all, be sure to back up all FE’s
meta_dir
directories first. - Usually, Master FE’s metadata is up to date. You can see the suffix of image.xxxx file in the
meta_dir/image
directory. The larger the number, the newer the metadata. - Usually, by comparing all FOLLOWER FE image files, you can find the latest metadata.
- After that, we use the FE node with the latest metadata to recover.
- If using metadata of OBSERVER node to recover will be more troublesome, it is recommended to choose FOLLOWER node as far as possible.
- First of all, be sure to back up all FE’s
The following operations are performed on the FE nodes selected in step 2.
- If the node is an OBSERVER, first change the
role=OBSERVER
in themeta_dir/image/ROLE
file torole=FOLLOWER
. (Recovery from the OBSERVER node will be more cumbersome, first follow the steps here, followed by a separate description) - Add configuration in fe.conf:
metadata_failure_recovery=true
. - Run
sh bin/start_fe.sh
to start the FE - If normal, the FE will start in the role of MASTER, similar to the description in the previous section
Start a single node FE
. You should see the wordstransfer from XXXX to MASTER
in fe.log. - After the start-up is completed, connect to the FE first, and execute some query imports to check whether normal access is possible. If the operation is not normal, it may be wrong. It is recommended to read the above steps carefully and try again with the metadata previously backed up. If not, the problem may be more serious.
- If successful, through the
show frontends;
command, you should see all the FEs you added before, and the current FE is master. - Delete the
metadata_failure_recovery=true
configuration item in fe.conf, or set it tofalse
, and restart the FE (Important).
If you are recovering metadata from an OBSERVER node, after completing the above steps, you will find that the current FE role is OBSERVER, but
IsMaster
appears astrue
. This is because the “OBSERVER” seen here is recorded in Doris’s metadata, but whether it is master or not, is recorded in bdbje’s metadata. Because we recovered from an OBSERVER node, there was inconsistency. Please take the following steps to fix this problem (we will fix it in a later version):- First, all FE nodes except this “OBSERVER” are DROPed out.
- A new FOLLOWER FE is added through the
ADD FOLLOWER
command, assuming that it is on hostA. - Start a new FE on hostA and join the cluster by
helper
. - After successful startup, you should see two FEs through the
show frontends;
statement, one is the previous OBSERVER, the other is the newly added FOLLOWER, and the OBSERVER is the master. - After confirming that the new FOLLOWER is working properly, the new FOLLOWER metadata is used to perform a failure recovery operation again.
- The purpose of the above steps is to manufacture a metadata of FOLLOWER node artificially, and then use this metadata to restart fault recovery. This avoids inconsistencies in recovering metadata from OBSERVER.
The meaning of
metadata_failure_recovery = true
is to empty the metadata ofbdbje
. In this way, bdbje will not contact other FEs before, but start as a separate FE. This parameter needs to be set to true only when restoring startup. After recovery, it must be set to false. Otherwise, once restarted, the metadata of bdbje will be emptied again, which will make other FEs unable to work properly.- If the node is an OBSERVER, first change the
After the successful execution of step 3, we delete the previous FEs from the metadata by using the
ALTER SYSTEM DROP FOLLOWER/OBSERVER
command and add them again by adding new FEs.If the above operation is normal, it will be restored.
FE type change
If you need to change the existing FOLLOWER/OBSERVER type FE to OBSERVER/FOLLOWER type, please delete FE in the way described above, and then add the corresponding type FE.
FE Migration
If you need to migrate one FE from the current node to another, there are several scenarios.
FOLLOWER, or OBSERVER migration for non-MASTER nodes
After adding a new FOLLOWER / OBSERVER directly, delete the old FOLLOWER / OBSERVER.
Single-node MASTER migration
When there is only one FE, refer to the
Failure Recovery
section. Copy the palo-meta directory of FE to the new node and start the new MASTER in Step 3 of theFailure Recovery
sectionA set of FOLLOWER migrates from one set of nodes to another set of new nodes
Deploy FE on the new node and add the new node first by adding FOLLOWER. The old nodes can be dropped by DROP one by one. In the process of DROP-by-DROP, MASTER automatically selects the new FOLLOWER node.
Replacement of FE port
FE currently has the following ports
- Ed_log_port: bdbje’s communication port
- http_port: http port, also used to push image
- rpc_port:FE 的 thrift server port
- query_port: Mysql connection port
edit_log_port
If this port needs to be replaced, it needs to be restored with reference to the operations in the
Failure Recovery
section. Because the port has been persisted into bdbje’s own metadata (also recorded in Doris’s own metadata), it is necessary to clear bdbje’s metadata by settingmetadata_failure_recovery=true
.http_port
All FE http_ports must be consistent. So if you want to modify this port, all FEs need to be modified and restarted. Modifying this port will be more complex in the case of multiple FOLLOWER deployments (involving laying eggs and laying hens…), so this operation is not recommended. If necessary, follow the operation in the
Failure Recovery
section directly.rpc_port
After modifying the configuration, restart FE directly. Master FE informs BE of the new port through heartbeat. Only this port of Master FE will be used. However, it is still recommended that all FE ports be consistent.
query_port
After modifying the configuration, restart FE directly. This only affects mysql’s connection target.
Best Practices
The deployment recommendation of FE is described in the Installation and Deployment Document. Here are some supplements.
If you don’t know the operation logic of FE metadata very well, or you don’t have enough experience in the operation and maintenance of FE metadata, we strongly recommend that only one FOLLOWER-type FE be deployed as MASTER in practice, and the other FEs are OBSERVER, which can reduce many complex operation and maintenance problems. Don’t worry too much about the failure of MASTER single point to write metadata. First, if you configure it properly, FE as a java process is very difficult to hang up. Secondly, if the MASTER disk is damaged (the probability is very low), we can also use the metadata on OBSERVER to recover manually through
fault recovery
.The JVM of the FE process must ensure sufficient memory. We strongly recommend that FE’s JVM memory should be at least 10GB and 32GB to 64GB. And deploy monitoring to monitor JVM memory usage. Because if OOM occurs in FE, metadata writing may fail, resulting in some failures that cannot recover!
FE nodes should have enough disk space to prevent the excessive metadata from causing insufficient disk space. At the same time, FE logs also take up more than a dozen gigabytes of disk space.
Other common problems
fe.log 中一直滚动
meta out of date. current time: xxx, synchronized time: xxx, has log: xxx, fe type: xxx
This is usually because the FE cannot elect Master. For example, if three FOLLOWERs are configured, but only one FOLLOWER is started, this FOLLOWER will cause this problem. Usually, just start the remaining FOLLOWER. If the problem has not been solved after the start-up, manual recovery may be required in accordance with the way in the
Failure Recovery
section.Clock delta: xxxx ms. between Feeder: xxxx and this Replica exceeds max permissible delta: xxxx ms.
Bdbje requires that clock errors between nodes should not exceed a certain threshold. If exceeded, the node will exit abnormally. The default threshold is 5000ms, which is controlled by FE parameter `max_bdbje_clock_delta_ms’, and can be modified as appropriate. But we suggest using NTP and other clock synchronization methods to ensure the clock synchronization of Doris cluster hosts.
Mirror files in the
image/
directory have not been updated for a long timeMaster FE generates a mirror file by default for every 50,000 metadata journal. In a frequently used cluster, a new image file is usually generated every half to several days. If you find that the image file has not been updated for a long time (for example, more than a week), you can see the reasons in sequence as follows:
Search for
memory is not enough to do checkpoint. Committed memroy XXXX Bytes, used memory XXXX Bytes.
in the fe.log of Master FE. If found, it indicates that the current FE’s JVM memory is insufficient for image generation (usually we need to reserve half of the FE memory for image generation). Then you need to add JVM memory and restart FE before you can observe. Each time Master FE restarts, a new image is generated directly. This restart method can also be used to actively generate new images. Note that if there are multiple FOLLOWER deployments, then when you restart the current Master FE, another FOLLOWER FE will become MASTER, and subsequent image generation will be the responsibility of the new Master. Therefore, you may need to modify the JVM memory configuration of all FOLLOWER FE.Search for
begin to generate new image: image.xxxx
in the fe.log of Master FE. If it is found, then the image is generated. Check the subsequent log of this thread, and ifcheckpoint finished save image.xxxx
appears, the image is written successfully. IfException when generating new image file
occurs, the generation fails and specific error messages need to be viewed.
The size of the
bdb/
directory is very large, reaching several Gs or more.The BDB directory will remain large for some time after eliminating the error that the new image cannot be generated. Maybe it’s because Master FE failed to push image. You can search
push image.XXXX to other nodes. totally XX nodes, push successed YY nodes
in the fe. log of Master FE. If YY is smaller than xx, then some FEs are not pushed successfully. You can see the specific errorException when pushing image file.url = xxx
in the fe. log.At the same time, you can add the configuration in the FE configuration file:
edit_log_roll_num = xxxx
. This parameter sets the number of metadata journals and makes an image once. The default is 50000. This number can be reduced appropriately to make images more frequent, thus speeding up the deletion of old journals.FOLLOWER FE hangs up one after another
Because Doris’s metadata adopts the majority writing strategy, that is, a metadata journal must be written to at least a number of FOLLOWER FEs (for example, three FOLLOWERs, two must be written successfully) before it can be considered successful. If the write fails, the FE process exits on its own initiative. So suppose there are three FOLLOWERs: A, B and C. C hangs up first, and then B hangs up, then A will hang up. So as described in the
Best Practices
section, if you don’t have extensive experience in metadata operations and maintenance, it’s not recommended to deploy multiple FOLLOWERs.fe.log 中出现
get exception when try to close previously opened bdb database. ignore it
If there is the word
ignore it
behind it, there is usually no need to deal with it. If you are interested, you can search for this error inBDBEnvironment.java
, and see the annotations.From
show frontends;
Look, theJoin
of a FE is listed astrue
, but actually the FE is abnormal.Through
show frontends;
see theJoin
information. If the column istrue
, it only means that the FE has joined the cluster. It does not mean that it still exists normally in the cluster. Iffalse
, it means that the FE has never joined the cluster.Configuration of FE
master_sync_policy
,replica_sync_policy
, andtxn_rollback_limit.
master_sync_policy
is used to specify whether fsync (),replica_sync_policy
is called when Leader FE writes metadata log, andreplica_sync_policy
is used to specify whether other Follower FE calls fsync () when FE HA deploys synchronous metadata. In earlier versions of Oris, these two parameters defaulted toWRITE_NO_SYNC
, i.e., fsync () was not called. In the latest version of Oris, the default has been changed toSYNC
, that is, fsync () is called. Calling fsync () significantly reduces the efficiency of metadata disk writing. In some environments, IOPS may drop to several hundred and the latency increases to 2-3ms (but it’s still enough for Doris metadata manipulation). Therefore, we recommend the following configuration:- For a single Follower FE deployment,
master_sync_policy
is set toSYNC
, which prevents the loss of metadata due to the downtime of the FE system. - For multi-Follower FE deployment, we can set
master_sync_policy
andreplica_sync_policy
toWRITE_NO_SYNC
, because we think that the probability of simultaneous outage of multiple systems is very low.
If
master_sync_policy
is set toWRITE_NO_SYNC
in a single Follower FE deployment, then a FE system outage may occur, resulting in loss of metadata. At this point, if other Observer FE attempts to restart, it may report an error:Node xxx must rollback xx total commits(numPassedDurableCommits of which were durable) to the earliest point indicated by transaction xxxx in order to rejoin the replication group, but the transaction rollback limit of xxx prohibits this.
- For a single Follower FE deployment,
This means that some transactions that have been persisted need to be rolled back, but the number of entries exceeds the upper limit. Here our default upper limit is 100, which can be changed by setting txn_rollback_limit
. This operation is only used to attempt to start FE normally, but lost metadata cannot be recovered.