Configuration data and persistent storage

In a Doris cluster, the FE, BE, and CN components and the monitoring components all need to persist data to physical storage. Kubernetes provides Persistent Volumes (PVs) for this purpose. In a Kubernetes environment, there are two main types of Persistent Volumes:

  • Local PV storage (Local Persistent Volumes): With a local PV, Kubernetes uses a directory on the local disk of the host to persist container data. A local PV avoids network latency and gives better read and write performance on high-performance drives such as SSDs. Because a local PV is bound to its host, the data cannot follow a pod to another node when that host fails.
  • Network PV storage (Network Persistent Volumes): A network PV is a storage resource accessed over the network, and any node in the cluster can access it. When a host fails, the network PV can be mounted on another node and remain in use.

A StorageClass defines the type and behavior of a PV. It decouples disk resources from containers, which provides data persistence and reliability. Doris Operator supports both local PV and network PV when deploying Doris on Kubernetes, so you can choose according to business needs.
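As a rough illustration of how a StorageClass describes PV behavior, the following sketch defines a StorageClass for statically provisioned local PVs. The name doris-local is a hypothetical placeholder and does not come from Doris Operator; kubernetes.io/no-provisioner and WaitForFirstConsumer are the standard Kubernetes settings for local volumes.

```yaml
# Hypothetical StorageClass for statically provisioned local PVs.
# The name "doris-local" is illustrative only.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: doris-local
# Local volumes are created by hand, so no dynamic provisioner is used.
provisioner: kubernetes.io/no-provisioner
# Delay binding until a pod is scheduled, so the PV is chosen on the pod's node.
volumeBindingMode: WaitForFirstConsumer
```

A network-backed StorageClass would instead name the provisioner of the storage system in use.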

Warning

It is recommended to persist data to storage at deployment time. If PersistentVolumeClaim is not configured during deployment, Doris Operator will use emptyDir mode by default to store metadata, data, and logs. When the pod is restarted, related data will be lost.

Persistence directory type

In Doris, the following directories are recommended for persistent storage:

  • FE node: doris-meta, log
  • BE node: storage, log
  • CN node: storage, log
  • Broker node: log

Doris has multiple log types, such as the INFO log, OUT log, GC log, and audit log. Doris Operator can output logs to the console and to a specified directory at the same time. If your Kubernetes environment has complete log collection capabilities, Doris INFO logs can be collected from console output. It is recommended to persist all Doris logs to designated storage through PVC configuration, which helps when locating and troubleshooting problems.

Data persistence to network PV

By default, Doris Operator uses the Kubernetes default StorageClass for FE and BE storage. To use a specific network PV, set persistentVolumeClaimSpec.storageClassName in the DorisCluster CR to the corresponding StorageClass.

    persistentVolumes:
    - mountPath: /opt/apache-doris/fe/doris-meta
      name: storage0
      persistentVolumeClaimSpec:
        # To use a specific StorageClass, set storageClassName to its name.
        storageClassName: ${your_storageclass}
        accessModes:
        - ReadWriteOnce
        resources:
          # Note: if the storage size is less than 5G, FE will not start normally.
          requests:
            storage: 100Gi

FE configuration persistent storage

When deploying a cluster, it is recommended to provide persistent storage for the doris-meta and log directories in FE. The doris-meta directory stores metadata, which usually ranges from a few hundred MB to dozens of GB; reserving 100GB is recommended. The log directory stores FE logs; reserving 50GB is generally recommended.

In the following example, FE uses StorageClass to mount metadata storage and log storage:

    feSpec:
      persistentVolumes:
      # Metadata volume, sized to the 100GB recommendation.
      - name: fe-meta
        mountPath: /opt/apache-doris/fe/doris-meta
        persistentVolumeClaimSpec:
          storageClassName: ${storageClassName}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
      # Log volume, sized to the 50GB recommendation.
      - name: fe-log
        mountPath: /opt/apache-doris/fe/log
        persistentVolumeClaimSpec:
          storageClassName: ${storageClassName}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi

Replace ${storageClassName} with the name of the StorageClass to use. You can run the following command to list the StorageClasses available in the current Kubernetes cluster:

    kubectl get sc

The return result is as follows:

    NAME                          PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
    openebs-hostpath              openebs.io/local               Delete          WaitForFirstConsumer   false                  212d
    openebs-device                openebs.io/local               Delete          WaitForFirstConsumer   false                  212d
    openebs-jiva-csi-default      jiva.csi.openebs.io            Delete          Immediate              true                   212d
    local-storage                 kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  149d
    microk8s-hostpath (default)   microk8s.io/hostpath           Delete          Immediate              false                  219d
    doris-storage                 openebs.io/local               Delete          WaitForFirstConsumer   false                  54d

Tip

The default metadata path and log path can be changed through the ConfigMap:

  1. The mountPath of fe-meta must match the path set by the meta_dir variable in the ConfigMap. By default, metadata is written to the /opt/apache-doris/fe/doris-meta directory.
  2. The mountPath of fe-log must match the path set by the LOG_DIR variable in the ConfigMap. By default, log data is written to the /opt/apache-doris/fe/log directory.
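A minimal sketch of this correspondence, assuming a hypothetical custom metadata path /data/doris-meta and a ConfigMap named fe-conf. The first document is a ConfigMap fragment and the second is the matching feSpec fragment of the DorisCluster CR; the storage sizes are illustrative only.

```yaml
# ConfigMap fragment: fe.conf overrides the default metadata path.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fe-conf
data:
  fe.conf: |
    # /data/doris-meta is a hypothetical path chosen for illustration.
    meta_dir = /data/doris-meta
    # The default log path resolves to /opt/apache-doris/fe/log.
    LOG_DIR = ${DORIS_HOME}/log
    enable_fqdn_mode = true
---
# DorisCluster CR fragment: each mountPath matches the path set in fe.conf.
feSpec:
  persistentVolumes:
  - name: fe-meta
    mountPath: /data/doris-meta
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
  - name: fe-log
    mountPath: /opt/apache-doris/fe/log
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
```

If meta_dir and the mountPath differ, FE would write metadata outside the mounted volume, so the metadata would not be persisted.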

BE configuration persistent storage

When deploying a cluster, it is recommended to provide persistent storage for the storage and log directories in BE. The storage directory stores data, so its size should be estimated from the volume of business data. The log directory stores BE logs; reserving 50GB is generally recommended.

In the following example, BE uses StorageClass to mount the data storage and log storage:

    beSpec:
      persistentVolumes:
      # Data volume; size it according to the volume of business data.
      - mountPath: /opt/apache-doris/be/storage
        name: be-storage
        persistentVolumeClaimSpec:
          storageClassName: ${storageClassName}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
      # Log volume.
      - mountPath: /opt/apache-doris/be/log
        name: belog
        persistentVolumeClaimSpec:
          storageClassName: ${storageClassName}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi

Cluster deployment configuration

Cluster name

The cluster name can be configured by modifying metadata.name in DorisCluster Custom Resource.
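For orientation, a minimal sketch of where this field sits. The apiVersion value doris.selectdb.com/v1 is an assumption based on the Doris Operator CRD and should be checked against the installed operator version; doriscluster-sample is a placeholder name.

```yaml
# apiVersion is assumed here; verify it against the installed CRD.
apiVersion: doris.selectdb.com/v1
kind: DorisCluster
metadata:
  # The cluster name; "doriscluster-sample" is a placeholder.
  name: doriscluster-sample
```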

Image version

When deploying a Doris cluster, you can specify the cluster version through the component images. Make sure the versions of all components in the cluster are consistent. Set the version of each component by modifying spec.{feSpec|beSpec}.image.

Cluster topology

Before deploying a Doris cluster, plan the cluster topology based on your business needs. The number of nodes of each component is configured through spec.{feSpec|beSpec}.replicas. To keep production data highly available, Doris Operator requires the Kubernetes cluster to have at least 3 nodes. To ensure the availability of the Doris cluster, it is also recommended to deploy at least 3 FE nodes and 3 BE nodes.
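The image and replica settings described above might be combined as in the following sketch. The BE image tag is the one used elsewhere in this document; the FE image name is assumed by analogy and should be checked against the image registry.

```yaml
spec:
  feSpec:
    # At least 3 FE nodes are recommended for availability.
    replicas: 3
    # Assumed FE image name; keep the tag identical to the BE tag.
    image: selectdb/doris.fe-ubuntu:2.0.2
  beSpec:
    # At least 3 BE nodes are recommended for availability.
    replicas: 3
    image: selectdb/doris.be-ubuntu:2.0.2
```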

Service configuration

Kubernetes provides different Service types for exposing the external access interfaces of Doris, such as ClusterIP, NodePort, and LoadBalancer.

ClusterIP

A service of type ClusterIP will create a virtual IP inside the cluster. It can only be accessed within the Kubernetes cluster through ClusterIP and is not visible to the outside world. In Doris Custom Resource, the ClusterIP type Service is used by default.

NodePort

When LoadBalancer is not available, the service can be exposed through NodePort. NodePort exposes a service on each node’s IP at a static port. A NodePort service can be accessed from outside the cluster by requesting NodeIP + NodePort.

    ...
    feSpec:
      replicas: 3
      service:
        type: NodePort
    ...
    beSpec:
      replicas: 3
      service:
        type: NodePort
    ...
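Where the environment does provide a load balancer, the same field could carry the LoadBalancer type instead. This is only a sketch: it assumes that the installed Doris Operator version accepts LoadBalancer for service.type and that the Kubernetes cluster has a load balancer implementation available.

```yaml
spec:
  feSpec:
    replicas: 3
    service:
      # Assumes a load balancer implementation exists in the cluster.
      type: LoadBalancer
```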

Cluster parameter configuration

Doris uses ConfigMap in Kubernetes to decouple configuration files from services. All nodes of a Doris component share one ConfigMap for unified configuration management, so every node of that component starts with the same configuration. Doris system parameters are stored in the ConfigMap as key-value pairs. Before deploying a Doris cluster, deploy the ConfigMap in the same namespace.

The DorisCluster CR provides a ConfigMapInfo definition that mounts configuration information for each component. ConfigMapInfo contains two variables:

  • ConfigMapName is the name of the ConfigMap to use.
  • ResolveKey is the corresponding configuration file: fe.conf for FE configuration and be.conf for BE configuration.

FE ConfigMap

Define FE ConfigMap

To configure FE through a ConfigMap, first define the ConfigMap and apply it to the Kubernetes cluster.

The following example defines a ConfigMap named fe-conf:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fe-conf
      labels:
        app.kubernetes.io/component: fe
    data:
      fe.conf: |
        CUR_DATE=`date +%Y%m%d-%H%M%S`
        # the output dir of stderr and stdout
        LOG_DIR = ${DORIS_HOME}/log
        JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$DORIS_HOME/log/fe.gc.log.$CUR_DATE"
        # For jdk 9+, this JAVA_OPTS will be used as default JVM options
        JAVA_OPTS_FOR_JDK_9="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:$DORIS_HOME/log/fe.gc.log.$CUR_DATE:time"
        # INFO, WARN, ERROR, FATAL
        sys_log_level = INFO
        # NORMAL, BRIEF, ASYNC
        sys_log_mode = NORMAL
        # Default dirs to put jdbc drivers, default value is ${DORIS_HOME}/jdbc_drivers
        # jdbc_drivers_dir = ${DORIS_HOME}/jdbc_drivers
        http_port = 8030
        rpc_port = 9020
        query_port = 9030
        edit_log_port = 9010
        enable_fqdn_mode = true

Here, metadata.name defines the name of the FE ConfigMap, and data holds the database configuration from fe.conf. Be sure to add enable_fqdn_mode = true to any fe.conf you configure yourself.

Tip

Use the data field in the ConfigMap to store key-value pairs. In the above FE ConfigMap:

  • fe.conf is the key in the key-value pair. The | symbol means that newlines and indents in the subsequent string are preserved.
  • The subsequent configuration is the value in the key-value pair and is the same as the content of the fe.conf file. Because the | symbol preserves the format of the subsequent string, each configuration line under it must keep a two-space indent.

After defining the FE ConfigMap, apply it to the cluster with the kubectl apply command.

Using FE ConfigMap

To use the FE ConfigMap, specify it through spec.feSpec.configMapInfo in the DorisCluster CR.

    kind: DorisCluster
    metadata:
      name: doriscluster-sample-configmap
    spec:
      feSpec:
        configMapInfo:
          # Name of the FE ConfigMap defined above, for example fe-conf.
          configMapName: ${feConfigMapName}
          resolveKey: fe.conf
    ...

Replace ${feConfigMapName} with fe-conf in the above example to use the FE ConfigMap defined in the above example. For FE ConfigMap, you need to keep the resolveKey field fixed to fe.conf.

BE ConfigMap

Define BE ConfigMap

To configure BE through a ConfigMap, first define the ConfigMap and apply it to the Kubernetes cluster.

The following example defines a ConfigMap named be-conf:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: be-conf
      labels:
        app.kubernetes.io/component: be
    data:
      be.conf: |
        CUR_DATE=`date +%Y%m%d-%H%M%S`
        PPROF_TMPDIR="$DORIS_HOME/log/"
        JAVA_OPTS="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xloggc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"
        # For jdk 9+, this JAVA_OPTS will be used as default JVM options
        JAVA_OPTS_FOR_JDK_9="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xlog:gc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"
        # Since 1.2, JAVA_HOME needs to be set to run the BE process.
        # JAVA_HOME=/path/to/jdk/
        # https://github.com/apache/doris/blob/master/docs/zh-CN/community/developer-guide/debug-tool.md#jemalloc-heap-profile
        # https://jemalloc.net/jemalloc.3.html
        JEMALLOC_CONF="percpu_arena:percpu,background_thread:true,metadata_thp:auto,muzzy_decay_ms:15000,dirty_decay_ms:15000,oversize_threshold:0,lg_tcache_max:20,prof:false,lg_prof_interval:32,lg_prof_sample:19,prof_gdump:false,prof_accum:false,prof_leak:false,prof_final:false"
        JEMALLOC_PROF_PRFIX=""
        # INFO, WARNING, ERROR, FATAL
        sys_log_level = INFO
        # ports for admin, web, heartbeat service
        be_port = 9060
        webserver_port = 8040
        heartbeat_service_port = 9050
        brpc_port = 8060

Here, metadata.name defines the name of the BE ConfigMap, and data holds the database configuration from be.conf.

Tip

Use the data field in the ConfigMap to store key-value pairs. In the above BE ConfigMap:

  • be.conf is the key in the key-value pair. The | symbol means that newlines and indents in the subsequent string are preserved.
  • The subsequent configuration is the value in the key-value pair and is the same as the content of the be.conf file. Because the | symbol preserves the format of the subsequent string, each configuration line under it must keep a two-space indent.

After defining the BE ConfigMap, apply it to the cluster with the kubectl apply command.

Using BE ConfigMap

To use the BE ConfigMap, specify it through spec.beSpec.configMapInfo in the DorisCluster CR.

    kind: DorisCluster
    metadata:
      name: doriscluster-sample-configmap
    spec:
      beSpec:
        configMapInfo:
          # Name of the BE ConfigMap defined above, for example be-conf.
          configMapName: ${beConfigMapName}
          resolveKey: be.conf
    ...

Replace ${beConfigMapName} with be-conf in the above example to use the BE ConfigMap defined in the above example. For BE ConfigMap, you need to keep the resolveKey field fixed to be.conf.

Add external configuration files to the conf directory

When using the Catalog function to access external data sources, you need to add the relevant configuration files to the conf directory of the Doris nodes. For example, to access a Hive catalog, the core-site.xml, hdfs-site.xml, and hive-site.xml files must be placed in the conf directories of FE and BE.

In the Kubernetes environment, the relevant configuration files of the catalog need to be loaded into Doris in the form of ConfigMap. The following example shows loading the core-site.xml file into BE:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: be-configmap
      labels:
        app.kubernetes.io/component: be
    data:
      be.conf: |
        be_port = 9060
        webserver_port = 8040
        heartbeat_service_port = 9050
        brpc_port = 8060
      core-site.xml: |
        <?xml version="1.0" encoding="UTF-8"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
          <property>
            <name>hadoop.security.authentication</name>
            <value>kerberos</value>
          </property>
        </configuration>
    ...

The configured key-value pairs are stored in the data field. The above example stores two key-value pairs, with the keys be.conf and core-site.xml.

In the data field, the following key-value structure mapping needs to be satisfied:

    data:
      filename_1: |
        config_string
      filename_2: |
        config_string
      filename_3: |
        config_string
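The same structure applies to FE. The following sketch shows an FE ConfigMap that carries fe.conf together with a hive-site.xml file. The ConfigMap name fe-configmap and the metastore address are hypothetical placeholders; hive.metastore.uris is the standard Hive property for the metastore endpoint.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical name chosen for illustration.
  name: fe-configmap
  labels:
    app.kubernetes.io/component: fe
data:
  # Each key is a file name placed in the conf directory.
  fe.conf: |
    http_port = 8030
    rpc_port = 9020
    query_port = 9030
    edit_log_port = 9010
    enable_fqdn_mode = true
  hive-site.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <!-- Placeholder address; replace with the real metastore endpoint. -->
        <value>thrift://hive-metastore.example.com:9083</value>
      </property>
    </configuration>
```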

Configure multi-disk storage for BE

Doris supports mounting multiple PVs for BE. The BE parameter storage_root_path tells BE to use multi-disk storage. In a Kubernetes environment, you map the PVs in the DorisCluster CR and set the storage_root_path parameter for BE through a ConfigMap.

Configure PV mapping for BE multi-disk storage

Compared with the single-disk configuration, the DorisCluster CR file needs additional configMapInfo and persistentVolumeClaimSpec entries:

  • configMapInfo identifies the ConfigMap to use in the same namespace; its resolveKey is fixed to be.conf.
  • persistentVolumeClaimSpec entries configure multiple PV mappings for the BE storage directories.

In the following example, the pv mapping of two disks is configured for BE:

    ...
    beSpec:
      replicas: 3
      image: selectdb/doris.be-ubuntu:2.0.2
      limits:
        cpu: 8
        memory: 16Gi
      requests:
        cpu: 8
        memory: 16Gi
      configMapInfo:
        configMapName: be-configmap
        resolveKey: be.conf
      persistentVolumes:
      # First data disk.
      - mountPath: /opt/apache-doris/be/storage1
        name: storage2
        persistentVolumeClaimSpec:
          # To use a specific StorageClass, set storageClassName as in the commented example.
          #storageClassName: openebs-jiva-csi-default
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
      # Second data disk.
      - mountPath: /opt/apache-doris/be/storage2
        name: storage3
        persistentVolumeClaimSpec:
          # To use a specific StorageClass, set storageClassName as in the commented example.
          #storageClassName: openebs-jiva-csi-default
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
      # Log volume.
      - mountPath: /opt/apache-doris/be/log
        name: storage4
        persistentVolumeClaimSpec:
          # To use a specific StorageClass, set storageClassName as in the commented example.
          #storageClassName: openebs-jiva-csi-default
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi

In the above example, the Doris cluster specifies multi-disk storage:

  • beSpec.persistentVolumes lists multiple PVs as an array and maps two data storage PVs to /opt/apache-doris/be/storage{1,2}.
  • beSpec.configMapInfo specifies that the ConfigMap named be-configmap must be mounted.

Configure BE ConfigMap to specify the storage_root_path parameter

According to the BE ConfigMap name specified in DorisCluster CR, you need to create the corresponding ConfigMap and specify the storage_root_path parameter.

In the following example, the storage_root_path parameter is specified in the ConfigMap named be-configmap to use two disks:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: be-configmap
      labels:
        app.kubernetes.io/component: be
    data:
      be.conf: |
        CUR_DATE=`date +%Y%m%d-%H%M%S`
        PPROF_TMPDIR="$DORIS_HOME/log/"
        JAVA_OPTS="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xloggc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"
        # For jdk 9+, this JAVA_OPTS will be used as default JVM options
        JAVA_OPTS_FOR_JDK_9="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xlog:gc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"
        # Since 1.2, JAVA_HOME needs to be set to run the BE process.
        # JAVA_HOME=/path/to/jdk/
        # https://github.com/apache/doris/blob/master/docs/zh-CN/community/developer-guide/debug-tool.md#jemalloc-heap-profile
        # https://jemalloc.net/jemalloc.3.html
        JEMALLOC_CONF="percpu_arena:percpu,background_thread:true,metadata_thp:auto,muzzy_decay_ms:15000,dirty_decay_ms:15000,oversize_threshold:0,lg_tcache_max:20,prof:false,lg_prof_interval:32,lg_prof_sample:19,prof_gdump:false,prof_accum:false,prof_leak:false,prof_final:false"
        JEMALLOC_PROF_PRFIX=""
        # INFO, WARNING, ERROR, FATAL
        sys_log_level = INFO
        # ports for admin, web, heartbeat service
        be_port = 9060
        webserver_port = 8040
        heartbeat_service_port = 9050
        brpc_port = 8060
        # One entry per data disk, matching the data mountPath values in the DorisCluster CR.
        storage_root_path = /opt/apache-doris/be/storage1,medium:ssd;/opt/apache-doris/be/storage2,medium:ssd

Warning

When creating a BE ConfigMap, you need to pay attention to the following:

  1. metadata.name needs to be the same as beSpec.configMapInfo.configMapName in DorisCluster CR, indicating that the cluster uses the specified ConfigMap;
  2. The storage_root_path parameter in ConfigMap must correspond one-to-one with the persistentVolume data disk in DorisCluster CR.
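The two requirements can be seen side by side in the following condensed sketch, which keeps only the fields involved. It reuses the names and paths from the multi-disk example and is not a complete manifest; the first document is a DorisCluster CR fragment and the second is a ConfigMap fragment.

```yaml
# DorisCluster CR fragment: the ConfigMap name and the two data mount paths.
beSpec:
  configMapInfo:
    configMapName: be-configmap   # must equal metadata.name of the ConfigMap
    resolveKey: be.conf
  persistentVolumes:
  - mountPath: /opt/apache-doris/be/storage1
    name: storage2
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
  - mountPath: /opt/apache-doris/be/storage2
    name: storage3
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
---
# ConfigMap fragment: one storage_root_path entry per data mount path above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: be-configmap
data:
  be.conf: |
    storage_root_path = /opt/apache-doris/be/storage1,medium:ssd;/opt/apache-doris/be/storage2,medium:ssd
```

The log volume is deliberately left out of storage_root_path, since only data disks belong there.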