Doris On ES

Doris-On-ES combines Doris’s distributed query planning capability with ES (Elastic search)’s full-text search capability to provide a more complete OLAP scenario solution:

  1. Multi-index Distributed Join Query in ES
  2. Joint Query of Tables in Doris and ES, More Complex Full-Text Retrieval and Filtering
  3. Aggregated queries for fields of ES keyword type: suitable for frequent changes in index, tens of millions or more of single fragmented documents, and the cardinality of the field is very large

This document mainly introduces the realization principle and usage of this function.

Noun Interpretation

  • FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access.
  • BE: Backend, Doris’s back-end node. Responsible for query execution and data storage.
  • Elastic search (ES): The most popular open source distributed search engine.
  • DataNode: The data storage and computing node of ES.
  • MasterNode: The Master node of ES, which manages metadata, nodes, data distribution, etc.
  • scroll: The built-in data set cursor feature of ES for streaming scanning and filtering of data.

How to use it

Create appearance

  1. CREATE EXTERNAL TABLE `es_table` (
  2. `id` bigint(20) COMMENT "",
  3. `k1` bigint(20) COMMENT "",
  4. `k2` datetime COMMENT "",
  5. `k3` varchar(20) COMMENT "",
  6. `k4` varchar(100) COMMENT "",
  7. `k5` float COMMENT ""
  8. ) ENGINE=ELASTICSEARCH
  9. PARTITION BY RANGE(`id`)
  10. ()
  11. PROPERTIES (
  12. "hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
  13. "user" = "root",
  14. "password" = "root",
  15. "index" = "tindex”,
  16. "type" = "doc"
  17. );

Description of parameters:

Parameterdescription
HostsES Cluster Connection Address, which can specify one or more, through which Doris obtains the share distribution information of ES version number and index
UserOpen the user name of the ES cluster authenticated by basic, you need to ensure that the user has access to: / cluster / state / nodes / HTTP and other path permissions and read permissions for index
Passwordcorresponding user’s password information
The index name of the ES corresponding to the table in indexDoris can be alias
TypeSpecifies the type of index, defaulting to _doc
TransportInternal reservation, default to http

Query

Basic Conditions Filtration

  1. select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'

Extended esquery SQL grammar

The first column name parameter of esquery is used to associate index, the second parameter is the JSON expression of the basic Query DSL, and the curly bracket {} is used to include root of json. There is and can only be one key of json, such as mat. Ch, geo_shape, bool, etc.

Match query:

  1. select * from es_table where esquery(k4, '{
  2. "match": {
  3. "k4": "doris on elasticsearch"
  4. }
  5. }');

Geo related queries:

  1. select * from es_table where esquery(k4, '{
  2. "geo_shape": {
  3. "location": {
  4. "shape": {
  5. "type": "envelope",
  6. "coordinates": [
  7. [
  8. 13,
  9. 53
  10. ],
  11. [
  12. 14,
  13. 52
  14. ]
  15. ]
  16. },
  17. "relation": "within"
  18. }
  19. }
  20. }');

Bool query:

  1. select * from es_table where esquery(k4, ' {
  2. "bool": {
  3. "must": [
  4. {
  5. "terms": {
  6. "k1": [
  7. 11,
  8. 12
  9. ]
  10. }
  11. },
  12. {
  13. "terms": {
  14. "k2": [
  15. 100
  16. ]
  17. }
  18. }
  19. ]
  20. }
  21. }');

Principle

  1. +----------------------------------------------+
  2. | |
  3. | Doris +------------------+ |
  4. | | FE +--------------+-------+
  5. | | | Request Shard Location
  6. | +--+-------------+-+ | |
  7. | ^ ^ | |
  8. | | | | |
  9. | +-------------------+ +------------------+ | |
  10. | | | | | | | | |
  11. | | +----------+----+ | | +--+-----------+ | | |
  12. | | | BE | | | | BE | | | |
  13. | | +---------------+ | | +--------------+ | | |
  14. +----------------------------------------------+ |
  15. | | | | | | |
  16. | | | | | | |
  17. | HTTP SCROLL | | HTTP SCROLL | |
  18. +-----------+---------------------+------------+ |
  19. | | v | | v | | |
  20. | | +------+--------+ | | +------+-------+ | | |
  21. | | | | | | | | | | |
  22. | | | DataNode | | | | DataNode +<-----------+
  23. | | | | | | | | | | |
  24. | | | +<--------------------------------+
  25. | | +---------------+ | | |--------------| | | |
  26. | +-------------------+ +------------------+ | |
  27. | Same Physical Node | |
  28. | | |
  29. | +-----------------------+ | |
  30. | | | | |
  31. | | MasterNode +<-----------------+
  32. | ES | | |
  33. | +-----------------------+ |
  34. +----------------------------------------------+
  1. After the ES appearance is created, FE requests the host specified by the table to obtain HTTP port information of all nodes and share distribution information of index. If the request fails, it will traverse the host list sequentially until it succeeds or fails completely.

  2. When querying, the query plan will be generated and sent to the corresponding BE node according to some node information obtained by FE and metadata information of index.

  3. The BE node requests locally deployed ES nodes in accordance with the proximity principle. The BE receives data concurrently from each fragment of ES index in the HTTP Scroll mode.

  4. After calculating the result, return it to client

Push-Down operations

An important function of Doris On Elastic search is to push down filtering conditions: push ES under filtering conditions, so that only data that really meets the conditions can be returned, which can significantly improve query performance and reduce the CPU, memory and IO utilization of Doris and Elastic search.

The following operators are optimized to push down filters as follows:

SQL syntaxES 5.x+ syntax
=term query
interms query
> , < , >= , ⇐range
andbool.filter
orbool.should
notbool.must_not
not inbool.must_not + terms
esqueryES Query DSL

Other notes

  1. ES Version Requirements

    The main version of ES is larger than 5. The scanning mode of ES data before 2. X and after 5. x is different. At present, the scanning mode of ES data after 5. x is supported.

  2. Does ES Cluster Support X-Pack Authentication

    Support all ES clusters using HTTP Basic authentication

  3. Some queries are much slower than requesting ES

    Yes, for example, query related to _count, etc., the ES internal will directly read the number of documents that meet the requirements of the relevant metadata, without the need to filter the real data.