local

Name

local

Description

The local table-valued function (tvf) lets users read local files on a BE node as if they were relational tables. The supported file formats are csv/csv_with_names/csv_with_names_and_types/json/parquet/orc.

Using this function requires the ADMIN privilege.

syntax

  local(
      "file_path" = "path/to/file.txt",
      "backend_id" = "be_id",
      "format" = "csv",
      "keyn" = "valuen"
      ...
  );

parameter description

  • Parameters for accessing a local file on a BE node:

    • file_path:

      (required) The path of the file to read, relative to the user_files_secure_path directory. The user_files_secure_path parameter is configured on the BE.

      The path must not contain `..`. Glob syntax can be used to match multiple files, for example log/*.log.

  • Related to execution method:

    In versions prior to 2.1.1, Doris only supported specifying a BE node to read local data files on that node.

    • backend_id:

      The ID of the BE where the file is located. `backend_id` can be obtained through the `show backends` command.

      Starting from version 2.1.2, Doris adds a new parameter shared_storage.

    • shared_storage

      Default is false. If true, the specified file is located on shared storage (such as NAS). The shared storage must be compatible with the POSIX file interface and mounted on all BE nodes at the same time.

      When shared_storage is true, backend_id does not need to be set, and Doris may use all BE nodes for data access. If backend_id is set, the query still runs only on the specified BE node.

  • File format parameters:

    • format: (required) Currently supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc.
    • column_separator: (optional) Default is `,`.
    • line_delimiter: (optional) Default is `\n`.
    • compress_type: (optional) Currently supports UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK. Default is UNKNOWN, in which case the compression type is inferred automatically from the suffix of the file path.
  • The following parameters are used for loading in json format. For specific usage methods, please refer to: Json Load

    • read_json_by_line: (optional) default "true"
    • strip_outer_array: (optional) default "false"
    • json_root: (optional) default ""
    • json_paths: (optional) default ""
    • num_as_string: (optional) default false
    • fuzzy_parse: (optional) default false
  • The following parameters are used for loading in csv format

    • trim_double_quotes: Boolean type (optional), default false. If true, the outermost double quotes of each field in the csv file are trimmed.
    • skip_lines: Integer type (optional), default 0. The number of lines to skip at the head of the csv file. This parameter has no effect when the format is csv_with_names or csv_with_names_and_types.
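
The format-specific parameters above can be combined in a single call. The following is a minimal sketch, not a tested recipe: the file names scores.csv and events.json, their contents, and the backend id 10003 are assumptions chosen for illustration.

```sql
-- Hypothetical file scores.csv: pipe-separated, fields wrapped in double quotes,
-- with two leading lines that should be skipped.
select * from local(
    "file_path" = "scores.csv",
    "backend_id" = "10003",
    "format" = "csv",
    "column_separator" = "|",
    "trim_double_quotes" = "true",
    "skip_lines" = "2");

-- Hypothetical file events.json: one JSON object per line,
-- which matches the default "read_json_by_line" = "true".
select * from local(
    "file_path" = "events.json",
    "backend_id" = "10003",
    "format" = "json",
    "read_json_by_line" = "true");
```

Note that skip_lines is only meaningful with the plain csv format, so the first query does not use csv_with_names.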

Examples

Analyze the log file on a specified BE:

  mysql> select * from local(
      "file_path" = "log/be.out",
      "backend_id" = "10006",
      "format" = "csv")
      where c1 like "%start_time%" limit 10;
  +--------------------------------------------------------+
  | c1 |
  +--------------------------------------------------------+
  | start time: 2023 08 07 星期一 23:20:32 CST |
  | start time: 2023 08 07 星期一 23:32:10 CST |
  | start time: 2023 08 08 星期二 00:20:50 CST |
  | start time: 2023 08 08 星期二 00:29:15 CST |
  +--------------------------------------------------------+
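
The glob syntax described for file_path can be used in the same way to scan several files on one BE. This is a sketch under assumptions: it presumes that files matching log/*.log exist under user_files_secure_path on that BE, that they parse as csv like be.out does, and that 10006 is a valid backend id in your cluster.

```sql
-- Count the lines across all matching log files on a single BE.
-- 10006 is a placeholder; look up real ids with `show backends`.
select count(*) from local(
    "file_path" = "log/*.log",
    "backend_id" = "10006",
    "format" = "csv");
```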

Read a csv file located at ${DORIS_HOME}/student.csv:

  mysql> select * from local(
      "file_path" = "student.csv",
      "backend_id" = "10003",
      "format" = "csv");
  +------+---------+--------+
  | c1   | c2      | c3     |
  +------+---------+--------+
  | 1    | alice   | 18     |
  | 2    | bob     | 20     |
  | 3    | jack    | 24     |
  | 4    | jackson | 19     |
  | 5    | liming  | d18    |
  +------+---------+--------+

Query files on NAS:

  mysql> select * from local(
      "file_path" = "/mnt/doris/prefix_*.txt",
      "format" = "csv",
      "column_separator" = ",",
      "shared_storage" = "true");
  +------+------+------+
  | c1   | c2   | c3   |
  +------+------+------+
  | 1    | 2    | 3    |
  | 1    | 2    | 3    |
  | 1    | 2    | 3    |
  | 1    | 2    | 3    |
  | 1    | 2    | 3    |
  +------+------+------+
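
As noted in the parameter description, setting backend_id together with shared_storage keeps execution on a single BE. The sketch below reuses the NAS path from the query above; the backend id 10003 is a placeholder that should be replaced with an id reported by `show backends`.

```sql
-- List the BE nodes to find a backend id.
show backends;

-- Same NAS files, but pinned to one BE instead of being distributed.
select * from local(
    "file_path" = "/mnt/doris/prefix_*.txt",
    "backend_id" = "10003",
    "format" = "csv",
    "column_separator" = ",",
    "shared_storage" = "true");
```

Pinning a query this way can help when checking whether the mount on one particular node is readable.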

It can also be used with desc function:

  mysql> desc function local(
      "file_path" = "student.csv",
      "backend_id" = "10003",
      "format" = "csv");
  +-------+------+------+-------+---------+-------+
  | Field | Type | Null | Key   | Default | Extra |
  +-------+------+------+-------+---------+-------+
  | c1    | TEXT | Yes  | false | NULL    | NONE  |
  | c2    | TEXT | Yes  | false | NULL    | NONE  |
  | c3    | TEXT | Yes  | false | NULL    | NONE  |
  +-------+------+------+-------+---------+-------+

Keywords

local, table-valued-function, tvf

Best Practice

  • For more detailed usage of the local tvf, please refer to S3 tvf. The only difference between them is how the storage system is accessed.

  • Access data on NAS through local tvf

    NAS shared storage can be mounted on multiple nodes at the same time, and each node can access files in the shared storage just like local files. Therefore, NAS can be treated as a local file system and accessed through the local tvf.

    When "shared_storage" = "true" is set, Doris assumes that the specified file can be accessed from any BE node. When a set of files is specified using wildcards, Doris distributes the file access requests across multiple BE nodes, so that multiple nodes scan the files in parallel and query performance improves.