Virtual columns

Virtual columns - 图1info

Apache Druid supports two query languages: Druid SQL and native queries. This document describes the native language. For information about functions available in SQL, refer to the SQL documentation.

Virtual columns are queryable column “views” created from a set of columns during a query.

A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column.

Virtual columns can be referenced by their output names to be used as dimensions or as inputs to filters and aggregators.

Each Apache Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example:

  1. {
  2. "queryType": "scan",
  3. "dataSource": "page_data",
  4. "columns":[],
  5. "virtualColumns": [
  6. {
  7. "type": "expression",
  8. "name": "fooPage",
  9. "expression": "concat('foo' + page)",
  10. "outputType": "STRING"
  11. },
  12. {
  13. "type": "expression",
  14. "name": "tripleWordCount",
  15. "expression": "wordCount * 3",
  16. "outputType": "LONG"
  17. }
  18. ],
  19. "intervals": [
  20. "2013-01-01/2019-01-02"
  21. ]
  22. }

Virtual column types

Expression virtual column

Expression virtual columns use Druid’s native expression system to allow defining query time transforms of inputs from one or more columns.

The expression virtual column has the following syntax:

  1. {
  2. "type": "expression",
  3. "name": <name of the virtual column>,
  4. "expression": <row expression>,
  5. "outputType": <output value type of expression>
  6. }
propertydescriptionrequired?
typeMust be “expression” to indicate that this is an expression virtual column.yes
nameThe name of the virtual column.yes
expressionAn expression that takes a row as input and outputs a value for the virtual column.yes
outputTypeThe expression’s output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types.no, default is FLOAT

Nested field virtual column

The nested field virtual column is an optimized virtual column that can provide direct access into various paths of a COMPLEX<json> column, including using their indexes.

This virtual column is used for the SQL operators JSON_VALUE (if processFromRaw is set to false) or JSON_QUERY (if processFromRaw is true), and accepts ‘JSONPath’ or ‘jq’ syntax string representations of paths, or a parsed list of “path parts” in order to determine what should be selected from the column.

You can define a nested field virtual column with any of the following equivalent syntaxes. The examples all produce the same output value, with each example showing a different way to specify how to access the nested value. The first is using JSONPath syntax path, the second with a jq path, and the third uses pathParts.

  1. {
  2. "type": "nested-field",
  3. "columnName": "shipTo",
  4. "outputName": "v0",
  5. "expectedType": "STRING",
  6. "path": "$.phoneNumbers[1].number"
  7. }
  1. {
  2. "type": "nested-field",
  3. "columnName": "shipTo",
  4. "outputName": "v1",
  5. "expectedType": "STRING",
  6. "path": ".phoneNumbers[1].number",
  7. "useJqSyntax": true
  8. }
  1. {
  2. "type": "nested-field",
  3. "columnName": "shipTo",
  4. "outputName": "v2",
  5. "expectedType": "STRING",
  6. "pathParts": [
  7. {
  8. "type": "field",
  9. "field": "phoneNumbers"
  10. },
  11. {
  12. "type": "arrayElement",
  13. "index": 1
  14. },
  15. {
  16. "type": "field",
  17. "field": "number"
  18. }
  19. ]
  20. }
propertydescriptionrequired?
typeMust be “nested-field” to indicate that this is a nested field virtual column.yes
columnNameThe name of the COMPLEX<json> input column.yes
outputNameThe name of the virtual column.yes
expectedTypeThe native Druid output type of the column, Druid will coerce output to this type if it does not match the underlying data. This can be STRING, LONG, FLOAT, DOUBLE, or COMPLEX<json>. Extracting ARRAY types is not yet supported.no, default STRING
pathPartsThe parsed path parts used to locate the nested values. path will be translated into pathParts internally. One of path or pathParts must be setno, if path is defined
processFromRawIf set to true, the virtual column will process the “raw” JSON data to extract values rather than using an optimized “literal” value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a COMPLEX<json> at the cost of much slower performance.no, default false
path‘JSONPath’ (or ‘jq’) syntax path. One of path or pathParts must be set.no, if pathParts is defined
useJqSyntaxIf true, parse path using ‘jq’ syntax instead of ‘JSONPath’.no, default is false

Nested path part

Specify pathParts as an array of objects that describe each component of the path to traverse. Each object can take the following properties:

propertydescriptionrequired?
typeMust be ‘field’ or ‘arrayElement’. Use field when accessing a specific field in a nested structure. Use arrayElement when accessing a specific integer position of an array (zero based).yes
fieldThe name of the ‘field’ in a ‘field’ type path partyes, if type is ‘field’
indexThe array element index if type is arrayElementyes, if type is ‘arrayElement’

See Nested columns for more information on ingesting and storing nested data.

List filtered virtual column

This virtual column provides an alternative way to use ‘list filtered’ dimension spec as a virtual column. It has optimized access to the underlying column value indexes that can provide a small performance improvement in some cases.

  1. {
  2. "type": "mv-filtered",
  3. "name": "filteredDim3",
  4. "delegate": "dim3",
  5. "values": ["hello", "world"],
  6. "isAllowList": true
  7. }
propertydescriptionrequired?
typeMust be “mv-filtered” to indicate that this is a list filtered virtual column.yes
nameThe output name of the virtual columnyes
delegateThe name of the multi-value STRING input column to filteryes
valuesSet of STRING values to allow or denyyes
isAllowListIf true, the output of the virtual column will be limited to the set specified by values, else it will provide all values except those specified.No, default true