SELECT syntax

SELECT syntax

SELECT
FROM
VIEW (INDEX)
WITH
WHERE
ORDER BY
LIMIT and OFFSET
ASSUME ORDER BY
TABLESAMPLE and SAMPLE
DISTINCT
SELECT execution procedure
Column order in YQL
UNION ALL
COMMIT
WITHOUT
FROM … SELECT …
FROM AS_TABLE

SELECT

Returns the result of evaluating the expressions specified after SELECT.

It can be used in combination with other operations to obtain other effect.

Examples:

SELECT "Hello, world!";

SELECT 2 + 2;

FROM

Data source for SELECT. The argument can accept the table name, the result of another SELECT, or a named expression. Between SELECT and FROM, list the comma-separated column names from the source (or * to select all columns).

Examples

SELECT key FROM my_table;

SELECT * FROM
  (SELECT value FROM my_table);

$table_name = "my_table";
SELECT * FROM $table_name;

VIEW (INDEX)

To make a SELECT by secondary index statement, use the following:

SELECT *
    FROM TableName VIEW IndexName
    WHERE …

Examples

Select all the fields from the series table using the views_index index with the views >=someValue criteria:

SELECT series_id, title, info, release_date, views, uploaded_user_id
    FROM series VIEW views_index
    WHERE views >= someValue

JOIN the series and users tables on the userName field using the users_index and name_index indexes, respectively:

SELECT t1.series_id, t1.title
    FROM series VIEW users_index AS t1
    INNER JOIN users VIEW name_index AS t2
    ON t1.uploaded_user_id == t2.user_id
    WHERE t2.name == userName;

WITH

It’s set after the data source in FROM and is used for additional hints for tables. You can’t use hints for subqueries and named expressions.

The following values are supported:

INFER_SCHEMA: Sets the flag for output of the table schema. The behavior is similar to the yt.inferSchema pragma, but for a specific data source. You can specify the number of rows to output (from 1 to 1000).
FORCE_INFER_SCHEMA: Sets the flag for table schema output. The behavior is similar to the yt.ForceInferSchema pragma, but for a specific data source. You can specify the number of rows to output (from 1 to 1000).
DIRECT_READ: Suppresses certain optimizers and enforces accessing table contents as is. The behavior is similar to the debug pragma DirectRead, but for a specific data source.
INLINE: Hints that the table contents is small and you need to use its in-memory view to process the query. The actual size of the table is not controlled in this case, and if it’s large, the query might fail with an out-of-memory error.
UNORDERED: Suppresses original table sorting.
XLOCK: Hints that you need to lock the table exclusively. It’s useful when you read a table at the stage of processing the query metaprogram, and then update its contents in the main query. Avoids data loss if an external process managed to change the table between executing a metaprogram phase and the main part of the query.
SCHEMA type: Hints that the specified table schema must be used entirely, ignoring the schema in the metadata.
COLUMNS type: Hints that the specified types should be used for columns whose names match the table’s column names in the metadata, as well as which columns are additionally present in the table.
IGNORETYPEV3, IGNORE_TYPE_V3: Sets the flag to ignore type_v3 types in the table. The behavior is similar to the yt.IgnoreTypeV3 pragma, but for a specific data source.

When setting the SCHEMA and COLUMNS hints, the type must be a structure.
If you use the SCHEMA hint, then with the table functions EACH, RANGE, LIKE, REGEXP, FILTER you can use an empty list of tables that is treated as an empty table with columns defined in the SCHEMA.

Examples:

SELECT key FROM my_table WITH INFER_SCHEMA;
SELECT key FROM my_table WITH FORCE_INFER_SCHEMA="42";

$s = (SELECT COUNT(*) FROM my_table WITH XLOCK);
INSERT INTO my_table WITH TRUNCATE
SELECT EvaluateExpr($s) AS a;

SELECT key, value FROM my_table WITH SCHEMA Struct<key:String, value:Int32>;

SELECT key, value FROM my_table WITH COLUMNS Struct<value:Int32?>;

SELECT key, value FROM EACH($my_tables) WITH SCHEMA Struct<key:String, value:List<Int32>>;

WHERE

Filtering rows in the SELECT result based on a condition.

Example

SELECT key FROM my_table
WHERE value > 0;

ORDER BY

Sorting the SELECT result using a comma-separated list of sorting criteria. As a criteria, you can use a column value or an expression on columns. Ordering by column sequence number is not supported (ORDER BY Nwhere N is a number).

Each criteria can be followed by the sorting direction:

ASC: Sorting in the ascending order. Applied by default.
DESC: Sorting in the descending order.

Multiple sorting criteria will be applied left-to-right.

Example

SELECT key, string_column
FROM my_table
ORDER BY key DESC, LENGTH(string_column) ASC;

You can also use ORDER BY for window functions.

LIMIT and OFFSET

LIMIT: limits the output to the specified number of rows. By default, the output is not restricted.

OFFSET: specifies the offset from the beginning (in rows). By default, it’s zero.

Examples

SELECT key FROM my_table
LIMIT 7;

SELECT key FROM my_table
LIMIT 7 OFFSET 3;

SELECT key FROM my_table
LIMIT 3, 7; -- equivalent to the previous example

ASSUME ORDER BY

Checking that the SELECT result is sorted by the value in the specified column or multiple columns. The result of such a SELECT statement is treated as sorted, but without actually running a sort. Sort check is performed at the query execution stage.

As in case of ORDER BY, it supports setting the sort order using the keywords ASC (ascending order) and DESC (descending order). Expressions are not supported in ASSUME ORDER BY.

Examples:

SELECT key || "suffix" as key, -CAST(subkey as Int32) as subkey
FROM my_table
ASSUME ORDER BY key, subkey DESC;

TABLESAMPLE and SAMPLE

Building a random sample from the data source specified in FROM.

TABLESAMPLE is part of the SQL standard and works as follows:

The operating mode is specified:
- BERNOULLI means “slowly, straightforwardly going through all the data, but in a truly random way”.
- SYSTEM uses knowledge about the physical data storage of data to avoid full data scans, but somewhat sacrificing randomness of the sample.
  The data is split into sufficiently large blocks, and the whole data blocks are sampled. For applied calculations on sufficiently large tables, the result may well be consistent.
The size of the random sample is indicated as a percentage after the operating mode, in parentheses.
To manage the block size in the SYSTEM mode, use the yt.SamplingIoBlockSize pragma.
Optionally, it can be followed by the REPEATABLE keyword and an integer in parentheses to be used as a seed for a pseudorandom number generator.

SAMPLE is a shorter alias without sophisticated settings and sample size specified as a fraction. It currently corresponds to the BERNOULLI mode.

Note

In the BERNOULLI mode, if the REPEATABLE keyword is added, the seed is mixed with the chunk ID for each chunk in the table. That’s why sampling from different tables with the same content might produce different results.

Examples:

SELECT *
FROM my_table
TABLESAMPLE BERNOULLI(1.0) REPEATABLE(123); -- one percent of the table

SELECT *
FROM my_table
TABLESAMPLE SYSTEM(1.0); -- about one percent of the table

SELECT *
FROM my_table
SAMPLE 1.0 / 3; -- one-third of the table

DISTINCT

Selecting unique rows.

Note

Applying DISTINCT to calculated values is not currently implemented. For this purpose, use a subquery or the clause GROUP BY … AS ….

Example

SELECT DISTINCT value -- only unique values from the table
FROM my_table;

The DISTINCT keyword can also be used to apply aggregate functions only to distinct values. For more information, see the documentation for GROUP BY.

SELECT execution procedure

The SELECT query result is calculated as follows:

Determine the set of input tables by evaluating the FROM clauses.
Apply SAMPLE/TABLESAMPLE to input tables.
Execute FLATTEN COLUMNS or FLATTEN BY; aliases set in FLATTEN BY become visible after this point.
Execute every JOIN.
Add to (or replace in) the data the columns listed in GROUP BY … AS ….
Execute WHERE — Discard all the data mismatching the predicate.
Execute GROUP BY, evaluate aggregate functions.
Apply the filter HAVING.
Evaluate window functions;
Evaluate expressions in SELECT.
Assign names set by aliases to expressions in SELECT.
Apply top-level DISTINCT to the resulting columns.
Execute similarly every subquery inside UNION ALL, combine them (see PRAGMA AnsiOrderByLimitInUnionAll).
Perform sorting with ORDER BY.
Apply OFFSET and LIMIT to the result.

Column order in YQL

The standard SQL is sensitive to the order of columns in projections (that is, in SELECT). While the order of columns must be preserved in the query results or when writing data to a new table, some SQL constructs use this order.
This applies, for example, to UNION ALL and positional ORDER BY (ORDER BY ordinal).

The column order is ignored in YQL by default:

The order of columns in the output tables and query results is undefined
The data scheme of the UNION ALL result is output by column names rather than positions

If you enable PRAGMA OrderedColumns;, the order of columns is preserved in the query results and is derived from the order of columns in the input tables using the following rules:

SELECT: an explicit column enumeration dictates the result order.
SELECT with an asterisk (SELECT * FROM ...) inherits the order from its input.
The order of columns after JOIN: First output the left-hand columns, then the right-hand ones. If the column order in any of the sides in the JOIN output is undefined, the column order in the result is also undefined.
The order in UNION ALL depends on the UNION ALL execution mode.
The column order for AS_TABLE is undefined.

Warning

In the YT table schema, key columns always precede non-key columns. The order of key columns is determined by the order of the composite key.
When PRAGMA OrderedColumns; is enabled, non-key columns preserve their output order.

UNION ALL

Concatenating results of multiple SELECT statements (or subqueries).

Two UNION ALL modes are supported: by column names (the default mode) and by column positions (corresponds to the ANSI SQL standard and is enabled by the PRAGMA).

In the “by name” mode, the output of the resulting data schema uses the following rules:

The resulting table includes all columns that were found in at least one of the input tables.
If a column wasn’t present in all the input tables, then it’s automatically assigned the optional data type (that can accept NULL).
If a column in different input tables had different types, then the shared type (the broadest one) is output.
If a column in different input tables had a heterogeneous type, for example, string and numeric, an error is raised.

The order of output columns in this mode is equal to the largest common prefix of the order of inputs, followed by all other columns in the alphabetic order.
If the largest common prefix is empty (for example, if the order isn’t specified for one of the inputs), then the output order is undefined.

In the “by position” mode, the output of the resulting data schema uses the following rules:

All inputs must have equal number of columns
The order of columns must be defined for all inputs
The names of the resulting columns must match the names of columns in the first table
The type of the resulting columns is output as a common (widest) type of input column types having the same positions

The order of the output columns in this mode is the same as the order of columns in the first input.

Examples

SELECT 1 AS x
UNION ALL
SELECT 2 AS y
UNION ALL
SELECT 3 AS z;

In the default mode, this query returns a selection with three columns x, y, and z. When PRAGMA PositionalUnionAll; is enabled, the selection only includes the x column.

PRAGMA PositionalUnionAll;
SELECT 1 AS x, 2 as y
UNION ALL
SELECT * FROM AS_TABLE([<|x:3, y:4|>]); -- error: the order of columns in AS_TABLE is undefined

COMMIT

By default, the entire YQL query is executed within a single transaction, and independent parts inside it are executed in parallel, if possible.
Using the COMMIT; keyword you can add a barrier to the execution process to delay execution of expressions that follow until all the preceding expressions have completed.

To commit in the same way automatically after each expression in the query, you can use PRAGMA autocommit;.

Examples:

INSERT INTO result1 SELECT * FROM my_table;
INSERT INTO result2 SELECT * FROM my_table;
COMMIT;
-- result2 will already include the SELECT contents from the second line:
INSERT INTO result3 SELECT * FROM result2;

WITHOUT

Excluding columns from the result of SELECT *.

Examples

SELECT * WITHOUT foo, bar FROM my_table;

PRAGMA simplecolumns;
SELECT * WITHOUT t.foo FROM my_table AS t
CROSS JOIN (SELECT 1 AS foo) AS v;

FROM … SELECT …

An inverted format, first specifying the data source and then the operation.

Examples

FROM my_table SELECT key, value;

FROM a_table AS a
JOIN b_table AS b
USING (key)
SELECT *;

FROM AS_TABLE

Accessing named expressions as tables using the AS_TABLE function.

AS_TABLE($variable) lets you use the value of $variable as the data source for the query. In this case, the variable $variable must have the type List<Struct<...>>.

Example

$data = AsList(
    AsStruct(1u AS Key, "v1" AS Value),
    AsStruct(2u AS Key, "v2" AS Value),
    AsStruct(3u AS Key, "v3" AS Value));
SELECT Key, Value FROM AS_TABLE($data);