SELECT syntax
- SELECT
- FROM
- VIEW (INDEX)
- WITH
- WHERE
- ORDER BY
- LIMIT and OFFSET
- ASSUME ORDER BY
- TABLESAMPLE and SAMPLE
- DISTINCT
- SELECT execution procedure
- Column order in YQL
- UNION ALL
- COMMIT
- WITHOUT
- FROM … SELECT …
- FROM AS_TABLE
SELECT
Returns the result of evaluating the expressions specified after SELECT
.
It can be used in combination with other operations to obtain other effect.
Examples:
SELECT "Hello, world!";
SELECT 2 + 2;
FROM
Data source for SELECT
. The argument can accept the table name, the result of another SELECT
, or a named expression. Between SELECT
and FROM
, list the comma-separated column names from the source (or *
to select all columns).
Examples
SELECT key FROM my_table;
SELECT * FROM
(SELECT value FROM my_table);
$table_name = "my_table";
SELECT * FROM $table_name;
VIEW (INDEX)
To make a SELECT
by secondary index statement, use the following:
SELECT *
FROM TableName VIEW IndexName
WHERE …
Examples
Select all the fields from the
series
table using theviews_index
index with theviews >=someValue
criteria:SELECT series_id, title, info, release_date, views, uploaded_user_id
FROM series VIEW views_index
WHERE views >= someValue
JOIN the
series
andusers
tables on theuserName
field using theusers_index
andname_index
indexes, respectively:SELECT t1.series_id, t1.title
FROM series VIEW users_index AS t1
INNER JOIN users VIEW name_index AS t2
ON t1.uploaded_user_id == t2.user_id
WHERE t2.name == userName;
WITH
It’s set after the data source in FROM
and is used for additional hints for tables. You can’t use hints for subqueries and named expressions.
The following values are supported:
INFER_SCHEMA
: Sets the flag for output of the table schema. The behavior is similar to the yt.inferSchema pragma, but for a specific data source. You can specify the number of rows to output (from 1 to 1000).FORCE_INFER_SCHEMA
: Sets the flag for table schema output. The behavior is similar to the yt.ForceInferSchema pragma, but for a specific data source. You can specify the number of rows to output (from 1 to 1000).DIRECT_READ
: Suppresses certain optimizers and enforces accessing table contents as is. The behavior is similar to the debug pragma DirectRead, but for a specific data source.INLINE
: Hints that the table contents is small and you need to use its in-memory view to process the query. The actual size of the table is not controlled in this case, and if it’s large, the query might fail with an out-of-memory error.UNORDERED
: Suppresses original table sorting.XLOCK
: Hints that you need to lock the table exclusively. It’s useful when you read a table at the stage of processing the query metaprogram, and then update its contents in the main query. Avoids data loss if an external process managed to change the table between executing a metaprogram phase and the main part of the query.SCHEMA
type: Hints that the specified table schema must be used entirely, ignoring the schema in the metadata.COLUMNS
type: Hints that the specified types should be used for columns whose names match the table’s column names in the metadata, as well as which columns are additionally present in the table.IGNORETYPEV3
,IGNORE_TYPE_V3
: Sets the flag to ignore type_v3 types in the table. The behavior is similar to the yt.IgnoreTypeV3 pragma, but for a specific data source.
When setting the SCHEMA
and COLUMNS
hints, the type must be a structure.
If you use the SCHEMA
hint, then with the table functions EACH, RANGE, LIKE, REGEXP, FILTER you can use an empty list of tables that is treated as an empty table with columns defined in the SCHEMA
.
Examples:
SELECT key FROM my_table WITH INFER_SCHEMA;
SELECT key FROM my_table WITH FORCE_INFER_SCHEMA="42";
$s = (SELECT COUNT(*) FROM my_table WITH XLOCK);
INSERT INTO my_table WITH TRUNCATE
SELECT EvaluateExpr($s) AS a;
SELECT key, value FROM my_table WITH SCHEMA Struct<key:String, value:Int32>;
SELECT key, value FROM my_table WITH COLUMNS Struct<value:Int32?>;
SELECT key, value FROM EACH($my_tables) WITH SCHEMA Struct<key:String, value:List<Int32>>;
WHERE
Filtering rows in the SELECT
result based on a condition.
Example
SELECT key FROM my_table
WHERE value > 0;
ORDER BY
Sorting the SELECT
result using a comma-separated list of sorting criteria. As a criteria, you can use a column value or an expression on columns. Ordering by column sequence number is not supported (ORDER BY N
where N
is a number).
Each criteria can be followed by the sorting direction:
ASC
: Sorting in the ascending order. Applied by default.DESC
: Sorting in the descending order.
Multiple sorting criteria will be applied left-to-right.
Example
SELECT key, string_column
FROM my_table
ORDER BY key DESC, LENGTH(string_column) ASC;
You can also use ORDER BY
for window functions.
LIMIT and OFFSET
LIMIT
: limits the output to the specified number of rows. By default, the output is not restricted.
OFFSET
: specifies the offset from the beginning (in rows). By default, it’s zero.
Examples
SELECT key FROM my_table
LIMIT 7;
SELECT key FROM my_table
LIMIT 7 OFFSET 3;
SELECT key FROM my_table
LIMIT 3, 7; -- equivalent to the previous example
ASSUME ORDER BY
Checking that the SELECT
result is sorted by the value in the specified column or multiple columns. The result of such a SELECT
statement is treated as sorted, but without actually running a sort. Sort check is performed at the query execution stage.
As in case of ORDER BY
, it supports setting the sort order using the keywords ASC
(ascending order) and DESC
(descending order). Expressions are not supported in ASSUME ORDER BY
.
Examples:
SELECT key || "suffix" as key, -CAST(subkey as Int32) as subkey
FROM my_table
ASSUME ORDER BY key, subkey DESC;
TABLESAMPLE and SAMPLE
Building a random sample from the data source specified in FROM
.
TABLESAMPLE
is part of the SQL standard and works as follows:
- The operating mode is specified:
BERNOULLI
means “slowly, straightforwardly going through all the data, but in a truly random way”.SYSTEM
uses knowledge about the physical data storage of data to avoid full data scans, but somewhat sacrificing randomness of the sample.
The data is split into sufficiently large blocks, and the whole data blocks are sampled. For applied calculations on sufficiently large tables, the result may well be consistent.
- The size of the random sample is indicated as a percentage after the operating mode, in parentheses.
- To manage the block size in the
SYSTEM
mode, use theyt.SamplingIoBlockSize
pragma. - Optionally, it can be followed by the
REPEATABLE
keyword and an integer in parentheses to be used as a seed for a pseudorandom number generator.
SAMPLE
is a shorter alias without sophisticated settings and sample size specified as a fraction. It currently corresponds to the BERNOULLI
mode.
Note
In the BERNOULLI
mode, if the REPEATABLE
keyword is added, the seed is mixed with the chunk ID for each chunk in the table. That’s why sampling from different tables with the same content might produce different results.
Examples:
SELECT *
FROM my_table
TABLESAMPLE BERNOULLI(1.0) REPEATABLE(123); -- one percent of the table
SELECT *
FROM my_table
TABLESAMPLE SYSTEM(1.0); -- about one percent of the table
SELECT *
FROM my_table
SAMPLE 1.0 / 3; -- one-third of the table
DISTINCT
Selecting unique rows.
Note
Applying DISTINCT
to calculated values is not currently implemented. For this purpose, use a subquery or the clause GROUP BY … AS ….
Example
SELECT DISTINCT value -- only unique values from the table
FROM my_table;
The DISTINCT
keyword can also be used to apply aggregate functions only to distinct values. For more information, see the documentation for GROUP BY.
SELECT execution procedure
The SELECT
query result is calculated as follows:
- Determine the set of input tables by evaluating the FROM clauses.
- Apply SAMPLE/TABLESAMPLE to input tables.
- Execute FLATTEN COLUMNS or FLATTEN BY; aliases set in
FLATTEN BY
become visible after this point. - Execute every JOIN.
- Add to (or replace in) the data the columns listed in GROUP BY … AS ….
- Execute WHERE — Discard all the data mismatching the predicate.
- Execute GROUP BY, evaluate aggregate functions.
- Apply the filter HAVING.
- Evaluate window functions;
- Evaluate expressions in
SELECT
. - Assign names set by aliases to expressions in
SELECT
. - Apply top-level DISTINCT to the resulting columns.
- Execute similarly every subquery inside UNION ALL, combine them (see PRAGMA AnsiOrderByLimitInUnionAll).
- Perform sorting with ORDER BY.
- Apply OFFSET and LIMIT to the result.
Column order in YQL
The standard SQL is sensitive to the order of columns in projections (that is, in SELECT
). While the order of columns must be preserved in the query results or when writing data to a new table, some SQL constructs use this order.
This applies, for example, to UNION ALL and positional ORDER BY (ORDER BY ordinal).
The column order is ignored in YQL by default:
- The order of columns in the output tables and query results is undefined
- The data scheme of the
UNION ALL
result is output by column names rather than positions
If you enable PRAGMA OrderedColumns;
, the order of columns is preserved in the query results and is derived from the order of columns in the input tables using the following rules:
SELECT
: an explicit column enumeration dictates the result order.SELECT
with an asterisk (SELECT * FROM ...
) inherits the order from its input.- The order of columns after JOIN: First output the left-hand columns, then the right-hand ones. If the column order in any of the sides in the
JOIN
output is undefined, the column order in the result is also undefined. - The order in
UNION ALL
depends on the UNION ALL execution mode. - The column order for AS_TABLE is undefined.
Warning
In the YT table schema, key columns always precede non-key columns. The order of key columns is determined by the order of the composite key.
When PRAGMA OrderedColumns;
is enabled, non-key columns preserve their output order.
UNION ALL
Concatenating results of multiple SELECT
statements (or subqueries).
Two UNION ALL
modes are supported: by column names (the default mode) and by column positions (corresponds to the ANSI SQL standard and is enabled by the PRAGMA).
In the “by name” mode, the output of the resulting data schema uses the following rules:
- The resulting table includes all columns that were found in at least one of the input tables.
- If a column wasn’t present in all the input tables, then it’s automatically assigned the optional data type (that can accept
NULL
). - If a column in different input tables had different types, then the shared type (the broadest one) is output.
- If a column in different input tables had a heterogeneous type, for example, string and numeric, an error is raised.
The order of output columns in this mode is equal to the largest common prefix of the order of inputs, followed by all other columns in the alphabetic order.
If the largest common prefix is empty (for example, if the order isn’t specified for one of the inputs), then the output order is undefined.
In the “by position” mode, the output of the resulting data schema uses the following rules:
- All inputs must have equal number of columns
- The order of columns must be defined for all inputs
- The names of the resulting columns must match the names of columns in the first table
- The type of the resulting columns is output as a common (widest) type of input column types having the same positions
The order of the output columns in this mode is the same as the order of columns in the first input.
Examples
SELECT 1 AS x
UNION ALL
SELECT 2 AS y
UNION ALL
SELECT 3 AS z;
In the default mode, this query returns a selection with three columns x, y, and z. When PRAGMA PositionalUnionAll;
is enabled, the selection only includes the x column.
PRAGMA PositionalUnionAll;
SELECT 1 AS x, 2 as y
UNION ALL
SELECT * FROM AS_TABLE([<|x:3, y:4|>]); -- error: the order of columns in AS_TABLE is undefined
COMMIT
By default, the entire YQL query is executed within a single transaction, and independent parts inside it are executed in parallel, if possible.
Using the COMMIT;
keyword you can add a barrier to the execution process to delay execution of expressions that follow until all the preceding expressions have completed.
To commit in the same way automatically after each expression in the query, you can use PRAGMA autocommit;
.
Examples:
INSERT INTO result1 SELECT * FROM my_table;
INSERT INTO result2 SELECT * FROM my_table;
COMMIT;
-- result2 will already include the SELECT contents from the second line:
INSERT INTO result3 SELECT * FROM result2;
WITHOUT
Excluding columns from the result of SELECT *
.
Examples
SELECT * WITHOUT foo, bar FROM my_table;
PRAGMA simplecolumns;
SELECT * WITHOUT t.foo FROM my_table AS t
CROSS JOIN (SELECT 1 AS foo) AS v;
FROM … SELECT …
An inverted format, first specifying the data source and then the operation.
Examples
FROM my_table SELECT key, value;
FROM a_table AS a
JOIN b_table AS b
USING (key)
SELECT *;
FROM AS_TABLE
Accessing named expressions as tables using the AS_TABLE
function.
AS_TABLE($variable)
lets you use the value of $variable
as the data source for the query. In this case, the variable $variable
must have the type List<Struct<...>>
.
Example
$data = AsList(
AsStruct(1u AS Key, "v1" AS Value),
AsStruct(2u AS Key, "v2" AS Value),
AsStruct(3u AS Key, "v3" AS Value));
SELECT Key, Value FROM AS_TABLE($data);