Manipulation - SELECT - 《Apache Doris v2.1 Documentation》

SELECT

SELECT

Name

SELECT

description

Mainly introduces the use of Select syntax

grammar:

SELECT
    [hint_statement, ...]
    [ALL | DISTINCT | DISTINCTROW | ALL EXCEPT ( col_name1 [, col_name2, col_name3, ...] )]
    select_expr [, select_expr ...]
    [FROM table_references
      [PARTITION partition_list]
      [TABLET tabletid_list]
      [TABLESAMPLE sample_value [ROWS | PERCENT]
        [REPEATABLE pos_seek]]
    [WHERE where_condition]
    [GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}]
    [HAVING where_condition]
    [ORDER BY {col_name | expr | position}
      [ASC | DESC], ...]
    [LIMIT {[offset,] row_count | row_count OFFSET offset}]
    [INTO OUTFILE 'file_name']

Syntax Description:
1. select_expr, … Columns retrieved and displayed in the result, when using an alias, as is optional.
2. select_expr, … Retrieved target table (one or more tables (including temporary tables generated by subqueries)
3. where_definition retrieves the condition (expression), if there is a WHERE clause, the condition filters the row data. where_condition is an expression that evaluates to true for each row to be selected. Without the WHERE clause, the statement selects all rows. In WHERE expressions, you can use any MySQL supported functions and operators except aggregate functions
4. ALL | DISTINCT : to refresh the result set, all is all, distinct/distinctrow will refresh the duplicate columns, the default is all
5. SinceVersion 1.2 `ALL EXCEPT`: Filter on the full (all) result set, except specifies the name of one or more columns to be excluded from the full result set. All matching column names will be ignored in the output.
6. INTO OUTFILE 'file_name' : save the result to a new file (which did not exist before), the difference lies in the save format.
7. Group by having: Group the result set, and brush the result of group by when having appears. Grouping Sets, Rollup, Cube are extensions of group by, please refer to GROUPING SETS DESIGN for details.
8. Order by: Sort the final result, Order by sorts the result set by comparing the size of one or more columns.
  
  Order by is a time-consuming and resource-intensive operation, because all data needs to be sent to 1 node before it can be sorted, and the sorting operation requires more memory than the non-sorting operation.
  
  If you need to return the top N sorted results, you need to use the LIMIT clause; in order to limit memory usage, if the user does not specify the LIMIT clause, the first 65535 sorted results are returned by default.
9. Limit n: limit the number of lines in the output result, limit m,n means output n records starting from the mth line.You should use order by before you use limit m,n, otherwise the data may be inconsistent each time it is executed.
10. The Having clause does not filter the row data in the table, but filters the results produced by the aggregate function.
  
  Typically having is used with aggregate functions (eg :COUNT(), SUM(), AVG(), MIN(), MAX()) and group by clauses.
11. SELECT supports explicit partition selection using PARTITION containing a list of partitions or subpartitions (or both) following the name of the table in table_reference
12. [TABLET tids] TABLESAMPLE n [ROWS | PERCENT] [REPEATABLE seek]: Limit the number of rows read from the table in the FROM clause, select a number of Tablets pseudo-randomly from the table according to the specified number of rows or percentages, and specify the number of seeds in REPEATABLE to return the selected samples again. In addition, you can also manually specify the TableID, Note that this can only be used for OLAP tables.
13. hint_statement: hint in front of the selectlist indicates that hints can be used to influence the behavior of the optimizer in order to obtain the desired execution plan. Details refer to [joinHint using document] (https://doris.apache.org/en/docs/query-acceleration/hint/joinHint.md)

Syntax constraints:

SELECT can also be used to retrieve calculated rows without referencing any table.
All clauses must be ordered strictly according to the above format, and a HAVING clause must be placed after the GROUP BY clause and before the ORDER BY clause.
The alias keyword AS is optional. Aliases can be used for group by, order by and having
Where clause: The WHERE statement is executed to determine which rows should be included in the GROUP BY section, and HAVING is used to determine which rows in the result set should be used.
The HAVING clause can refer to the total function, but the WHERE clause cannot refer to, such as count, sum, max, min, avg, at the same time, the where clause can refer to other functions except the total function. Column aliases cannot be used in the Where clause to define conditions.
Group by followed by with rollup can count the results one or more times.

Join query:

Doris supports JOIN syntax

JOIN
table_references:
    table_reference [, table_reference] …
table_reference:
    table_factor
  | join_table
table_factor:
    tbl_name [[AS] alias]
        [{USE|IGNORE|FORCE} INDEX (key_list)]
  | ( table_references )
  | { OJ table_reference LEFT OUTER JOIN table_reference
        ON conditional_expr }
join_table:
    table_reference [INNER | CROSS] JOIN table_factor [join_condition]
  | table_reference LEFT [OUTER] JOIN table_reference join_condition
  | table_reference NATURAL [LEFT [OUTER]] JOIN table_factor
  | table_reference RIGHT [OUTER] JOIN table_reference join_condition
  | table_reference NATURAL [RIGHT [OUTER]] JOIN table_factor
join_condition:
    ON conditional_expr

UNION Grammar：

SELECT ...
UNION [ALL| DISTINCT] SELECT ......
[UNION [ALL| DISTINCT] SELECT ...]

UNION is used to combine the results of multiple SELECT statements into a single result set.

The column names in the first SELECT statement are used as the column names in the returned results. The selected columns listed in the corresponding position of each SELECT statement should have the same data type. (For example, the first column selected by the first statement should be of the same type as the first column selected by other statements.)

The default behavior of UNION is to remove duplicate rows from the result. The optional DISTINCT keyword has no effect other than the default, since it also specifies duplicate row removal. With the optional ALL keyword, no duplicate row removal occurs, and the result includes all matching rows in all SELECT statements

WITH statement:

To specify common table expressions, use the WITH clause with one or more comma-separated clauses. Each subclause provides a subquery that generates the result set and associates the name with the subquery. The following example defines WITH clauses in CTEs named cte1 and cte2, and refers to the WITH clause below their top-level SELECT:

WITH
  cte1 AS (SELECT a，b FROM table1),
  cte2 AS (SELECT c，d FROM table2)
SELECT b，d FROM cte1 JOIN cte2
WHERE cte1.a = cte2.c;

In a statement containing the WITH clause, each CTE name can be referenced to access the corresponding CTE result set.

CTE names can be referenced in other CTEs, allowing CTEs to be defined based on other CTEs.

Recursive CTE is currently not supported.

example

Query the names of students whose ages are 18, 20, 25

select Name from student where age in (18,20,25);

ALL EXCEPT Example

-- Query all information except the students' age
select * except(age) from student;

GROUP BY Example

--Query the tb_book table, group by type, and find the average price of each type of book,
select type,avg(price) from tb_book group by type;

DISTINCT Use

--Query the tb_book table to remove duplicate type data
select distinct type from tb_book;

ORDER BY Example

Sort query results in ascending (default) or descending (DESC) order. Ascending NULL is first, descending NULL is last

--Query all records in the tb_book table, sort them in descending order by id, and display three records
select * from tb_book order by id desc limit 3;

LIKE fuzzy query

Can realize fuzzy query, it has two wildcards: % and _, % can match one or more characters, _ can match one character
```
--Find all books whose second character is h
select * from tb_book where name like('_h%');
```

LIMIT limits the number of result rows

--1. Display 3 records in descending order
select * from tb_book order by price desc limit 3;
--2. Display 4 records from id=1
select * from tb_book where id limit 1,4;

CONCAT join multiple columns

--Combine name and price into a new string output
select id,concat(name,":",price) as info,type from tb_book;

Using functions and expressions

--Calculate the total price of various books in the tb_book table
select sum(price) as total,type from tb_book group by type;
--20% off price
select *,(price * 0.8) as "20%" from tb_book;

UNION Example

SELECT a FROM t1 WHERE a = 10 AND B = 1 ORDER by LIMIT 10
UNION
SELECT a FROM t2 WHERE a = 11 AND B = 2 ORDER by LIMIT 10;

WITH clause example

WITH cte AS
(
 SELECT 1 AS col1, 2 AS col2
 UNION ALL
 SELECT 3, 4
)
SELECT col1, col2 FROM cte;

JOIN Exampel

SELECT * FROM t1 LEFT JOIN (t2, t3, t4)
                ON (t2.a = t1.a AND t3.b = t1.b AND t4.c = t1.c)

Equivalent to

SELECT * FROM t1 LEFT JOIN (t2 CROSS JOIN t3 CROSS JOIN t4)
                ON (t2.a = t1.a AND t3.b = t1.b AND t4.c = t1.c)

INNER JOIN

SELECT t1.name, t2.salary
 FROM employee AS t1 INNER JOIN info AS t2 ON t1.name = t2.name;
SELECT t1.name, t2.salary
 FROM employee t1 INNER JOIN info t2 ON t1.name = t2.name;

LEFT JOIN

SELECT left_tbl.*
 FROM left_tbl LEFT JOIN right_tbl ON left_tbl.id = right_tbl.id
 WHERE right_tbl.id IS NULL;

RIGHT JOIN

mysql> SELECT * FROM t1 RIGHT JOIN t2 ON (t1.a = t2.a);
+------+------+------+------+
| a    | b    | a    | c    |
+------+------+------+------+
|    2 | y    |    2 | z    |
| NULL | NULL |    3 | w    |
+------+------+------+------+

TABLESAMPLE

--Pseudo-randomly sample 1000 rows in t1. Note that several Tablets are actually selected according to the statistics of the table, and the total number of selected Tablet rows may be greater than 1000, so if you want to explicitly return 1000 rows, you need to add Limit.
SELECT * FROM t1 TABLET(10001) TABLESAMPLE(1000 ROWS) REPEATABLE 2 limit 1000;

keywords

SELECT

Best Practice

ome additional knowledge about the SELECT clause
- An alias can be specified for select_expr using AS alias_name. Aliases are used as column names in expressions and can be used in GROUP BY, ORDER BY or HAVING clauses. The AS keyword is a good habit to use when specifying aliases for columns.
- table_references after FROM indicates one or more tables participating in the query. If more than one table is listed, a JOIN operation is performed. And for each specified table, you can define an alias for it
- The selected column after SELECT can be referenced in ORDER IN and GROUP BY by column name, column alias or integer (starting from 1) representing the column position
```
SELECT college, region, seed FROM tournament
  ORDER BY region, seed;
SELECT college, region AS r, seed AS s FROM tournament
  ORDER BY r, s;
SELECT college, region, seed FROM tournament
  ORDER BY 2, 3;
```
- If ORDER BY appears in a subquery and also applies to the outer query, the outermost ORDER BY takes precedence.
- If GROUP BY is used, the grouped columns are automatically sorted in ascending order (as if there was an ORDER BY statement followed by the same columns). If you want to avoid the overhead of GROUP BY due to automatic sorting, adding ORDER BY NULL can solve it:
```
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
```

When sorting columns in a SELECT using ORDER BY or GROUP BY, the server sorts values using only the initial number of bytes indicated by the max_sort_length system variable.
Having clauses are generally applied last, just before the result set is returned to the MySQL client, and is not optimized. (while LIMIT is applied after HAVING)

The SQL standard requires: HAVING must refer to a column in the GROUP BY list or used by an aggregate function. However, MySQL extends this by allowing HAVING to refer to columns in the Select clause list, as well as columns from outer subqueries.

A warning is generated if the column referenced by HAVING is ambiguous. In the following statement, col2 is ambiguous:
```
SELECT COUNT(col1) AS col2 FROM t GROUP BY col2 HAVING col2 = 2;
```
Remember not to use HAVING where WHERE should be used. HAVING is paired with GROUP BY.

The HAVING clause can refer to aggregate functions, while WHERE cannot.

SELECT user, MAX(salary) FROM users
  GROUP BY user HAVING MAX(salary) > 10;

The LIMIT clause can be used to constrain the number of rows returned by a SELECT statement. LIMIT can have one or two arguments, both of which must be non-negative integers.

/*Retrieve 6~15 rows in the result set*/
SELECT * FROM tbl LIMIT 5,10;
/*Then if you want to retrieve all rows after a certain offset is set, you can set a very large constant for the second parameter. The following query fetches all data from row 96 onwards */
SELECT * FROM tbl LIMIT 95,18446744073709551615;
/*If LIMIT has only one parameter, the parameter specifies the number of rows that should be retrieved, and the offset defaults to 0, that is, starting from the first row*/

SELECT…INTO allows query results to be written to a file

Modifiers of the SELECT keyword
- deduplication
  
  The ALL and DISTINCT modifiers specify whether to deduplicate rows in the result set (should not be a column).
  
  ALL is the default modifier, that is, all rows that meet the requirements are to be retrieved.
  
  DISTINCT removes duplicate rows.
The main advantage of subqueries
- Subqueries allow structured queries so that each part of a statement can be isolated.
- Some operations require complex unions and associations. Subqueries provide other ways to perform these operations
Speed up queries
- Use Doris’s partition and bucket as data filtering conditions as much as possible to reduce the scope of data scanning
- Make full use of Doris’s prefix index fields as data filter conditions to speed up query speed
UNION
- Using only the union keyword has the same effect as using union disitnct. Since the deduplication work is more memory-intensive, the query speed using the union all operation will be faster and the memory consumption will be less. If users want to perform order by and limit operations on the returned result set, they need to put the union operation in the subquery, then select from subquery, and finally put the subquery and order by outside the subquery.
```
select * from (select age from student_01 union all select age from student_02) as t1
order by age limit 4;
+-------------+
| age |
+-------------+
| 18 |
| 19 |
| 20 |
| 21 |
+-------------+
4 rows in set (0.01 sec)
```
JOIN
- In the inner join condition, in addition to supporting equal-valued joins, it also supports unequal-valued joins. For performance reasons, it is recommended to use equal-valued joins.
- Other joins only support equivalent joins