Quick Start Guide for TiDB HTAP

This guide walks you through the quickest way to get started with TiDB’s one-stop solution for Hybrid Transactional and Analytical Processing (HTAP).


Note

The steps provided in this guide are ONLY for a quick start in a test environment. For production environments, see Explore HTAP.

Basic concepts

Before using TiDB HTAP, you need to have some basic knowledge about TiKV, a row-based storage engine for TiDB Online Transactional Processing (OLTP), and TiFlash, a columnar storage engine for TiDB Online Analytical Processing (OLAP).

  • Storage engines of HTAP: The row-based storage engine and the columnar storage engine co-exist for HTAP. Both storage engines can replicate data automatically and keep strong consistency. The row-based storage engine optimizes OLTP performance, and the columnar storage engine optimizes OLAP performance.
  • Data consistency of HTAP: As a distributed and transactional key-value database, TiKV provides transactional interfaces with ACID compliance, and guarantees data consistency between multiple replicas and high availability with the implementation of the Raft consensus algorithm. As a columnar storage extension of TiKV, TiFlash replicates data from TiKV in real time according to the Raft Learner consensus algorithm, which ensures that data is strongly consistent between TiKV and TiFlash.
  • Data isolation of HTAP: TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.
  • MPP computing engine: MPP is a distributed computing framework provided by the TiFlash engine since TiDB 5.0, which allows data exchange between nodes and provides high-performance, high-throughput SQL algorithms. In the MPP mode, the run time of the analytic queries can be significantly reduced.
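
If you want to confirm or control whether your own session is allowed to use the MPP mode, the sketch below shows two session variables commonly used for this purpose. This is an optional check; the availability and default values of these variables can differ between TiDB versions.

  -- Check whether the optimizer is allowed to choose the MPP mode in this session.
  SHOW VARIABLES LIKE 'tidb_allow_mpp';

  -- Optionally ask the optimizer to use the MPP mode wherever possible.
  -- This only affects the current session.
  SET SESSION tidb_enforce_mpp = ON;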

Steps

In this document, you can experience the convenience and high performance of TiDB HTAP by querying an example table in a TPC-H dataset. TPC-H is a popular decision support benchmark that consists of a suite of business-oriented ad-hoc queries with large volumes of data and a high degree of complexity. To run the complete set of 22 TPC-H SQL queries, visit the tidb-bench repo or TPC-H for instructions on how to generate query statements and data.

Step 1. Deploy a local test environment

Before using TiDB HTAP, follow the steps in the Quick Start Guide for the TiDB Database Platform to prepare a local test environment, and run the following command to deploy a TiDB cluster:

  tiup playground


Note

The tiup playground command is ONLY for quick start, NOT for production.
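
By default, tiup playground usually starts one TiDB, one TiKV, one PD, and one TiFlash instance, and it prints a MySQL connection command when the cluster is ready. If you want to double-check that a TiFlash node is running before moving on, the following optional query is one way to do so (a quick sketch, not a required step):

  -- Optional: list the instances in the test cluster and confirm that
  -- a row with TYPE = 'tiflash' is present.
  SELECT TYPE, INSTANCE, VERSION FROM information_schema.cluster_info;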

Step 2. Prepare test data

In the following steps, you will create a TPC-H dataset as the test data for TiDB HTAP. If you are interested in TPC-H, see General Implementation Guidelines.


Note

If you want to use your existing data for analytic queries, you can migrate your data to TiDB. If you want to design and create your own test data, you can create it by executing SQL statements or using related tools (see the sketch after this note).
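
For example, a minimal sketch of hand-made test data could look like the following. The table name and columns here are purely illustrative and are not part of the TPC-H dataset used in the rest of this guide:

  -- Hypothetical example of creating your own small test table.
  CREATE TABLE test.demo_orders (
      id BIGINT PRIMARY KEY,
      amount DECIMAL(10, 2),
      created_at DATE
  );
  INSERT INTO test.demo_orders VALUES
      (1, 19.99, '2024-01-01'),
      (2, 5.00, '2024-01-02');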

  1. Install the test data generation tool by running the following command:

    tiup install bench
  2. Generate the test data by running the following command:

    tiup bench tpch --sf=1 prepare

    If the output of this command shows Finished, it indicates that the data is created.

  3. Execute the following SQL statement to view the generated data:

    SELECT
        CONCAT(table_schema,'.',table_name) AS 'Table Name',
        table_rows AS 'Number of Rows',
        FORMAT_BYTES(data_length) AS 'Data Size',
        FORMAT_BYTES(index_length) AS 'Index Size',
        FORMAT_BYTES(data_length+index_length) AS 'Total'
    FROM
        information_schema.TABLES
    WHERE
        table_schema='test';

    As you can see from the output, eight tables are created in total, and the largest table has about 6.5 million rows (the exact numbers in your output might differ slightly because the data is randomly generated).

    +---------------+----------------+------------+------------+------------+
    | Table Name    | Number of Rows | Data Size  | Index Size | Total      |
    +---------------+----------------+------------+------------+------------+
    | test.nation   |             25 | 2.44 KiB   | 0 bytes    | 2.44 KiB   |
    | test.region   |              5 | 416 bytes  | 0 bytes    | 416 bytes  |
    | test.part     |         200000 | 25.07 MiB  | 0 bytes    | 25.07 MiB  |
    | test.supplier |          10000 | 1.45 MiB   | 0 bytes    | 1.45 MiB   |
    | test.partsupp |         800000 | 120.17 MiB | 12.21 MiB  | 132.38 MiB |
    | test.customer |         150000 | 24.77 MiB  | 0 bytes    | 24.77 MiB  |
    | test.orders   |        1527648 | 174.40 MiB | 0 bytes    | 174.40 MiB |
    | test.lineitem |        6491711 | 849.07 MiB | 99.06 MiB  | 948.13 MiB |
    +---------------+----------------+------------+------------+------------+
    8 rows in set (0.06 sec)

    This is a database of a commercial ordering system, in which the test.nation table contains information about countries, the test.region table contains information about regions, the test.part table contains information about parts, the test.supplier table contains information about suppliers, the test.partsupp table contains information about the parts supplied by each supplier, the test.customer table contains information about customers, the test.orders table contains information about orders, and the test.lineitem table contains information about the line items of orders.

Step 3. Query data with the row-based storage engine

To see the performance of TiDB with only the row-based storage engine, execute the following SQL statement:

  SELECT
      l_orderkey,
      SUM(
          l_extendedprice * (1 - l_discount)
      ) AS revenue,
      o_orderdate,
      o_shippriority
  FROM
      customer,
      orders,
      lineitem
  WHERE
      c_mktsegment = 'BUILDING'
      AND c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate < DATE '1996-01-01'
      AND l_shipdate > DATE '1996-02-01'
  GROUP BY
      l_orderkey,
      o_orderdate,
      o_shippriority
  ORDER BY
      revenue DESC,
      o_orderdate
  LIMIT 10;

This is a shipping priority query, which provides the priority and potential revenue of the highest-revenue orders that have not been shipped before a specified date. The potential revenue is defined as the sum of l_extendedprice * (1 - l_discount). The orders are listed in descending order of revenue. In this example, the query lists the top 10 unshipped orders by potential revenue.
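
At this point, no TiFlash replicas exist yet, so all reads are served by TiKV. If you want to confirm this, you can inspect the execution plan of a simple query; tasks shown as cop[tikv] indicate that the operators are pushed down to the row-based TiKV engine (an optional, quick check):

  -- Optional: before TiFlash replication, table scans appear as cop[tikv] tasks.
  EXPLAIN SELECT COUNT(*) FROM test.lineitem;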

Step 4. Replicate the test data to the columnar storage engine

After TiFlash is deployed, TiKV does not replicate data to TiFlash immediately. You need to execute the following DDL statements in a MySQL client connected to TiDB to specify which tables need to be replicated. After that, TiDB creates the specified replicas in TiFlash accordingly.

  ALTER TABLE test.customer SET TIFLASH REPLICA 1;
  ALTER TABLE test.orders SET TIFLASH REPLICA 1;
  ALTER TABLE test.lineitem SET TIFLASH REPLICA 1;

To check the replication status of the specific tables, execute the following statements:

  SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'customer';
  SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'orders';
  SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'lineitem';

In the result of the above statements:

  • AVAILABLE indicates whether the TiFlash replica of a specific table is available or not. 1 means available and 0 means unavailable. Once the AVAILABLE field becomes 1, this status does not change anymore.
  • PROGRESS means the progress of the replication. The value is between 0.0 and 1.0. 1 means that the replication progress of the TiFlash replica is complete.
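
If you prefer a single overview instead of three separate statements, a compact variant such as the following also works (a convenience sketch only):

  -- Check the replication status of all three tables at once.
  SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
  FROM information_schema.tiflash_replica
  WHERE TABLE_SCHEMA = 'test'
    AND TABLE_NAME IN ('customer', 'orders', 'lineitem');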

Step 5. Analyze data faster using HTAP

Execute the SQL statement in Step 3 again to see the performance of TiDB HTAP.

For tables with TiFlash replicas, the TiDB optimizer automatically determines whether to use TiFlash replicas based on cost estimation. To check whether a TiFlash replica is selected, you can use the DESC or EXPLAIN ANALYZE statement. For example:

  EXPLAIN ANALYZE SELECT
      l_orderkey,
      SUM(
          l_extendedprice * (1 - l_discount)
      ) AS revenue,
      o_orderdate,
      o_shippriority
  FROM
      customer,
      orders,
      lineitem
  WHERE
      c_mktsegment = 'BUILDING'
      AND c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate < DATE '1996-01-01'
      AND l_shipdate > DATE '1996-02-01'
  GROUP BY
      l_orderkey,
      o_orderdate,
      o_shippriority
  ORDER BY
      revenue DESC,
      o_orderdate
  LIMIT 10;

If the result of the EXPLAIN statement shows ExchangeSender and ExchangeReceiver operators, it indicates that the MPP mode has taken effect.

In addition, you can specify that the entire query, or parts of it, are computed using only the TiFlash engine. For detailed information, see Use TiDB to read TiFlash replicas.
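
For example, one way to restrict a session to the TiFlash engine in this test environment is the engine isolation setting sketched below. Note that while this setting is in effect, queries on tables without an available TiFlash replica return an error:

  -- Restrict this session to the TiFlash engine (plus the internal tidb engine
  -- needed for system tables), then restore the default afterwards.
  SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';
  -- ... run the analytical query here ...
  SET SESSION tidb_isolation_read_engines = 'tikv,tiflash,tidb';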

You can compare query results and query performance of these two methods.

What’s next