使用须知

TPCDS Catalog 通过 Trino Connector 兼容框架,使用 TPCDS Connector 来快速构建 TPCDS 测试集。

TPCDS - 图1提示

该功能自 Doris 3.0.0 版本开始支持。

编译 TPCDS Connector

需要 JDK 17 版本。

  1. git clone https://github.com/trinodb/trino.git
  2. git checkout 435
  3. cd trino/plugin/trino-tpcds
  4. mvn clean install -DskipTest

完成编译后,会在 trino/plugin/trino-tpcds/target/ 下得到 trino-tpcds-435/ 目录。

也可以直接下载预编译的 trino-tpcds-435.tar.gz 并解压。

部署 TPCDS Connector

trino-tpcds-435/ 目录放到所有 FE 和 BE 部署路径的 connectors/ 目录下。(如果没有,可以手动创建)。

  1. ├── bin
  2. ├── conf
  3. ├── connectors
  4. ├── trino-tpcds-435
  5. ...

部署完成后,建议重启 FE、BE 节点以确保 Connector 可以被正确加载。

创建 TPCDS Catalog

  1. CREATE CATALOG `tpcds` PROPERTIES (
  2. "type" = "trino-connector",
  3. "trino.connector.name" = "tpcds",
  4. "trino.tpcds.split-count" = "32"
  5. );

其中 tpcds.split-count 为并发数,建议设置为 BE 单机核数的 2 倍,可以获得最优的并发度。提升数据生成效率。

使用 TPCDS Catalog

TPCDS Catalog 中预制了不同 Scale Factor 的 TPCDS 数据集,可以通过 SHOW DATABASESSHOW TABLES 命令查看。

  1. mysql> SWITCH tpcds;
  2. Query OK, 0 rows affected (0.00 sec)
  3. mysql> SHOW DATABASES;
  4. +--------------------+
  5. | Database |
  6. +--------------------+
  7. | information_schema |
  8. | mysql |
  9. | sf1 |
  10. | sf100 |
  11. | sf1000 |
  12. | sf10000 |
  13. | sf100000 |
  14. | sf300 |
  15. | sf3000 |
  16. | sf30000 |
  17. | tiny |
  18. +--------------------+
  19. 11 rows in set (0.00 sec)
  20. mysql> USE sf1;
  21. mysql> SHOW TABLES;
  22. +------------------------+
  23. | Tables_in_sf1 |
  24. +------------------------+
  25. | call_center |
  26. | catalog_page |
  27. | catalog_returns |
  28. | catalog_sales |
  29. | customer |
  30. | customer_address |
  31. | customer_demographics |
  32. | date_dim |
  33. | dbgen_version |
  34. | household_demographics |
  35. | income_band |
  36. | inventory |
  37. | item |
  38. | promotion |
  39. | reason |
  40. | ship_mode |
  41. | store |
  42. | store_returns |
  43. | store_sales |
  44. | time_dim |
  45. | warehouse |
  46. | web_page |
  47. | web_returns |
  48. | web_sales |
  49. | web_site |
  50. +------------------------+
  51. 25 rows in set (0.00 sec)

通过 SELECT 语句可以直接查询这些表。

TPCDS - 图2提示

这些预制数据集的数据,并没有实际存储,而是在查询时实时生成的。所以这些预制数据集不适合用来直接进行 Benchmark 测试。适用于通过 INSERT INTO SELECT 将数据集写入到其他目的表(如 Doris 内表、Hive、Iceberg 等所有 Doris 支持写入的数据源)后,对目的表进行性能测试。

最佳实践

快速构建 TPCDS 测试数据集

可以通过 CTAS 语句快速构建一个 TPCDS 测试数据集:

  1. CREATE TABLE hive.tpcds100.call_center PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center ;
  2. CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page ;
  3. CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns ;
  4. CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales ;
  5. CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer ;
  6. CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address ;
  7. CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics ;
  8. CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim ;
  9. CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version ;
  10. CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
  11. CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band ;
  12. CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory ;
  13. CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item ;
  14. CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion ;
  15. CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason ;
  16. CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode ;
  17. CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store ;
  18. CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns ;
  19. CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales ;
  20. CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim ;
  21. CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse ;
  22. CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page ;
  23. CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns ;
  24. CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales ;
  25. CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site ;

TPCDS - 图3提示

在包含 3 个 16C BE 节点的 Doris 集群上,创建一个 TPCDS 1000 的 Hive 数据集,大约需要 3 到 4 个小时。