索引原理

NGram BloomFilter 索引和 BloomFilter 索引类似,也是基于 BloomFilter 的跳数索引。

与 BloomFilter 索引不同的是,NGram BloomFilter 索引用于加速文本 LIKE 查询,它存入 BloomFilter 的不是原始文本的值,而是对文本进行 NGram 分词,每个词作为值存入 BloomFilter。对于 LIKE 查询,将 LIKE ‘%pattern%’ 的 pattern 也进行 NGram 分词,判断每个词是否在 BloomFilter 中,如果某个词不在则对应的数据块就不满足 LIKE 条件,可以跳过这部分数据减少IO加速查询。

使用场景

NGram BloomFilter 索引只能加速字符串 LIKE 查询,而且 LIKE pattern 中的连续字符个数要大于等于索引定义的 NGram 中的 N。

N-Gram 索引 - 图1提示

  • NGram BloomFilter 只支持字符串列,只能加速 LIKE 查询。

  • NGram BloomFilter 索引和 BloomFilter 索引为互斥关系,即同一个列只能设置两者中的一个。

  • NGram BloomFilter 索引的效果分析,跟 BloomFilter 索引类似。

使用语法

创建 NGram BloomFilter 索引

在建表语句中 COLUMN 的定义之后是索引定义:

  1. INDEX `idx_column_name` (`column_name`) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'username ngram_bf index'

语法说明如下:

1. idx_column_name(column_name) 是必须的,column_name 是建索引的列名,必须是前面列定义中出现过的,idx_column_name 是索引名字,必须表级别唯一,建议命名规范:列名前面加前缀 idx_

2. USING NGRAM_BF 是必须的,用于指定索引类型是 NGram BloomFilter 索引

3. PROPERTIES 是可选的,用于指定 NGram BloomFilter 索引的额外属性,目前支持的属性如下:

  • gram_size:NGram 中的 N,指定 N 个连续字符分词一个词,比如 ‘an ngram example’ 在 N = 3 的时候分成 ‘an ‘, ‘n n’, ‘ ng’, ‘ngr’, ‘gra’, ‘ram’ 6 个词。

  • bf_size:BloomFilter 的大小,单位是 Bit。bf_size 决定每个数据块对应的索引大小,这个值越大占用存储空间越大,同时 Hash 碰撞的概率也越低。

  • gram_size 建议取 LIKE 查询的字符串最小长度,但是不建议低于 2。一般建议设置 “gram_size”=”3”, “bf_size”=”1024”,然后根据 Query Profile 调优。

4. COMMENT 是可选的,用于指定索引注释

查看 NGram BloomFilter 索引

  1. SHOW CREATE TABLE table_ngrambf;

删除 NGram BloomFilter 索引

  1. ALTER TABLE table_ngrambf DROP INDEX idx_ngrambf;

修改 NGram BloomFilter 索引

  1. CREATE INDEX idx_column_name2(column_name2) ON table_ngrambf USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'username ngram_bf index';
  2. ALTER TABLE table_ngrambf ADD INDEX idx_column_name2(column_name2) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'username ngram_bf index';

使用示例

以亚马逊产品的用户评论信息的数据集 amazon_reviews 为例展示 NGram BloomFilter 索引的使用和效果。

建表

  1. CREATE TABLE `amazon_reviews` (
  2. `review_date` int(11) NULL,
  3. `marketplace` varchar(20) NULL,
  4. `customer_id` bigint(20) NULL,
  5. `review_id` varchar(40) NULL,
  6. `product_id` varchar(10) NULL,
  7. `product_parent` bigint(20) NULL,
  8. `product_title` varchar(500) NULL,
  9. `product_category` varchar(50) NULL,
  10. `star_rating` smallint(6) NULL,
  11. `helpful_votes` int(11) NULL,
  12. `total_votes` int(11) NULL,
  13. `vine` boolean NULL,
  14. `verified_purchase` boolean NULL,
  15. `review_headline` varchar(500) NULL,
  16. `review_body` string NULL
  17. ) ENGINE=OLAP
  18. DUPLICATE KEY(`review_date`)
  19. COMMENT 'OLAP'
  20. DISTRIBUTED BY HASH(`review_date`) BUCKETS 16
  21. PROPERTIES (
  22. "replication_allocation" = "tag.location.default: 1",
  23. "compression" = "ZSTD"
  24. );

导入数据

用 wget 或者其他工具从下面的地址下载数据集

  1. https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2010.snappy.parquet
  2. https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2011.snappy.parquet
  3. https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2012.snappy.parquet
  4. https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2013.snappy.parquet
  5. https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2014.snappy.parquet
  6. https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet

用 stream load 导入数据

  1. curl --location-trusted -u root: -T amazon_reviews_2010.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
  2. curl --location-trusted -u root: -T amazon_reviews_2011.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
  3. curl --location-trusted -u root: -T amazon_reviews_2012.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
  4. curl --location-trusted -u root: -T amazon_reviews_2013.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
  5. curl --location-trusted -u root: -T amazon_reviews_2014.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
  6. curl --location-trusted -u root: -T amazon_reviews_2015.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load

SQL 运行 count() 确认导入数据成功

  1. mysql> SELECT COUNT() FROM amazon_reviews;
  2. +-----------+
  3. | count(*) |
  4. +-----------+
  5. | 135589433 |
  6. +-----------+

查询

首先在没有索引的时候运行查询,WHERE 条件中有 LIKE,耗时7.60s

  1. SELECT
  2. product_id,
  3. any(product_title),
  4. AVG(star_rating) AS rating,
  5. COUNT() AS count
  6. FROM
  7. amazon_reviews
  8. WHERE
  9. review_body LIKE '%is super awesome%'
  10. GROUP BY
  11. product_id
  12. ORDER BY
  13. count DESC,
  14. rating DESC,
  15. product_id
  16. LIMIT 5;
  17. +------------+------------------------------------------+--------------------+-------+
  18. | product_id | any_value(product_title) | rating | count |
  19. +------------+------------------------------------------+--------------------+-------+
  20. | B00992CF6W | Minecraft | 4.8235294117647056 | 17 |
  21. | B009UX2YAC | Subway Surfers | 4.7777777777777777 | 9 |
  22. | B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 4.875 | 8 |
  23. | B0086700CM | Temple Run | 5 | 6 |
  24. | B00KWVZ750 | Angry Birds Epic RPG | 5 | 6 |
  25. +------------+------------------------------------------+--------------------+-------+
  26. 5 rows in set (7.60 sec)

然后添加 NGram BloomFilter 索引,再次运行相同的查询耗时0.93s,性能提升了8倍

  1. ALTER TABLE amazon_reviews ADD INDEX review_body_ngram_idx(review_body) USING NGRAM_BF PROPERTIES("gram_size"="10", "bf_size"="10240");
  1. +------------+------------------------------------------+--------------------+-------+
  2. | product_id | any_value(product_title) | rating | count |
  3. +------------+------------------------------------------+--------------------+-------+
  4. | B00992CF6W | Minecraft | 4.8235294117647056 | 17 |
  5. | B009UX2YAC | Subway Surfers | 4.7777777777777777 | 9 |
  6. | B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 4.875 | 8 |
  7. | B0086700CM | Temple Run | 5 | 6 |
  8. | B00KWVZ750 | Angry Birds Epic RPG | 5 | 6 |
  9. +------------+------------------------------------------+--------------------+-------+
  10. 5 rows in set (0.93 sec)