数据库内核月报－ 2016/10 - MySQL · 特性分析 · 直方图的实现与分析 - 《数据库内核月报》

MySQL 直方图的功能
MySQL 直方图的实现
总结
Footnotes

直方图（Histogram）是 RDBMS 中提供的一种基础的统计信息，最典型的用途是估计查询谓词的选择率，以便选择优化的查询执行计划。常见的直方图种类有：等宽直方图、等高直方图、V-优化的直方图，MaxDiff 直方图等等。RDBMS 产品最初使用的直方图非常简单（只有一个桶），后来逐步演化到等宽直方图、等高直方图等。MariaDB 10.0.2 就已在 server 层实现了直方图功能，参考Take into account the selectivity 和 Histogram based statistics。MySQL 在8.0.0 中也引入了直方图，参考WL#8706和WL8707。

MySQL 直方图的功能

直方图会持久化存储到一个新的系统表 mysql.column_stats，表名与 MariaDB 的一样，但是定义是不同的。直方图的主要数据保存在一个 JSON 类型的名为 histogram 的列中。因为 8.0 的字典表都采用了 InnoDB 引擎，这个表也不例外。该特性支持所有的数据类型，包括数值类型、字符串、大对象、枚举类型等，也支持 GENERATED COLUMN。

MySQL 支持两种类型的直方图，第一种是等宽直方图的一种特殊情况，每个桶只有一个值，因此只需要保存该值和累积的频率。另一种是等高直方图，每个桶需要保存下界、上界、累积频率以及不同值的个数（Number of Distinct Value，NDV）。这两种直方图与 Oracle 的是类似的，见Histograms Part 1/Part 2/Part 3。

执行 ANALYZE TABLE [table] UPDATE HISTOGRAMS 命令可以产生表上各列的直方图，默认情况下这些信息会被复制到备库。

在文件 scripts/mysql_systemtables.sql 中可以看到该表的定义：

--
-- Column statistics
--
CREATE TABLE IF NOT EXISTS column_stats (
  database_name VARCHAR(64) NOT NULL,
  table_name VARCHAR(64) NOT NULL,
  column_name VARCHAR(64) NOT NULL,
  histogram JSON NOT NULL,
  PRIMARY KEY (database_name, table_name, column_name)
) ENGINE=InnoDB CHARACTER SET=utf8 COLLATE=utf8_bin
COMMENT="Column statistics";

下面是这两种直方图的示例。

Equi-height JSON definition
---------------------------
{
  // Last time the histogram was updated. As of now, this means "when the
  // histogram was created" (incremental updates are not supported). Date/time
  // is given in UTC.
  // -- J_DATETIME
  "last-updated": "2015-11-04 15:19:51.000000",
  // Histogram type. Always "equi-height" for equi-height histograms.
  // -- J_STRING
  "histogram-type": "equi-height",
  // Histogram buckets. This will always be at least one bucket.
  // -- J_ARRAY
  "buckets":
  [
    [
      // Lower inclusive value.
      // -- Data type depends on the source column.
      "0",
      // Upper inclusive value.
      // -- Data type depends on the source column.
      "002a38227ecc7f0d952e85ffe37832d3f58910da",
      // Cumulative frequence
      // -- J_DOUBLE
      0.001978728666831561,
      // Number of distinct values in this bucket.
      // -- J_UINT
      10
    ]
  ]
}
Singleton JSON definition
-------------------------
{
  // Last time the histogram was updated. As of now, this means "when the
  // histogram was created" (incremental updates are not supported). Date/time
  // is given in UTC.
  // -- J_DATETIME
  "last-updated": "2015-11-04 15:19:51.000000",
  // Histogram type. Always "singleton" for singleton histograms.
  // -- J_STRING
  "histogram-type": "singleton",
  // Histogram buckets. This will always be at least one bucket.
  // -- J_ARRAY
  "buckets":
  [
    [
      // Value value.
      // -- Data type depends on the source column.
      42,
      // Cumulative frequence
      // -- J_DOUBLE
      0.001978728666831561,
    ]
  ]
}

MySQL 直方图的实现

MySQL 8.0 的代码做过不少重整，目录结构也比以前清楚多了。直方图的源代码都在目录sql/histograms 下，包括以下文件。

equi_height_bucket.cc
equi_height_bucket.h
equi_height.cc
equi_height.h
histogram.cc
histogram.h
singleton.cc
singleton.h

对应的单元测试文件为：unittest/gunit/histograms-t.cc。可以看到，代码用到了 C++11 的一些特性，并且还写了比较完整的单元测试，可读性很好。代码主要部分是这三个类：直方图的基类 Histogram，以及实现等宽直方图、等高直方图的两个类 Singleton 和 Equi_height。

对外的主要接口是创建直方图的函数：

template <class T>
Histogram *build_histogram(MEM_ROOT *mem_root,
                           const value_map_type<T> &value_map,
                           ha_rows num_null_values, size_t num_buckets,
                           std::string db_name, std::string tbl_name,
                           std::string col_name)

输入的数据需要放到一个 map 里头，表示每个值以及对应的出现次数，map 是按照值排序的。直方图一般不会对表中的所有数据逐行进行分析建立，这样做的代价太高了；很多实现都是通过对数据采样进行的。因此，这里用 map 而不是 iterator 也是比较自然的。如果桶的个数（num_buckets）比不同值的个数要大，则自动选择创建一个等宽直方图；否则创建一个等高直方图。

/*
  If the number of buckets specified is greater or equal to the number
  of distinct values, we create a Singleton histogram. Otherwise we create
  an equi-height histogram.
*/
 if (num_buckets >= value_map.size())
 {
   Singleton<T> *singleton=
     new(mem_root) Singleton<T>(mem_root, db_name, tbl_name, col_name);
..
   if (singleton->build_histogram(value_map, num_null_values))
     return nullptr;                         /* purecov: inspected */
..
 }
 else
 {
   Equi_height<T> *equi_height=
     new(mem_root) Equi_height<T>(mem_root, db_name, tbl_name, col_name);
..

两种直方图的创建逻辑都比较简单，可以参看： Singleton::build~histogram~() 和 Equi~height~::build~histogram~()。

总结

通过参考资料中的内容，与 Oracle、MariaDB 做个对比，很容易发现 MySQL 8.0 目前实现的直方图还只是提供了最基础的功能，还不能用来改进查询执行计划。

MySQL · 特性分析 · 直方图的实现与分析

MySQL 直方图的功能

MySQL 直方图的实现

总结

Footnotes