概述
Performance Schema(pfs)是对MySQL的细力度的性能监控诊断工具,覆盖statement/io/memory/lock 等各个性能相关的模块。Pfs采集到的性能数据使用 performance_Schema 引擎存储,全部保存在内存。 本文关注 pfs 的内存管理。首先从代码中分析 pfs 内存管理机制,然后以一个监控项为例介绍 pfs 的流程,最后介绍下 pfs 内存相关的参数。本文代码基于 MySQL 8.0.18版本。
Pfs内存管理
核心数据结构
PFS_buffer_scalable_container PFS_buffer_scalable_container 用于内存管理(申请,扩容,释放),内部结构如下图。 其中,global*container (以下称为 container )为全局单例变量,下面是其示意图以及结构定义代码。Container 存储上分两层: page 和 record。 以 global_thread_container 为例,默认global_thread_container中包含 多个 PFS_thread_array(page), page 内部包含多个 PFS_thread(record)。 PFS_buffer_scalable_container 代码
template <class T, int PFS_PAGE_SIZE, int PFS_PAGE_COUNT,
class U = PFS_buffer_default_array<T>,
class V = PFS_buffer_default_allocator<T>>
class PFS_buffer_scalable_container {
typedef T value_type; // record 类型
typedef U array_type; // page 类型
typedef V allocator_type; // page 分配器,需实现 alloc_array/free_array
value_type *allocate(pfs_dirty_state *dirty_state); // 分配记录
void deallocate(value_type *pfs) { m_array.deallocate(pfs); } // 释放记录
array_type m_array; // 内存起始位置
size_t m_max; // PFS_PAGE_SIZE* PFS_PAGE_COUNT
allocator_type *m_allocator; // 分配器
}
class PFS_thread_allocator {
public:
int alloc_array(PFS_thread_array *array);
void free_array(PFS_thread_array *array);
};
实例化后的 container 对象复制管理 pfs 各个模块的内存分配,其与系统表对应关系如下:
globalaccount_container | events%summary_by_account_by_event_name |
---|---|
global_host_container | events%summary_by_host_by_event_name |
global_thread_container | events%summary_by_thread_by_event_name |
global_user_container | events%_summary_by_user_by_event_name |
global_mutex_container | mutex_instances |
global_rwlock_container | rwlock_instances |
global_cond_container | cond_instances |
global_socket_container | socket_instances |
global_mdl_container | metadata_locks |
Pfs内存管理模型
1) 系统启动的时候预先分配内存,系统运行期间根据需要重新分配内存
Pfs 的内存分配发生在 page 分配(即alloc_array函数),启动时初始化会分配部分page ,系统运行期间若 page 用满会分配新的 page。 在 page 内部分配 record 时,使用原子操作避免加锁。 下面是 global_thread_container
运行期间分配thread 的伪代码。
PFS_thread *pfs = global_thread_container.allocate(&dirty_state)
{
if (m_full) { m_lost++; return NULL; } // 如果container 满了直接返回
while (monotonic < monotonic_max){
array= m_pages[index]
pfs = array->allocate(dirty_state); // 从现有 page 中分配
pfs->m_page= reinterpret_cast<PFS_opaque_container_page *> (array);
return pfs;
}
array = new array_type(); // 分配新 page
int rc= m_allocator->alloc_array(array); // 内部调用PFS_MALLOC_ARRAY分配内存
}
2) Record 采用定长方式存储,每次申请固定数量长度的内存,并初始化填0
真正的内存分配由m_allocator->alloc_array进行,我们以PFS_thread_allocator::alloc_array为例展开代码,PFS_thread中保存了线程粒度下的 statement/wait/error 等数据。 每个PFS_thread对象申请的内存为固定的,以statement为例,MySQL 支持的 statement 类型为220个,每个PFS_thread内会为220个类型提前分配位置并初始化为0,这也是 pfs 内存消耗的重要原因。
int PFS_thread_allocator::alloc_array(PFS_thread_array *array) {
size_t size = array->m_max; // 单个 page 内保存的记录(即 PFS_thread)数
size_t index;
size_t waits_sizing = size * wait_class_max; // wait_class_max 为等待事件的种类
size_t statements_sizing = size * statement_class_max; // statement_class_max 语句类型个数
size_t transactions_sizing = size * transaction_class_max; // 事务类型个数
size_t errors_sizing = (max_server_errors != 0) ? size * error_class_max : 0; // error 类型个数
...
array->m_ptr =
PFS_MALLOC_ARRAY(&builtin_memory_thread, size, sizeof(PFS_thread),
PFS_thread, MYF(MY_ZEROFILL));
array->m_instr_class_waits_array = PFS_MALLOC_ARRAY(
&builtin_memory_thread_waits, waits_sizing, sizeof(PFS_single_stat),
PFS_single_stat, MYF(MY_ZEROFILL));
array->m_instr_class_statements_array = PFS_MALLOC_ARRAY(
&builtin_memory_thread_statements, statements_sizing,
sizeof(PFS_statement_stat), PFS_statement_stat, MYF(MY_ZEROFILL));
array->m_instr_class_errors_array = PFS_MALLOC_ARRAY(
&builtin_memory_host_errors, errors_sizing, sizeof(PFS_error_stat),
PFS_error_stat, MYF(MY_ZEROFILL));
...
}
3) 系统运行期间不释放内存,只在shutdown时 释放内存
下面是thread_container 释放thread 的代码逻辑
global_thread_container.deallocate(pfs);
{ // 只是标记回收,并不会实际释放空间
safe_pfs->m_lock.allocated_to_free();
page->m_full = false;
m_full = false;
}
4) 数据在不同粒度的维度汇总
Pfs 数据库下可以看到对同一个监控指标有很多个不同的表,每个表代表一个统计的维度。
mysql> show tables like '%statement%summary%';
+----------------------------------------------------+
| Tables_in_performance_schema (%statement%summary%) |
+----------------------------------------------------+
| events_statements_summary_by_account_by_event_name |
| events_statements_summary_by_digest |
| events_statements_summary_by_digest_supplement |
| events_statements_summary_by_host_by_event_name |
| events_statements_summary_by_program |
| events_statements_summary_by_thread_by_event_name |
| events_statements_summary_by_user_by_event_name |
| events_statements_summary_global_by_event_name |
+----------------------------------------------------+
在内部,不同的统计维度被称为集合(aggregates),对同一条数据在内部只会保存一份,运行期间会进行从细维度到高纬度的汇总。 pfs.cc代码注释中用这种图表的方式进行了说明,下面 以statement 为例介绍下汇总的过程,读者可以自己理解下。
statement_locker(T, S)
|
| [1]
|
1a |-> pfs_thread(T).event_name(S) =====>> [A], [B], [C], [D], [E]
| |
| | [2]
| |
| 2a |-> pfs_account(U, H).event_name(S) =====>> [B], [C], [D], [E]
| . |
| . | [3-RESET]
| . |
| 2b .....+-> pfs_user(U).event_name(S) =====>> [C]
| . |
| 2c .....+-> pfs_host(H).event_name(S) =====>> [D], [E]
| . . |
| . . | [4-RESET]
| 2d . . |
1b |----+----+----+-> pfs_statement_class(S) =====>> [E]
|
1c |-> pfs_thread(T).statement_current(S) =====>> [F]
|
1d |-> pfs_thread(T).statement_history(S) =====>> [G]
|
1e |-> statement_history_long(S) =====>> [H]
|
1f |-> statement_digest(S) =====>> [I]
@endverbatim
Implemented as:
- [1] #pfs_start_statement_v2(), #pfs_end_statement_v2()
(1a, 1b) is an aggregation by EVENT_NAME,
(1c, 1d, 1e) is an aggregation by TIME,
(1f) is an aggregation by DIGEST
all of these are orthogonal,
and implemented in #pfs_end_statement_v2().
- [2] #pfs_delete_thread_v1(), #aggregate_thread_statements()
- [3] @c PFS_account::aggregate_statements()
- [4] @c PFS_host::aggregate_statements()
- [A] EVENTS_STATEMENTS_SUMMARY_BY_THREAD_BY_EVENT_NAME,
@c table_esms_by_thread_by_event_name::make_row()
- [B] EVENTS_STATEMENTS_SUMMARY_BY_ACCOUNT_BY_EVENT_NAME,
@c table_esms_by_account_by_event_name::make_row()
- [C] EVENTS_STATEMENTS_SUMMARY_BY_USER_BY_EVENT_NAME,
@c table_esms_by_user_by_event_name::make_row()
- [D] EVENTS_STATEMENTS_SUMMARY_BY_HOST_BY_EVENT_NAME,
@c table_esms_by_host_by_event_name::make_row()
- [E] EVENTS_STATEMENTS_SUMMARY_GLOBAL_BY_EVENT_NAME,
@c table_esms_global_by_event_name::make_row()
- [F] EVENTS_STATEMENTS_CURRENT,
@c table_events_statements_current::make_row()
- [G] EVENTS_STATEMENTS_HISTORY,
@c table_events_statements_history::make_row()
- [H] EVENTS_STATEMENTS_HISTORY_LONG,
@c table_events_statements_history_long::make_row()
- [I] EVENTS_STATEMENTS_SUMMARY_BY_DIGEST
@c table_esms_by_digest::make_row()
Pfs性能监控过程
这里以statement 的一个监控项为例来介绍 pfs 性能数据采集的整个过程。 监控数据最终记录在 events_statements_summary_by_thread_by_event_name
表中,需提前打开 setup_consumers.thread_instrumentation
开关。
线程创建
调用入口: PSI_THREAD_CALL(new_thread)
线程启动时进行在全局container( global_thread_container
)中申请内存空间,并进行一系列的监控数据初始化。 首先尝试在现有的 page 中申请空闲的record, 找不到的话申请新的page。
语句开始前
调用入口: MYSQL_START_STATEMENT
在语句开始的位置调用进行,比如 在dispatch_command
函数中,进行statement 统计的初始化,记录 sql 启动时间。
语句结束后
调用入口: MYSQL_END_STATEMENT
pfs_end_statement_v2(PSI_statement_locker *locker, void *stmt_da)
{
PSI_statement_locker_state *state =
reinterpret_cast<PSI_statement_locker_state *>(locker);
// 填充 pfs
PFS_events_statements *pfs =
reinterpret_cast<PFS_events_statements *>(state->m_statement);
insert_events_statements_history(thread, pfs); // 写入到 EVENTS_STATEMENTS_HISTORY
insert_events_statements_history_long(pfs); // 写入到 EVENTS_STATEMENTS_HISTORY_LONG
// 获取写入的位置
event_name_array = thread->write_instr_class_statements_stats(); // PFS_statement_stat*
stat = &event_name_array[index];
// 开始填充 stat,写入汇总表
stat->m_lock_time += state->m_lock_time;
}
线程结束
调用入口: PSI_THREAD_CALL(delete_current_thread)
void pfs_delete_current_thread_vc(void) {
// 将线程的数据汇总到 account 或者 host 统计中
aggregate_thread(thread, thread->m_account, thread->m_user, thread->m_host);
...
// 销毁 pfs thread, global_thread_container 收回空间
global_thread_container.deallocate(pfs);
}
Pfs内存参数设置
主要看下影响pfs内存使用的相关参数
performance_schema%max%instance
控制监控实体的个数,内部即限制对应 container 的容量。
+------------------------------------------------------+-------+
| Variable_name | Value |
+------------------------------------------------------+-------+
| performance_schema_max_cond_instances | -1 |
| performance_schema_max_file_instances | -1 |
| performance_schema_max_mutex_instances | -1 |
| performance_schema_max_prepared_statements_instances | -1 |
| performance_schema_max_program_instances | -1 |
| performance_schema_max_rwlock_instances | -1 |
| performance_schema_max_socket_instances | -1 |
| performance_schema_max_table_instances | -1 |
| performance_schema_max_thread_instances | -1 |
+------------------------------------------------------+-------+
performance_schema_max_cond_instances global_cond_container
performance_schema_max_file_instances global_file_container
performance_schema_max_mutex_instances global_mutex_container
performance_schema_max_prepared_statements_instances global_prepared_stmt_container
performance_schema_max_program_instances global_program_container
performance_schema_max_rwlock_instances global_rwlock_container
performance_schema_max_socket_instances global_socket_container
performance_schema_max_table_instances global_table_share_container
performance_schema_max_thread_instances global_thread_container
performance_schema_%_size
影响对应表的记录上限
ysql> show global variables like 'performance_schema_%_size';
+----------------------------------------------------------+-------+
| Variable_name | Value |
+----------------------------------------------------------+-------+
| performance_schema_accounts_size | -1 |
| performance_schema_digests_size | 100 |
| performance_schema_error_size | 20 |
| performance_schema_events_stages_history_long_size | 10000 |
| performance_schema_events_stages_history_size | 10 |
| performance_schema_events_statements_history_long_size | 10000 |
| performance_schema_events_statements_history_size | 10 |
| performance_schema_events_transactions_history_long_size | 10000 |
| performance_schema_events_transactions_history_size | 10 |
| performance_schema_events_waits_history_long_size | 10000 |
| performance_schema_events_waits_history_size | 10 |
| performance_schema_hosts_size | -1 |
| performance_schema_session_connect_attrs_size | 512 |
| performance_schema_setup_actors_size | -1 |
| performance_schema_setup_objects_size | -1 |
| performance_schema_users_size | -1 |
+----------------------------------------------------------+-------+
其他参数:
performance_schema_error_size: 监控的系统错误码个数,如果对错误码没有监控需求,建议调低 performance_schema_digests_size: events_statements_summary_by_digest 表的最大容量