指标埋点

指标埋点

概述

1. 指标接入说明

2. 指标体系设计

Dubbo的指标体系,总共涉及三块,指标收集、本地聚合、指标推送

  • 指标收集:将Dubbo内部需要监控的指标推送至统一的Collector中进行存储
  • 本地聚合:指标收集获取的均为基础指标,而一些分位数指标则需通过本地聚合计算得出
  • 指标推送:收集和聚合后的指标通过一定的方式推送至第三方服务器,目前只涉及Prometheus

3. 结构设计

  • 移除原来与 Metrics 相关的类
  • 创建新模块 dubbo-metrics/dubbo-metrics-api、dubbo-metrics/dubbo-metrics-prometheus,MetricsConfig 作为该模块的配置类
  • 使用micrometer,在Collector中使用基本类型代表指标,如Long、Double等,并在dubbo-metrics-api中引入micrometer,由micrometer对内部指标进行转换

4. 数据流转

img.png

5. 目标

指标接口将提供一个 MetricsService,该 Service 不仅提供柔性服务所的接口级数据,也提供所有指标的查询方式,其中方法级指标的查询的接口可按如下方式声明

  1. public interface MetricsService {
  2. /**
  3. * Default {@link MetricsService} extension name.
  4. */
  5. String DEFAULT_EXTENSION_NAME = "default";
  6. /**
  7. * The contract version of {@link MetricsService}, the future update must make sure compatible.
  8. */
  9. String VERSION = "1.0.0";
  10. /**
  11. * Get metrics by prefixes
  12. *
  13. * @param categories categories
  14. * @return metrics - key=MetricCategory value=MetricsEntityList
  15. */
  16. Map<MetricsCategory, List<MetricsEntity>> getMetricsByCategories(List<MetricsCategory> categories);
  17. /**
  18. * Get metrics by interface and prefixes
  19. *
  20. * @param serviceUniqueName serviceUniqueName (eg.group/interfaceName:version)
  21. * @param categories categories
  22. * @return metrics - key=MetricCategory value=MetricsEntityList
  23. */
  24. Map<MetricsCategory, List<MetricsEntity>> getMetricsByCategories(String serviceUniqueName, List<MetricsCategory> categories);
  25. /**
  26. * Get metrics by interface、method and prefixes
  27. *
  28. * @param serviceUniqueName serviceUniqueName (eg.group/interfaceName:version)
  29. * @param methodName methodName
  30. * @param parameterTypes method parameter types
  31. * @param categories categories
  32. * @return metrics - key=MetricCategory value=MetricsEntityList
  33. */
  34. Map<MetricsCategory, List<MetricsEntity>> getMetricsByCategories(String serviceUniqueName, String methodName, Class<?>[] parameterTypes, List<MetricsCategory> categories);
  35. }

其中 MetricsCategory 设计如下:

  1. public enum MetricsCategory {
  2. RT,
  3. QPS,
  4. REQUESTS,
  5. }

MetricsEntity 设计如下

  1. public class MetricsEntity {
  2. private String name;
  3. private Map<String, String> tags;
  4. private MetricsCategory category;
  5. private Object value;
  6. }

指标收集

1. 嵌入位置

Dubbo 架构图如下 img.png

在 provider 中添加一层 MetricsFilter 重写 invoke 方法嵌入调用链路用于收集指标,用 try-catch-finally 处理,核心代码如下

  1. @Activate(group = PROVIDER, order = -1)
  2. public class MetricsFilter implements Filter, ScopeModelAware {
  3. @Override
  4. public Result invoke(Invoker<?> invoker, Invocation invocation) throws RpcException {
  5. collector.increaseTotalRequests(interfaceName, methodName, group, version);
  6. collector.increaseProcessingRequests(interfaceName, methodName, group, version);
  7. Long startTime = System.currentTimeMillis();
  8. try {
  9. Result invoke = invoker.invoke(invocation);
  10. collector.increaseSucceedRequests(interfaceName, methodName, group, version);
  11. return invoke;
  12. } catch (RpcException e) {
  13. collector.increaseFailedRequests(interfaceName, methodName, group, version);
  14. throw e;
  15. } finally {
  16. Long endTime = System.currentTimeMillis();
  17. Long rt = endTime - startTime;
  18. collector.addRT(interfaceName, methodName, group, version, rt);
  19. collector.decreaseProcessingRequests(interfaceName, methodName, group, version);
  20. }
  21. }
  22. }

2. 指标标识

用以下五个属性作为隔离级别区分标识不同方法,也是各个 ConcurrentHashMap 的 key

  1. public class MethodMetric {
  2. private String applicationName;
  3. private String interfaceName;
  4. private String methodName;
  5. private String group;
  6. private String version;
  7. }

3. 基础指标

指标通过 common 模块下的 MetricsCollector 存储所有指标数据

  1. public class DefaultMetricsCollector implements MetricsCollector {
  2. private Boolean collectEnabled = false;
  3. private final List<MetricsListener> listeners = new ArrayList<>();
  4. private final ApplicationModel applicationModel;
  5. private final String applicationName;
  6. private final Map<MethodMetric, AtomicLong> totalRequests = new ConcurrentHashMap<>();
  7. private final Map<MethodMetric, AtomicLong> succeedRequests = new ConcurrentHashMap<>();
  8. private final Map<MethodMetric, AtomicLong> failedRequests = new ConcurrentHashMap<>();
  9. private final Map<MethodMetric, AtomicLong> processingRequests = new ConcurrentHashMap<>();
  10. private final Map<MethodMetric, AtomicLong> lastRT = new ConcurrentHashMap<>();
  11. private final Map<MethodMetric, LongAccumulator> minRT = new ConcurrentHashMap<>();
  12. private final Map<MethodMetric, LongAccumulator> maxRT = new ConcurrentHashMap<>();
  13. private final Map<MethodMetric, AtomicLong> avgRT = new ConcurrentHashMap<>();
  14. private final Map<MethodMetric, AtomicLong> totalRT = new ConcurrentHashMap<>();
  15. private final Map<MethodMetric, AtomicLong> rtCount = new ConcurrentHashMap<>();
  16. }

本地聚合

本地聚合指将一些简单的指标通过计算获取各分位数指标的过程

1. 参数设计

收集指标时,默认只收集基础指标,而一些单机聚合指标则需要开启服务柔性或者本地聚合后另起线程计算。此处若开启服务柔性,则本地聚合默认开启

1.1 本地聚合开启方式

  1. <dubbo:metrics>
  2. <dubbo:aggregation enable="true" />
  3. </dubbo:metrics>

1.2 指标聚合参数

  1. <dubbo:metrics>
  2. <dubbo:aggregation enable="true" bucket-num="5" time-window-seconds="10"/>
  3. </dubbo:metrics>

2. 具体指标

Dubbo的指标模块帮助用户从外部观察正在运行的系统的内部服务状况 ,Dubbo参考 “四大黄金信号”RED方法USE方法等理论并结合实际企业应用场景从不同维度统计了丰富的关键指标,关注这些核心指标对于提供可用性的服务是至关重要的。

Dubbo的关键指标包含:延迟(Latency)流量(Traffic)错误(Errors)饱和度(Saturation) 等内容 。同时,为了更好的监测服务运行状态,Dubbo 还提供了对核心组件状态的监控,如Dubbo应用信息、线程池信息、三大中心交互的指标数据等。

在Dubbo中主要包含如下监控指标:

基础设施业务监控
延迟类IO 等待; 网络延迟;接口、服务的平均耗时、TP90、TP99、TP999 等
流量类网络和磁盘 IO;服务层面的 QPS、
错误类宕机; 磁盘(坏盘或文件系统错误); 进程或端口挂掉; 网络丢包;错误日志;业务状态码、错误码走势;
饱和度类系统资源利用率: CPU、内存、磁盘、网络等; 饱和度:等待线程数,队列积压长度;这里主要包含JVM、线程池等
  • qps: 基于滑动窗口获取动态qps
  • rt: 基于滑动窗口获取动态rt
  • 失败请求数: 基于滑动窗口获取最近时间内的失败请求数
  • 成功请求数: 基于滑动窗口获取最近时间内的成功请求数
  • 处理中请求数: 前后增加Filter简单统计
  • 具体指标依赖滑动窗口,额外使用 AggregateMetricsCollector 收集

输出到普罗米修斯的相关指标可以参考的内容如下:

  1. # HELP jvm_gc_live_data_size_bytes Size of long-lived heap memory pool after reclamation
  2. # TYPE jvm_gc_live_data_size_bytes gauge
  3. jvm_gc_live_data_size_bytes 1.6086528E7
  4. # HELP requests_succeed_aggregate Aggregated Succeed Requests
  5. # TYPE requests_succeed_aggregate gauge
  6. requests_succeed_aggregate{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 39.0
  7. # HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
  8. # TYPE jvm_buffer_memory_used_bytes gauge
  9. jvm_buffer_memory_used_bytes{id="direct",} 1.679975E7
  10. jvm_buffer_memory_used_bytes{id="mapped",} 0.0
  11. # HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
  12. # TYPE jvm_gc_memory_allocated_bytes_total counter
  13. jvm_gc_memory_allocated_bytes_total 2.9884416E9
  14. # HELP requests_total_aggregate Aggregated Total Requests
  15. # TYPE requests_total_aggregate gauge
  16. requests_total_aggregate{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 39.0
  17. # HELP system_load_average_1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
  18. # TYPE system_load_average_1m gauge
  19. system_load_average_1m 0.0
  20. # HELP system_cpu_usage The "recent cpu usage" for the whole system
  21. # TYPE system_cpu_usage gauge
  22. system_cpu_usage 0.015802269043760128
  23. # HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
  24. # TYPE jvm_threads_peak_threads gauge
  25. jvm_threads_peak_threads 40.0
  26. # HELP requests_processing Processing Requests
  27. # TYPE requests_processing gauge
  28. requests_processing{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 0.0
  29. # HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
  30. # TYPE jvm_memory_max_bytes gauge
  31. jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 1.22912768E8
  32. jvm_memory_max_bytes{area="heap",id="G1 Survivor Space",} -1.0
  33. jvm_memory_max_bytes{area="heap",id="G1 Old Gen",} 9.52107008E8
  34. jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
  35. jvm_memory_max_bytes{area="heap",id="G1 Eden Space",} -1.0
  36. jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 5828608.0
  37. jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
  38. jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 1.22916864E8
  39. # HELP jvm_threads_states_threads The current number of threads having BLOCKED state
  40. # TYPE jvm_threads_states_threads gauge
  41. jvm_threads_states_threads{state="blocked",} 0.0
  42. jvm_threads_states_threads{state="runnable",} 10.0
  43. jvm_threads_states_threads{state="waiting",} 16.0
  44. jvm_threads_states_threads{state="timed-waiting",} 13.0
  45. jvm_threads_states_threads{state="new",} 0.0
  46. jvm_threads_states_threads{state="terminated",} 0.0
  47. # HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
  48. # TYPE jvm_buffer_total_capacity_bytes gauge
  49. jvm_buffer_total_capacity_bytes{id="direct",} 1.6799749E7
  50. jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
  51. # HELP rt_p99 Response Time P99
  52. # TYPE rt_p99 gauge
  53. rt_p99{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 1.0
  54. # HELP jvm_memory_used_bytes The amount of used memory
  55. # TYPE jvm_memory_used_bytes gauge
  56. jvm_memory_used_bytes{area="heap",id="G1 Survivor Space",} 1048576.0
  57. jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 1.462464E7
  58. jvm_memory_used_bytes{area="heap",id="G1 Old Gen",} 1.6098728E7
  59. jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 4.0126952E7
  60. jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 8.2837504E7
  61. jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 1372032.0
  62. jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 4519248.0
  63. jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 5697408.0
  64. # HELP qps Query Per Seconds
  65. # TYPE qps gauge
  66. qps{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 0.3333333333333333
  67. # HELP rt_min Min Response Time
  68. # TYPE rt_min gauge
  69. rt_min{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 0.0
  70. # HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
  71. # TYPE jvm_buffer_count_buffers gauge
  72. jvm_buffer_count_buffers{id="mapped",} 0.0
  73. jvm_buffer_count_buffers{id="direct",} 10.0
  74. # HELP system_cpu_count The number of processors available to the Java virtual machine
  75. # TYPE system_cpu_count gauge
  76. system_cpu_count 2.0
  77. # HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
  78. # TYPE jvm_classes_loaded_classes gauge
  79. jvm_classes_loaded_classes 7325.0
  80. # HELP rt_total Total Response Time
  81. # TYPE rt_total gauge
  82. rt_total{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 2783.0
  83. # HELP rt_last Last Response Time
  84. # TYPE rt_last gauge
  85. rt_last{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 0.0
  86. # HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
  87. # TYPE jvm_gc_memory_promoted_bytes_total counter
  88. jvm_gc_memory_promoted_bytes_total 1.4450952E7
  89. # HELP jvm_gc_pause_seconds Time spent in GC pause
  90. # TYPE jvm_gc_pause_seconds summary
  91. jvm_gc_pause_seconds_count{action="end of minor GC",cause="Metadata GC Threshold",} 2.0
  92. jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Metadata GC Threshold",} 0.026
  93. jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause",} 37.0
  94. jvm_gc_pause_seconds_sum{action="end of minor GC",cause="G1 Evacuation Pause",} 0.156
  95. # HELP jvm_gc_pause_seconds_max Time spent in GC pause
  96. # TYPE jvm_gc_pause_seconds_max gauge
  97. jvm_gc_pause_seconds_max{action="end of minor GC",cause="Metadata GC Threshold",} 0.0
  98. jvm_gc_pause_seconds_max{action="end of minor GC",cause="G1 Evacuation Pause",} 0.0
  99. # HELP rt_p95 Response Time P95
  100. # TYPE rt_p95 gauge
  101. rt_p95{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 0.0
  102. # HELP requests_total Total Requests
  103. # TYPE requests_total gauge
  104. requests_total{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 27738.0
  105. # HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
  106. # TYPE process_cpu_usage gauge
  107. process_cpu_usage 8.103727714748784E-4
  108. # HELP rt_max Max Response Time
  109. # TYPE rt_max gauge
  110. rt_max{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 4.0
  111. # HELP jvm_gc_max_data_size_bytes Max size of long-lived heap memory pool
  112. # TYPE jvm_gc_max_data_size_bytes gauge
  113. jvm_gc_max_data_size_bytes 9.52107008E8
  114. # HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
  115. # TYPE jvm_threads_live_threads gauge
  116. jvm_threads_live_threads 39.0
  117. # HELP jvm_threads_daemon_threads The current number of live daemon threads
  118. # TYPE jvm_threads_daemon_threads gauge
  119. jvm_threads_daemon_threads 36.0
  120. # HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
  121. # TYPE jvm_classes_unloaded_classes_total counter
  122. jvm_classes_unloaded_classes_total 0.0
  123. # HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
  124. # TYPE jvm_memory_committed_bytes gauge
  125. jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 1.4680064E7
  126. jvm_memory_committed_bytes{area="heap",id="G1 Survivor Space",} 1048576.0
  127. jvm_memory_committed_bytes{area="heap",id="G1 Old Gen",} 5.24288E7
  128. jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 4.1623552E7
  129. jvm_memory_committed_bytes{area="heap",id="G1 Eden Space",} 9.0177536E7
  130. jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 2555904.0
  131. jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 5111808.0
  132. jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 5701632.0
  133. # HELP requests_succeed Succeed Requests
  134. # TYPE requests_succeed gauge
  135. requests_succeed{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 27738.0
  136. # HELP rt_avg Average Response Time
  137. # TYPE rt_avg gauge
  138. rt_avg{application_name="metrics-provider",group="",hostname="iZ8lgm9icspkthZ",interface="org.apache.dubbo.samples.metrics.prometheus.api.DemoService",ip="172.28.236.104",method="sayHello",version="",} 0.0

聚合收集器

  1. public class AggregateMetricsCollector implements MetricsCollector, MetricsListener {
  2. private int bucketNum;
  3. private int timeWindowSeconds;
  4. private final Map<MethodMetric, TimeWindowCounter> totalRequests = new ConcurrentHashMap<>();
  5. private final Map<MethodMetric, TimeWindowCounter> succeedRequests = new ConcurrentHashMap<>();
  6. private final Map<MethodMetric, TimeWindowCounter> failedRequests = new ConcurrentHashMap<>();
  7. private final Map<MethodMetric, TimeWindowCounter> qps = new ConcurrentHashMap<>();
  8. private final Map<MethodMetric, TimeWindowQuantile> rt = new ConcurrentHashMap<>();
  9. private final ApplicationModel applicationModel;
  10. private static final Integer DEFAULT_COMPRESSION = 100;
  11. private static final Integer DEFAULT_BUCKET_NUM = 10;
  12. private static final Integer DEFAULT_TIME_WINDOW_SECONDS = 120;
  13. //在构造函数中解析配置信息
  14. public AggregateMetricsCollector(ApplicationModel applicationModel) {
  15. this.applicationModel = applicationModel;
  16. ConfigManager configManager = applicationModel.getApplicationConfigManager();
  17. MetricsConfig config = configManager.getMetrics().orElse(null);
  18. if (config != null && config.getAggregation() != null && Boolean.TRUE.equals(config.getAggregation().getEnabled())) {
  19. // only registered when aggregation is enabled.
  20. registerListener();
  21. AggregationConfig aggregation = config.getAggregation();
  22. this.bucketNum = aggregation.getBucketNum() == null ? DEFAULT_BUCKET_NUM : aggregation.getBucketNum();
  23. this.timeWindowSeconds = aggregation.getTimeWindowSeconds() == null ? DEFAULT_TIME_WINDOW_SECONDS : aggregation.getTimeWindowSeconds();
  24. }
  25. }
  26. }

如果开启了本地聚合,则通过 spring 的 BeanFactory 添加监听,将 AggregateMetricsCollector 与 DefaultMetricsCollector 绑定,实现一种生存者消费者的模式,DefaultMetricsCollector 中使用监听器列表,方便扩展

  1. private void registerListener() {
  2. applicationModel.getBeanFactory().getBean(DefaultMetricsCollector.class).addListener(this);
  3. }

3. 指标聚合

滑动窗口 假设我们初始有6个bucket,每个窗口时间设置为2分钟 每次写入指标数据时,会将数据分别写入6个bucket内,每隔两分钟移动一个bucket并且清除原来bucket内的数据 读取指标时,读取当前current指向的bucket,以达到滑动窗口的效果 具体如下图所示,实现了当前 bucket 内存储了配置中设置的 bucket 生命周期内的数据,即近期数据 img_1.png

在每个bucket内,使用TDigest 算法计算分位数指标

TDigest 算法(极端分位精确度高,如p1 p99,中间分位精确度低,如p50),相关资料如下

代码实现如下,除了 TimeWindowQuantile 用来计算分位数指标外,另外提供了 TimeWindowCounter 来收集时间区间内的指标数量

  1. public class TimeWindowQuantile {
  2. private final double compression;
  3. private final TDigest[] ringBuffer;
  4. private int currentBucket;
  5. private long lastRotateTimestampMillis;
  6. private final long durationBetweenRotatesMillis;
  7. public TimeWindowQuantile(double compression, int bucketNum, int timeWindowSeconds) {
  8. this.compression = compression;
  9. this.ringBuffer = new TDigest[bucketNum];
  10. for (int i = 0; i < bucketNum; i++) {
  11. this.ringBuffer[i] = TDigest.createDigest(compression);
  12. }
  13. this.currentBucket = 0;
  14. this.lastRotateTimestampMillis = System.currentTimeMillis();
  15. this.durationBetweenRotatesMillis = TimeUnit.SECONDS.toMillis(timeWindowSeconds) / bucketNum;
  16. }
  17. public synchronized double quantile(double q) {
  18. TDigest currentBucket = rotate();
  19. return currentBucket.quantile(q);
  20. }
  21. public synchronized void add(double value) {
  22. rotate();
  23. for (TDigest bucket : ringBuffer) {
  24. bucket.add(value);
  25. }
  26. }
  27. private TDigest rotate() {
  28. long timeSinceLastRotateMillis = System.currentTimeMillis() - lastRotateTimestampMillis;
  29. while (timeSinceLastRotateMillis > durationBetweenRotatesMillis) {
  30. ringBuffer[currentBucket] = TDigest.createDigest(compression);
  31. if (++currentBucket >= ringBuffer.length) {
  32. currentBucket = 0;
  33. }
  34. timeSinceLastRotateMillis -= durationBetweenRotatesMillis;
  35. lastRotateTimestampMillis += durationBetweenRotatesMillis;
  36. }
  37. return ringBuffer[currentBucket];
  38. }
  39. }

指标推送

指标推送只有用户在设置了<dubbo:metrics />配置且配置protocol参数后才开启,若只开启指标聚合,则默认不推送指标。

1. Promehteus Pull ServiceDiscovery

使用dubbo-admin等类似的中间层,启动时根据配置将本机 IP、Port、MetricsURL 推送地址信息至dubbo-admin(或任意中间层)的方式,暴露HTTP ServiceDiscovery供prometheus读取,配置方式如<dubbo:metrics protocol=“prometheus” mode=“pull” address=”${dubbo-admin.address}” port=“20888” url=”/metrics”/>,其中在pull模式下address为可选参数,若不填则需用户手动在Prometheus配置文件中配置地址

  1. private void exportHttpServer() {
  2. boolean exporterEnabled = url.getParameter(PROMETHEUS_EXPORTER_ENABLED_KEY, false);
  3. if (exporterEnabled) {
  4. int port = url.getParameter(PROMETHEUS_EXPORTER_METRICS_PORT_KEY, PROMETHEUS_DEFAULT_METRICS_PORT);
  5. String path = url.getParameter(PROMETHEUS_EXPORTER_METRICS_PATH_KEY, PROMETHEUS_DEFAULT_METRICS_PATH);
  6. if (!path.startsWith("/")) {
  7. path = "/" + path;
  8. }
  9. try {
  10. prometheusExporterHttpServer = HttpServer.create(new InetSocketAddress(port), 0);
  11. prometheusExporterHttpServer.createContext(path, httpExchange -> {
  12. String response = prometheusRegistry.scrape();
  13. httpExchange.sendResponseHeaders(200, response.getBytes().length);
  14. try (OutputStream os = httpExchange.getResponseBody()) {
  15. os.write(response.getBytes());
  16. }
  17. });
  18. httpServerThread = new Thread(prometheusExporterHttpServer::start);
  19. httpServerThread.start();
  20. } catch (IOException e) {
  21. throw new RuntimeException(e);
  22. }
  23. }
  24. }

2. Prometheus Push Pushgateway

用户直接在Dubbo配置文件中配置Prometheus Pushgateway的地址即可,如<dubbo:metrics protocol=“prometheus” mode=“push” address=”${prometheus.pushgateway-url}” interval=“5” />,其中interval代表推送间隔

  1. private void schedulePushJob() {
  2. boolean pushEnabled = url.getParameter(PROMETHEUS_PUSHGATEWAY_ENABLED_KEY, false);
  3. if (pushEnabled) {
  4. String baseUrl = url.getParameter(PROMETHEUS_PUSHGATEWAY_BASE_URL_KEY);
  5. String job = url.getParameter(PROMETHEUS_PUSHGATEWAY_JOB_KEY, PROMETHEUS_DEFAULT_JOB_NAME);
  6. int pushInterval = url.getParameter(PROMETHEUS_PUSHGATEWAY_PUSH_INTERVAL_KEY, PROMETHEUS_DEFAULT_PUSH_INTERVAL);
  7. String username = url.getParameter(PROMETHEUS_PUSHGATEWAY_USERNAME_KEY);
  8. String password = url.getParameter(PROMETHEUS_PUSHGATEWAY_PASSWORD_KEY);
  9. NamedThreadFactory threadFactory = new NamedThreadFactory("prometheus-push-job", true);
  10. pushJobExecutor = Executors.newScheduledThreadPool(1, threadFactory);
  11. PushGateway pushGateway = new PushGateway(baseUrl);
  12. if (!StringUtils.isBlank(username)) {
  13. pushGateway.setConnectionFactory(new BasicAuthHttpConnectionFactory(username, password));
  14. }
  15. pushJobExecutor.scheduleWithFixedDelay(() -> push(pushGateway, job), pushInterval, pushInterval, TimeUnit.SECONDS);
  16. }
  17. }
  18. protected void push(PushGateway pushGateway, String job) {
  19. try {
  20. pushGateway.pushAdd(prometheusRegistry.getPrometheusRegistry(), job);
  21. } catch (IOException e) {
  22. logger.error("Error occurred when pushing metrics to prometheus: ", e);
  23. }
  24. }

可视化展示

目前推荐使用 Prometheus 来进行服务监控,Grafana 来展示指标数据。可以通过案例来快速入门 Dubbo 可视化监控

最后修改 April 15, 2023: Update metrics.md (#2538) (51028074e33)