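High Availability Assurance

This document describes the configuration and recommendations for keeping ZadigX highly available.

High Availability

High availability covers three layers: the K8s infrastructure, the ZadigX system components, and the databases.

K8s Resource High Availability

High availability of the underlying resources is the responsibility of the Kubernetes cluster provider (public cloud or self-hosted cloud).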

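Application High Availability

The following practices are recommended to keep the application highly available:

  • Deploy with native K8s Deployments, so that a service that exits abnormally, even in extreme cases, is restarted quickly
  • Use the built-in metrics endpoint together with Prometheus and the preconfigured Grafana dashboard to display ZadigX component performance metrics (CPU / memory usage) in real time
  • Configure monitoring alerts for the core components with Prometheus Alertmanager. Suggested alert rules:
    • aslan: alert when CPU usage exceeds 2 cores or memory usage exceeds 4 GB for more than 2 minutes
    • cron: alert when CPU usage exceeds 1 core or memory usage exceeds 512 MB for more than 2 minutes
    • dind: alert when CPU usage exceeds 4 cores or memory usage exceeds 8 GB for more than 2 minutes
  • Back up the current installation parameters and the Chart:
    • Run helm get values <ZadigX Release Name> -n <ZadigX Namespace> > ZadigX.yaml to export the current installation parameters
    • Run helm pull koderover-chart/zadigx --version=<version> to back up the current Chart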
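
The alert thresholds listed above can be expressed as Prometheus alerting rules. The script below is a minimal sketch, not an official rule set. It assumes that ZadigX runs in a namespace named zadig, that pod names start with the component name (aslan-, cron-, dind-), and that Prometheus already scrapes the kubelet cAdvisor metrics container_cpu_usage_seconds_total and container_memory_working_set_bytes (these are standard Kubernetes metrics, not the ZadigX built-in ones). The alert names and the severity label are illustrative.

```shell
#!/bin/sh
# Generate a Prometheus rule file for the suggested ZadigX thresholds.
# NAMESPACE is an assumption; set it to the namespace ZadigX is installed in.
NAMESPACE="${NAMESPACE:-zadig}"

# rule <component> <cpu cores> <memory MiB>
# Emits one CPU alert and one memory alert; both must hold for 2 minutes.
rule() {
  cat <<EOF
      - alert: ZadigX_${1}_HighCPU
        expr: sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="$NAMESPACE", pod=~"${1}-.*", container!=""}[2m])) > $2
        for: 2m
        labels:
          severity: warning
      - alert: ZadigX_${1}_HighMemory
        expr: sum by (pod) (container_memory_working_set_bytes{namespace="$NAMESPACE", pod=~"${1}-.*", container!=""}) > $3 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
EOF
}

# Print the complete rule group on stdout.
generate_rules() {
  printf 'groups:\n  - name: zadigx-core-components\n    rules:\n'
  rule aslan 2 4096   # 2 cores, 4 GB
  rule cron 1 512     # 1 core, 512 MB
  rule dind 4 8192    # 4 cores, 8 GB
}
```

Running generate_rules > zadigx-alerts.yaml writes the rule file, which can then be listed under rule_files in the Prometheus configuration. promtool check rules zadigx-alerts.yaml can be used to validate it before reloading.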

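Tip

Some ZadigX components support running multiple replicas; adjust the actual replica count as needed. The components that support multiple replicas are: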

  • deployment/zadig-portal
  • deployment/opa
  • deployment/resource-server
  • deployment/warpdrive
  • deployment/gateway-proxy
  • statefulSet/dind
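
One way to adjust the replica counts of the components listed above is kubectl scale. The following is a sketch under assumptions: the namespace zadig and the replica count 2 are placeholders, and the function only prints the commands unless APPLY=1 is set. A later helm upgrade may reset the replica counts, so a permanent change likely belongs in the installation parameters instead.

```shell
#!/bin/sh
# Scale the ZadigX components that support multiple replicas.
# NAMESPACE and REPLICAS are assumptions; adjust them to your installation.
NAMESPACE="${NAMESPACE:-zadig}"
REPLICAS="${REPLICAS:-2}"

# Components taken from the list above.
COMPONENTS="deployment/zadig-portal deployment/opa deployment/resource-server deployment/warpdrive deployment/gateway-proxy statefulset/dind"

# Dry run by default: prints each command. Set APPLY=1 to execute them.
scale_components() {
  for c in $COMPONENTS; do
    if [ "${APPLY:-0}" = "1" ]; then
      kubectl -n "$NAMESPACE" scale "$c" --replicas="$REPLICAS"
    else
      echo "kubectl -n $NAMESPACE scale $c --replicas=$REPLICAS"
    fi
  done
}
```

Calling scale_components prints six commands for review; APPLY=1 scale_components runs them against the current kubectl context.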

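Data High Availability

Install ZadigX on highly available MongoDB and MySQL storage, and back up the ZadigX system data on a schedule. Recommendations:

  • Use highly available databases. Taking Alibaba Cloud as an example:
    • For MySQL, use RDS MySQL High-availability Edition
    • For MongoDB, choose a dedicated instance
  • Use highly available artifact repositories:
    • Use the highly available image registry / Helm repository offered by your cloud provider
    • If you use K8s Helm Chart projects, integrate a highly available object storage service and set it as the default; recommended backup frequency: once a day
  • Back up database data regularly
    • On a public cloud, the automatic backup policy of the cloud MySQL/MongoDB instances must be configured manually
    • Recommended backup frequency: once a day
    • MinIO backup frequency: once a day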

Monitoring and Alerting Configuration

Prometheus is recommended for storing monitoring data and performance metrics, with Grafana as the visualization component.

Prometheus Configuration

  1. Add the following job to scrape_configs in the Prometheus configuration:

If ZadigX and Prometheus are in the same cluster:

    - job_name: prometheus
      metrics_path: /api/metrics
      static_configs:
        - targets:
            - aslan.<deployment namespace>.svc:25000

If ZadigX and Prometheus are in different clusters:

    - job_name: admin
      metrics_path: /api/aslan/metrics
      static_configs:
        - targets:
            - <ZadigX access domain>
      scheme: https

  2. Reload the Prometheus configuration
  3. On the Prometheus query page, run the query request_total and confirm that data is returned, which shows that the configuration works
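
The reload and the request_total check can also be done from a terminal through the Prometheus HTTP API. The helpers below are a sketch with assumptions: PROM_URL is a placeholder for your Prometheus address, the reload endpoint only responds when Prometheus was started with --web.enable-lifecycle, and has_series relies on the compact JSON that Prometheus returns.

```shell
#!/bin/sh
# Helpers for reloading Prometheus and checking that a metric has data.
# PROM_URL is an assumption; point it at your Prometheus instance.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Trigger a configuration reload (requires --web.enable-lifecycle).
reload_prometheus() {
  curl -fsS -X POST "$PROM_URL/-/reload"
}

# Run an instant query and print the raw JSON response.
query_prometheus() {
  curl -fsS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Read a query response on stdin. Succeeds only when the response
# reports success and its result array is not empty.
has_series() {
  resp=$(cat)
  case "$resp" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$resp" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}
```

For example, reload_prometheus && query_prometheus request_total | has_series && echo "request_total has data" runs both steps in sequence. If no data appears right after the reload, the first scrape may not have happened yet.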

Grafana Configuration

  1. [Optional] In Grafana, under Configuration - Data sources, add the Prometheus instance above as a data source
  2. Under Dashboards - Import, paste the following JSON to import the Grafana dashboard:

    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": 1,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 3,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 9,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 2,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "right",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "editorMode": "code",
              "exemplar": false,
              "expr": "sum by(instance) (rate(request_total{status=\"200\"}[1m]))",
              "format": "time_series",
              "hide": false,
              "instant": false,
              "interval": "",
              "legendFormat": "200",
              "range": true,
              "refId": "A"
            },
            {
              "editorMode": "code",
              "expr": "sum by(instance) (rate(request_total{status=~\"4..\"}[1m]))",
              "hide": false,
              "legendFormat": "4xx",
              "range": true,
              "refId": "B"
            },
            {
              "editorMode": "code",
              "expr": "sum by(instance) (rate(request_total{status=~\"5..\"}[1m]))",
              "hide": false,
              "legendFormat": "5xx",
              "range": true,
              "refId": "C"
            }
          ],
          "title": "QPS追踪",
          "type": "timeseries"
        },
        {
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 9,
            "w": 6,
            "x": 12,
            "y": 0
          },
          "id": 4,
          "options": {
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "lastNotNull"
              ],
              "fields": "",
              "values": false
            },
            "showThresholdLabels": false,
            "showThresholdMarkers": true
          },
          "pluginVersion": "9.4.7",
          "targets": [
            {
              "editorMode": "code",
              "expr": "running_workflows",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "运行中的工作流",
          "type": "gauge"
        },
        {
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 9,
            "w": 6,
            "x": 18,
            "y": 0
          },
          "id": 6,
          "options": {
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "lastNotNull"
              ],
              "fields": "",
              "values": false
            },
            "showThresholdLabels": false,
            "showThresholdMarkers": true
          },
          "pluginVersion": "9.4.7",
          "targets": [
            {
              "editorMode": "code",
              "expr": "pending_workflows",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "排队中的工作流",
          "type": "gauge"
        },
        {
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 9
          },
          "id": 8,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "right",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "editorMode": "code",
              "expr": "cpu",
              "legendFormat": "{{service}}",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "CPU 消耗",
          "type": "timeseries"
        },
        {
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 9
          },
          "id": 10,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "right",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "editorMode": "code",
              "expr": "memory",
              "legendFormat": "{{service}}",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "内存消耗(MB)",
          "type": "timeseries"
        },
        {
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "custom": {
                "align": "auto",
                "cellOptions": {
                  "type": "auto"
                },
                "inspect": false
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "matcher": {
                  "id": "byName",
                  "options": "慢接口"
                },
                "properties": [
                  {
                    "id": "custom.width",
                    "value": 1301
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 17
          },
          "id": 12,
          "options": {
            "footer": {
              "countRows": false,
              "fields": "",
              "reducer": [
                "sum"
              ],
              "show": false
            },
            "frameIndex": 0,
            "showHeader": true,
            "sortBy": []
          },
          "pluginVersion": "9.4.7",
          "targets": [
            {
              "editorMode": "code",
              "exemplar": false,
              "expr": "topk(10, sum by(method, handler) (api_response_time_bucket{status=\"200\",le=\"+Inf\"}) - sum by(method, handler) (api_response_time_bucket{status=\"200\",le=\"1.4\"}))",
              "format": "time_series",
              "hide": false,
              "legendFormat": "__auto",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "慢接口列表",
          "transformations": [
            {
              "id": "reduce",
              "options": {
                "includeTimeField": false,
                "labelsToFields": false,
                "mode": "seriesToRows",
                "reducers": [
                  "last"
                ]
              }
            },
            {
              "id": "sortBy",
              "options": {
                "fields": {},
                "sort": [
                  {
                    "desc": true,
                    "field": "Last"
                  }
                ]
              }
            },
            {
              "id": "organize",
              "options": {
                "excludeByName": {},
                "indexByName": {},
                "renameByName": {
                  "Field": "慢接口",
                  "Last": "请求数量"
                }
              }
            }
          ],
          "type": "table"
        }
      ],
      "refresh": "5s",
      "revision": 1,
      "schemaVersion": 38,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": []
      },
      "time": {
        "from": "now-5m",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "ZadigX 监控面板",
      "uid": "w1U8k1xVk",
      "version": 2,
      "weekStart": ""
    }

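  3. Find the dashboard titled "ZadigX 监控面板" (ZadigX monitoring dashboard) in the dashboard list and confirm that the data is displayed correctly.

FAQs

Reference: Deployment and Operations FAQs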