Getting Started with Thanos: Highly Available Prometheus

Author: 进击云原生, 2022-06-04 07:26:47
Category: Cloud computing

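In a world where thousands of services and applications are deployed across multiple infrastructures, monitoring in a highly available environment has become an essential part of every development process.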

In this article, I describe the thought process and the lessons learned from using Thanos to store Prometheus metrics from multiple clusters on an EKS multi-cluster architecture.

Introduction

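As HiredScore's products and customer base grew, we started the transition to Kubernetes and adopted it quickly. One of our most important obstacles, and probably the biggest, was the monitoring infrastructure. We already had some experience monitoring with the Prometheus / Grafana stack, and we learned that we wanted a better, highly available, and resilient infrastructure with feasible, cost-effective data retention that would also prepare us for HiredScore's rapid growth.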

The CNCF promotes several infrastructure projects that address these monitoring pain points and make it possible to build monitoring that is highly available, retains data, and is cost-effective.

Requirements

Single-point observability that aggregates all data from all clusters in any region.
A highly available and resilient infrastructure for Prometheus.
Data retention for all of our application data.
A cost-effective solution.

We chose Bitnami's kube-prometheus solution and thanos-io's kube-thanos solution. The combination worked well and met all of our requirements.

Let's meet the players:

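Prometheus: free software for event monitoring and alerting. It records real-time metrics in a time-series database built on an HTTP pull model, with flexible queries and real-time alerting.
Thanos: an open source CNCF project (originally accepted as a Sandbox project) built on Prometheus components and used to create a highly available monitoring system at global scale. It extends Prometheus seamlessly in a few simple steps.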

How does it work?

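As the diagram shows, each EKS cluster runs two Prometheus pods in the same namespace, and they monitor the cluster by scraping its metrics. Each Prometheus pod keeps the last few hours of data in a dedicated PVC, and once the configured retention time has passed the data lives only in the S3 bucket, where the Thanos sidecar has sent it. This way we save cost by keeping only a small amount of local storage and concentrating everything else in one place (S3).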

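To display data from the k8s clusters in Grafana, we created a dedicated cluster. It collects all real-time data (roughly the last 2 hours) directly from each cluster over gRPC connections to the thanos-sidecar containers (exposed on port 10901 by default) and fetches older data from the S3 bucket (ObjectStore).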

Let's dive into the implementation details:

Phase one is to deploy kube-prometheus with the Thanos sidecar in every cluster.
Phase two is to deploy kube-thanos in the "aggregation" cluster. It is responsible for collecting the real-time data from all clusters and the retained data that has been sent to the S3 bucket (ObjectStore).

Sounds great, so how do we actually do it?

Phase One

Here we focus on how to deploy and configure Prometheus together with the Thanos sidecar in each cluster we want to monitor. Create a namespace named monitoring in each cluster:

kubectl create ns monitoring

Create a storage class so that Prometheus can persist its data:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-storage-class
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: Immediate

kubectl apply -f prometheus-storage-class.yaml -n monitoring

Install kube-prometheus:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

Copy the chart values you want to configure into a local folder. The following changes need to be applied to the values:

Step 1: Make Prometheus highly available by setting the Prometheus replica count, that is, the desired number of Prometheus replicas (2 or more).

https://github.com/bitnami/charts/blob/master/bitnami/kube-prometheus/values.yaml
https://github.com/bitnami/charts/blob/46afe376ae87a5af32504bc230a25d9c7e4522e2/bitnami/kube-prometheus/values.yaml#L760

  ## @param prometheus.replicaCount Number of Prometheus replicas desired
  ##
  replicaCount: 2

Step 2: Define pod resource limits for Prometheus, so that Prometheus cannot consume all of the node's resources.

resources:
  requests:
    cpu: 512m
    memory: 3072Mi
  limits:
    cpu: 512m
    memory: 4096Mi

Step 3: Enable creation of the Thanos sidecar

thanos:
  ## @param prometheus.thanos.create Create a Thanos sidecar container
  ##
  create: true

Step 4: Change the Thanos sidecar service type from ClusterIP to LoadBalancer. This creates an AWS Classic Load Balancer (CLB) endpoint that exposes the sidecar on the gRPC port (10901). We can then use Route 53 to route this endpoint to a DNS name of the form thanos-prometheus-(cluster_name). The Thanos endpoint is exposed through prometheus.thanos.service in the values of your own cluster:

https://github.com/bitnami/charts/blob/46afe376ae87a5af32504bc230a25d9c7e4522e2/bitnami/kube-prometheus/values.yaml#L1034

service:
  type: LoadBalancer
  port: 10901
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
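
Before wiring the endpoint into the aggregation cluster, it can help to confirm that the load balancer actually accepts connections on the gRPC port. The following is a minimal sketch, not part of the original setup: the function name sidecar_reachable is hypothetical, and the check only proves that the TCP port is open, not that a Thanos sidecar is answering gRPC requests behind it.

```python
import socket


def sidecar_reachable(host: str, port: int = 10901, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    This is only a port-level check; it does not speak gRPC.
    """
    try:
        # create_connection resolves the name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

It would be called with the Route 53 name created for the cluster, for example sidecar_reachable("thanos-prometheus-mycluster.example.internal"), where that host name is a made-up placeholder. Because the load balancer is internal, the check has to run from a network that can reach it.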

Once the CLB has been created, we need to reference it in the kube-thanos manifests. We come back to this in phase two.

Step 5: Disable compaction and define retention. This is a very important step for uploading data through the Thanos sidecar:

https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects

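To upload with the Thanos sidecar, the two flags --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration must be set to the same value. By default they are set to 2 hours. The recommended Prometheus retention is no lower than 3 times the min block duration, that is, 6 hours. Additional details can be found here: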

https://thanos.io/tip/components/sidecar.md/

retention: 12h

disableCompaction: true
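
The two rules above (equal min and max block duration, retention of at least three times the min block duration) are easy to check mechanically before rolling out a values file. The sketch below is illustrative only: the helper names are hypothetical, and the parser handles only simple single-unit durations such as 2h or 12h rather than the full Prometheus duration syntax.

```python
import re
from typing import List

# Seconds per unit for the simple durations handled by this sketch.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> int:
    """Convert a single-unit duration such as '2h' or '12h' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_sidecar_settings(min_block: str, max_block: str, retention: str) -> List[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    if parse_duration(min_block) != parse_duration(max_block):
        problems.append("min and max block duration must be equal")
    if parse_duration(retention) < 3 * parse_duration(min_block):
        problems.append("retention should be at least 3x the min block duration")
    return problems


# The values used in this article: 2h blocks and 12h retention.
print(check_sidecar_settings("2h", "2h", "12h"))  # prints []
# A retention of 4h is below the recommended 6h minimum.
print(check_sidecar_settings("2h", "2h", "4h"))
```

With the 12h retention chosen here, the check passes with room to spare, which leaves some slack if an upload to S3 is delayed.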

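Step 6: Enable the object storage configuration secret. With the object storage configuration enabled, we can write data to S3 or any other supported object store, which ensures the durability of our long-term data.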

## @param prometheus.thanos.objectStorageConfig Support mounting a Secret for the objectStorageConfig of the sideCar container.
objectStorageConfig:
  secretName: thanos-objstore-config
  secretKey: thanos.yaml

The source file thanos-storage-config.yaml must have the following form:

type: s3
config:
  bucket: thanos-store #S3 bucket name
  endpoint: s3.<region>.amazonaws.com #S3 Regional endpoint
  access_key: 
  secret_key:

It is worth mentioning that we can currently use only a single S3 bucket (ObjectStore). Create the secret with the following command:

kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos.yaml=thanos-storage-config.yaml
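
When many clusters are involved, the thanos-storage-config.yaml source file could be rendered from parameters instead of being edited by hand. This is a rough sketch under stated assumptions: the function name render_objstore_config is hypothetical, the endpoint is assumed to follow the regional form s3.(region).amazonaws.com, and the credentials in the example are dummy values. Values are quoted with json.dumps, which produces double-quoted strings that are also valid YAML.

```python
import json


def render_objstore_config(bucket: str, region: str, access_key: str, secret_key: str) -> str:
    """Render an S3 object storage config with the same keys as the file shown above."""
    # Assumption: the regional S3 endpoint has the form s3.<region>.amazonaws.com.
    endpoint = f"s3.{region}.amazonaws.com"
    lines = [
        "type: s3",
        "config:",
        f"  bucket: {json.dumps(bucket)}",
        f"  endpoint: {json.dumps(endpoint)}",
        f"  access_key: {json.dumps(access_key)}",
        f"  secret_key: {json.dumps(secret_key)}",
    ]
    return "\n".join(lines) + "\n"


# Dummy values for illustration only; never commit real credentials.
print(render_objstore_config("thanos-store", "eu-west-1", "EXAMPLEKEY", "EXAMPLESECRET"))
```

The returned text can be written to thanos-storage-config.yaml and passed to the kubectl create secret command unchanged.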

Step 7: Now we can install or upgrade the Helm chart with our customizations.

helm install kube-prometheus -f values.yaml bitnami/kube-prometheus -n monitoring

or

helm upgrade kube-prometheus -f values.yaml bitnami/kube-prometheus -n monitoring

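If you have made it this far, you should now have Prometheus pods running with a Thanos sidecar container. On one hand, the sidecar serves the scraped data over gRPC to the kube-thanos manifests; on the other hand, the same sidecar sends the data (after about 2 hours) to the S3 bucket (ObjectStore). Congratulations!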

Phase Two

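Here we focus on how to deploy and configure Thanos on the main observability cluster. As mentioned earlier, it is responsible for collecting all the data from all the clusters we set up in phase one.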

For this we use the kube-thanos manifests. We found that, for our purposes, we only needed to deploy the query and store components.

Step 1: Install and customize kube-thanos. Create a namespace named thanos in the main observability cluster:

kubectl create ns thanos

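You can either clone the kube-thanos repository and use its manifests folder, or compile the kube-thanos manifests yourself. The latter does not require you to copy the whole repository, only the manifest files.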

Step 2: With phase one complete, we configure thanos-query-deployment.yaml, which handles the communication with the clusters from phase one. To do so, we add the following:

- --store=dnssrv+_grpc._tcp.thanos-prometheus-<cluster_name>.<domain>:10901

This goes into the args section and points at the Thanos sidecar gRPC endpoint that we exposed and defined above (phase one, step 4).

- args:
    - query
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:9090
    - --log.level=info
    - --log.format=logfmt
    - --query.replica-label=prometheus_replica
    - --query.replica-label=rule_replica
    - --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local:10901
    - --store=dnssrv+_grpc._tcp.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901
    - --store=dnssrv+_grpc._tcp.thanos-prometheus-<cluster_name>.<domain>:10901
    - --query.auto-downsampling
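
Each monitored cluster needs its own --store argument, so with more than a handful of clusters the list may be easier to generate than to maintain by hand. The sketch below assumes that the part of the host name elided in the listing is the cluster name followed by a DNS domain, in line with the thanos-prometheus-(cluster_name) naming from phase one. The function name, the cluster names, and the domain are all hypothetical.

```python
from typing import Iterable, List


def store_flags(cluster_names: Iterable[str], domain: str, port: int = 10901) -> List[str]:
    """Build one --store argument per monitored cluster.

    Assumes the naming scheme thanos-prometheus-<cluster_name>.<domain>.
    """
    return [
        f"--store=dnssrv+_grpc._tcp.thanos-prometheus-{name}.{domain}:{port}"
        for name in cluster_names
    ]


# Hypothetical cluster names and domain, printed as YAML list items
# ready to paste into the args section.
for flag in store_flags(["prod-eu", "prod-us"], "example.internal"):
    print(f"- {flag}")
```

The output is one list item per cluster in the same format as the args entries above.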

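Step 3: Next, we handle the communication between thanos-store and the S3 bucket (ObjectStore) that we configured in phase one to receive the data. As in phase one, we need a secret; its name is referenced in the env section of thanos-store-statefulSet.yaml, which injects it into the Thanos store pod: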

env:
  - name: OBJSTORE_CONFIG
    valueFrom:
      secretKeyRef:
        key: thanos.yaml
        name: thanos-objectstorage

We can then reuse the same source file from phase one, thanos-storage-config.yaml, to create a secret for thanos-store:

kubectl -n thanos create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-config.yaml

Step 4: Apply the manifests:

kubectl apply -f manifests -n thanos

The loop should now be closed. Thanos receives real-time data from the other clusters through the thanos-query deployment, and serves the retained data from the S3 bucket (ObjectStore) through the thanos-store StatefulSet.
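
Because every cluster runs two Prometheus replicas, the querier sees each series twice, differing only in the replica label. The --query.replica-label arguments tell it which labels to ignore when merging. The sketch below illustrates the idea only and is not Thanos's actual algorithm: it drops the replica labels and takes the union of samples per timestamp. The data model and names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]   # a label set, e.g. {("cluster", "a"), ...}
Sample = Tuple[int, float]            # (unix timestamp, value)

# The replica labels passed to thanos query via --query.replica-label.
REPLICA_LABELS = ("prometheus_replica", "rule_replica")


def deduplicate(series: Dict[Labels, List[Sample]]) -> Dict[Labels, List[Sample]]:
    """Merge series that differ only in their replica labels."""
    merged: Dict[Labels, Dict[int, float]] = {}
    for labels, samples in series.items():
        # Identity of a series once the replica labels are ignored.
        key = frozenset((k, v) for k, v in labels if k not in REPLICA_LABELS)
        points = merged.setdefault(key, {})
        for ts, value in samples:
            # Keep the first value seen for each timestamp.
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


replica_0 = frozenset({("__name__", "up"), ("cluster", "a"), ("prometheus_replica", "prometheus-0")})
replica_1 = frozenset({("__name__", "up"), ("cluster", "a"), ("prometheus_replica", "prometheus-1")})
result = deduplicate({
    replica_0: [(0, 1.0), (30, 1.0)],    # this replica missed the scrape at t=60
    replica_1: [(30, 1.0), (60, 1.0)],   # this replica missed the scrape at t=0
})
print(result[frozenset({("__name__", "up"), ("cluster", "a")})])
# prints [(0, 1.0), (30, 1.0), (60, 1.0)]
```

The example shows why two replicas improve availability: a gap in one replica is filled by the other, and Grafana sees a single continuous series.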

Conclusion

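Thanos changed how we think about making Prometheus highly available, durable, and cost-effective. Implementing Thanos and Prometheus across many Kubernetes clusters takes a lot of effort, but it is worth it if you care about a highly available Prometheus.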
