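Step-by-Step Tutorial: Deploying kube-prometheus 0.10 on a K8s 1.23 Cluster (Including Mirror Registry Replacement for Mainland China and Network Policy Pitfalls)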
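Building a K8s Monitoring Stack from Scratch: A Hands-On kube-prometheus Guide with In-Depth Tuning

Once a containerized application grows past a certain scale, traditional log inspection and performance analysis stop working almost immediately. During a major e-commerce sales event last year, an operations team suffered a cluster-wide cascading failure because it could not detect Pod memory leaks in real time. That is the fundamental reason to deploy a dedicated monitoring system in a Kubernetes cluster: container orchestration without observability is like driving an F1 car blindfolded.

1. Environment Preparation: Building a Standardized K8s Foundation

1.1 Matching Cluster Versions Precisely

Pairing Kubernetes 1.23 with kube-prometheus 0.10 is not an accident. This combination has been validated by the community over a long period, and its API compatibility matrix is as follows:

| Component | Minimum K8s version | Recommended version |
|---|---|---|
| Prometheus Operator | 1.19 | 1.21 |
| kube-state-metrics | 1.18 | 1.20 |
| Grafana | No requirement | 8.5 |

Key verification steps:

    # Confirm the API server version
    kubectl version --short | grep Server

    # Fetch the release-0.10 branch and work from the repository root
    git clone https://github.com/prometheus-operator/kube-prometheus.git -b release-0.10
    cd kube-prometheus

1.2 Pre-Configuring the Infrastructure

Create a standardized workspace under /opt/k8s:

    mkdir -p /opt/k8s/{manifests,images,backup}
    chmod 755 /opt/k8s

2. Image Acceleration: Solving the Deployment Problem in Mainland China

2.1 Image Replacement Strategy

quay.io and k8s.gcr.io are hard to reach from mainland China, so use a multi-level fallback strategy.

Preferred: the USTC mirror. The dots in the pattern are escaped so that only the literal hostname matches.

    sed -i 's/quay\.io/quay.mirrors.ustc.edu.cn/g' manifests/*.yaml

Fallback: the Alibaba Cloud mirror. Using # as the sed delimiter avoids escaping the slash in the replacement. Check that the rewritten path actually exists on the mirror before applying, since mirrors do not always preserve the upstream repository layout.

    sed -i 's#k8s\.gcr\.io#registry.aliyuncs.com/google_containers#g' manifests/kubeStateMetrics-deployment.yaml

Emergency option: a local image archive. The image must already be present on the host, and the tag should be the one your manifests actually reference.

    docker save -o /opt/k8s/images/prometheus-v2.30.3.tar quay.mirrors.ustc.edu.cn/prometheus/prometheus:v2.30.3

2.2 Pre-Loading Images

Pull images in parallel to speed up preparation:

    # Extract every image reference, skip bare "image:" keys (such as those in CRD schemas),
    # de-duplicate, then pull four at a time
    grep -rh 'image:' manifests/ \
      | awk -F 'image: *' '$2 != "" {print $2}' \
      | sort -u \
      | xargs -P 4 -I {} docker pull {}

3. Deployment in Practice: A Pitfall-Avoidance Playbook

3.1 Exposing Services

Edit manifests/grafana-service.yaml to expose Grafana through a NodePort. The nodePort must fall inside the API server's NodePort range (30000-32767 by default); a value such as 33000 is rejected unless --service-node-port-range has been widened.

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: http
        port: 3000
        targetPort: http
        nodePort: 30300   # any free port within 30000-32767
      selector:
        app.kubernetes.io/name: grafana

3.2 Core Deployment Flow

Deploy in stages so that the CRDs are established before the resources that depend on them are created:

    # Stage 1: namespace and CRDs
    kubectl apply --server-side -f manifests/setup

    # Stage 2: block until every CRD reports Established
    kubectl wait --for=condition=Established crd --all --timeout=300s

    # Stage 3: the monitoring stack itself
    kubectl apply -f manifests/

4. Network Policy and Security Tuning

4.1 Access Control Allowlist

A better alternative to simply deleting the NetworkPolicy is to allow only a known source range:

    # manifests/prometheus-networkPolicy.yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: prometheus-allow-specific
      namespace: monitoring
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
      ingress:
      - from:
        - ipBlock:
            cidr: 192.168.1.0/24

4.2 Persistent Storage

Add a PVC template to the Prometheus resource:

    # manifests/prometheus-prometheus.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: standard
            resources:
              requests:
                storage: 100Gi

5. Advanced Monitoring Scenarios

5.1 Collecting Custom Metrics

Configure a PodMonitor to scrape a custom application:

    # manifests/custom-podmonitor.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: payment-service
    spec:
      selector:
        matchLabels:
          app: payment
      podMetricsEndpoints:
      - port: metrics
        interval: 15s

5.2 Managing Alert Rules

An example of a business-level alert rule:

    # manifests/custom-alert.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: business-alerts
    spec:
      groups:
      - name: payment.rules
        rules:
        - alert: HighPaymentErrorRate
          expr: rate(payment_errors_total[5m]) > 0.1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.instance }}"

After importing the Kubernetes cluster monitoring dashboard with ID 315 into Grafana, we noticed that CPU usage on one Node stayed above 90%. The following PromQL query narrowed it down by instance:

    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

From there we traced the load to a batch-processing Pod that had no resource limits set. Before the monitoring system went live, a problem like this typically took hours to find; now it takes about 30 seconds.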
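Looking back at the registry substitution in section 2.1, it can help to wrap the two sed commands in a small script so the result can be reviewed before anything is pulled or applied. The following is a minimal sketch rather than part of kube-prometheus: the function names `rewrite_registries` and `list_images` are hypothetical, GNU sed is assumed (as found on typical Linux nodes), and the script does not check whether a mirror actually serves a given image.

```shell
#!/usr/bin/env bash
# Sketch: apply the registry substitutions from section 2.1 and list the result.
# Assumptions: GNU sed (for in-place -i without a suffix); the mirror hostnames
# are the ones named in the article and are NOT verified to host every image.
set -euo pipefail

# rewrite_registries DIR
# Rewrites image registries in DIR/*.yaml in place (hypothetical helper).
rewrite_registries() {
  local dir="$1"
  # Dots are escaped so only the literal hostnames match; '#' is the sed
  # delimiter so the slash in the replacement needs no escaping.
  # The trailing slash in each pattern keeps a second run from matching again.
  find "$dir" -maxdepth 1 -name '*.yaml' -exec sed -i \
    -e 's#quay\.io/#quay.mirrors.ustc.edu.cn/#g' \
    -e 's#k8s\.gcr\.io/#registry.aliyuncs.com/google_containers/#g' \
    {} +
}

# list_images DIR
# Prints the unique image references found in DIR/*.yaml (hypothetical helper).
list_images() {
  local dir="$1"
  find "$dir" -maxdepth 1 -name '*.yaml' -exec grep -h 'image:' {} + \
    | awk -F 'image: *' '$2 != "" {print $2}' \
    | sort -u
}
```

Sourcing the script and running `rewrite_registries manifests && list_images manifests` prints the rewritten image list, which can be checked against the mirrors by hand before being piped into `docker pull`. Running the rewrite twice leaves the files unchanged, because the mirror hostnames no longer contain the original patterns.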