1. Environment Preparation

1.1 Check Kubernetes Cluster Status

    # Check node status
    kubectl get nodes -o wide
    # Check component status (componentstatuses is deprecated since Kubernetes v1.19)
    kubectl get cs
    # Check storage classes
    kubectl get storageclass

1.2 Create Required Directories

    # Create the working directory
    mkdir -p k8s-monitoring
    cd k8s-monitoring
    mkdir -p manifests logs

2. Resource Monitoring System Deployment

2.1 Create the Monitoring Namespace

    kubectl create namespace monitoring

2.2 Prepare the Monitoring Configuration

2.2.1 Create the Prometheus Stack Values File

    cat > prometheus-values.yaml << 'EOF'
    # Replace the placeholders in this file:
    # INTERNAL_REGISTRY - internal image registry address
    # STORAGE_CLASS - name of the actual StorageClass
    # GRAFANA_PASSWORD - Grafana administrator password
    global:
      imageRegistry: INTERNAL_REGISTRY
      imagePullSecrets: [regcred]
    prometheusOperator:
      serviceMonitorSelectorNilUsesHelmValues: false
      serviceMonitorSelector: {}
      podMonitorSelectorNilUsesHelmValues: false
      podMonitorSelector: {}
    prometheus:
      prometheusSpec:
        retention: 10d
        scrapeInterval: 30s
        evaluationInterval: 30s
        resources:
          requests:
            memory: 400Mi
            cpu: 200m
          limits:
            memory: 2Gi
            cpu: 1000m
        storageSpec:
          volumeClaimTemplate:
            spec:
              accessModes: [ReadWriteOnce]
              storageClassName: STORAGE_CLASS
              resources:
                requests:
                  storage: 50Gi
        serviceMonitorSelectorNilUsesHelmValues: false
        serviceMonitorSelector: {}
        ruleSelectorNilUsesHelmValues: false
        ruleSelector: {}
    kube-state-metrics:
      resources:
        requests:
          memory: 32Mi
          cpu: 10m
        limits:
          memory: 128Mi
          cpu: 100m
    nodeExporter:
      resources:
        requests:
          memory: 30Mi
          cpu: 10m
        limits:
          memory: 50Mi
          cpu: 200m
    grafana:
      adminUser: admin
      adminPassword: GRAFANA_PASSWORD
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: STORAGE_CLASS
    alertmanager:
      enabled: false
    EOF

    # Replace the placeholders with sed, or edit the file by hand.
    # The values below are examples only; use a strong Grafana password in production.
    sed -i 's/INTERNAL_REGISTRY/registry.internal.company.com/g' prometheus-values.yaml
    sed -i 's/STORAGE_CLASS/standard/g' prometheus-values.yaml
    sed -i 's/GRAFANA_PASSWORD/admin123/g' prometheus-values.yaml

2.3 Install Prometheus Stack

    # 1. Download the chart in an internet-connected environment
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm pull prometheus-community/kube-prometheus-stack --version 45.0.0
    # 2. Transfer the chart archive to the internal network
    #    (the archive is assumed to be in the current directory)
    # 3. Extract and install
    tar -xzf kube-prometheus-stack-45.0.0.tgz
    helm install prometheus-stack ./kube-prometheus-stack \
      -n monitoring \
      -f prometheus-values.yaml

2.4 Verify the Monitoring Installation

    # Watch Pod status (press Ctrl+C to stop watching)
    kubectl get pods -n monitoring -w
    # Once all Pods are Running, perform the following checks
    kubectl get all -n monitoring
    # Check PersistentVolumeClaims
    kubectl get pvc -n monitoring
    # Test the Prometheus service (run in the background)
    kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
    # Open http://localhost:9090 in a browser
    # Test the Grafana service (run in the background)
    kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80 &
    # Open http://localhost:3000 in a browser
    # Username: admin, password: the value substituted for GRAFANA_PASSWORD
    # Stop both port forwards
    kill %1 %2

3. Log Collection System Deployment

3.1 Create the Logging Namespace

    kubectl create namespace logging

3.2 Deploy RBAC Permissions

3.2.1 Create the ServiceAccount and ClusterRole

    cat > fluent-bit-rbac.yaml << 'EOF'
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: fluent-bit
      namespace: logging
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: fluent-bit-read
    rules:
    # The empty string selects the core API group
    - apiGroups: [""]
      resources:
      - namespaces
      - pods
      - pods/log
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: fluent-bit-read
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: fluent-bit-read
    subjects:
    - kind: ServiceAccount
      name: fluent-bit
      namespace: logging
    EOF

    kubectl apply -f fluent-bit-rbac.yaml

3.3 Create the Fluent Bit Configuration

3.3.1 Create the ConfigMap

The heredoc delimiter is quoted so that the shell does not expand ${NODE_NAME} and ${HOST_IP}; Fluent Bit resolves them from the container environment at runtime.

    cat > fluent-bit-configmap.yaml << 'EOF'
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fluent-bit-config
      namespace: logging
    data:
      fluent-bit.conf: |
        [SERVICE]
            Flush         5
            Log_Level     info
            Daemon        off
            Parsers_File  parsers.conf
            HTTP_Server   On
            HTTP_Listen   0.0.0.0
            HTTP_Port     2020
            # Enables the /api/v1/health endpoint used by the probes
            Health_Check  On

        @INCLUDE input-kubernetes.conf
        @INCLUDE filter-kubernetes.conf
        @INCLUDE output-file.conf

      input-kubernetes.conf: |
        [INPUT]
            Name              tail
            Tag               kube.*
            Path              /var/log/containers/*.log
            Parser            docker
            # Kept on the writable flb-storage mount; /var/log itself is mounted read-only
            DB                /var/log/flb-storage/flb_kube.db
            DB.Sync           Normal
            Mem_Buf_Limit     5MB
            Skip_Long_Lines   On
            Refresh_Interval  10

      filter-kubernetes.conf: |
        [FILTER]
            Name                kubernetes
            Match               kube.*
            Kube_URL            https://kubernetes.default.svc:443
            Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
            Kube_Tag_Prefix     kube.var.log.containers.
            Merge_Log           On
            Merge_Log_Key       log_processed
            Keep_Log            Off
            K8S-Logging.Parser  On
            K8S-Logging.Exclude Off
            Labels              On
            Annotations         Off

        [FILTER]
            Name   modify
            Match  *
            Add    node_name ${NODE_NAME}
            Add    host_ip ${HOST_IP}

      output-file.conf: |
        # Note: in the file output plugin, Template formats each written record,
        # and the file name comes from File or, when File is unset, the record tag.
        # Inspect the generated files and adjust Format/Template if needed.
        [OUTPUT]
            Name         file
            Match        *
            Path         /var/log/k8s-logs/
            Format       template
            Template     {time}-{kubernetes[namespace_name]}-{kubernetes[pod_name]}-{kubernetes[container_name]}.log
            Retry_Limit  False

      parsers.conf: |
        [PARSER]
            Name         docker
            Format       json
            Time_Key     time
            Time_Format  %Y-%m-%dT%H:%M:%S.%LZ
            Time_Keep    On
            Decode_Field_As escaped_utf8 log do_next
            Decode_Field_As json log
    EOF

    kubectl apply -f fluent-bit-configmap.yaml

3.4 Deploy the Fluent Bit DaemonSet

3.4.1 Create the DaemonSet

Two details differ from a literal reading of the original manifest. The output directory /var/log/k8s-logs gets its own writable mount, because /var/log is mounted read-only. The custom projected token volume is omitted: it carried only a token with audience fluent-bit and no ca.crt, so the kubernetes filter could not authenticate to the API server. The ServiceAccount token and ca.crt that Kubernetes mounts automatically at /var/run/secrets/kubernetes.io/serviceaccount are used instead.

    cat > fluent-bit-daemonset.yaml << 'EOF'
    # Replace INTERNAL_REGISTRY with the internal image registry address
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluent-bit
      namespace: logging
    spec:
      selector:
        matchLabels:
          k8s-app: fluent-bit-logging
      template:
        metadata:
          labels:
            k8s-app: fluent-bit-logging
        spec:
          serviceAccountName: fluent-bit
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: fluent-bit
            image: INTERNAL_REGISTRY/fluent/fluent-bit:2.1.9
            imagePullPolicy: IfNotPresent
            env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            resources:
              requests:
                memory: 50Mi
                cpu: 10m
              limits:
                memory: 200Mi
                cpu: 500m
            volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
            - name: flb-storage
              mountPath: /var/log/flb-storage/
            - name: k8s-logs
              mountPath: /var/log/k8s-logs/
            livenessProbe:
              httpGet:
                path: /api/v1/health
                port: 2020
              initialDelaySeconds: 30
              periodSeconds: 30
            readinessProbe:
              httpGet:
                path: /api/v1/health
                port: 2020
              initialDelaySeconds: 5
              periodSeconds: 10
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          - name: varlibdockercontainers
            hostPath:
              path: /var/lib/docker/containers
          - name: fluent-bit-config
            configMap:
              name: fluent-bit-config
          - name: flb-storage
            hostPath:
              path: /var/log/flb-storage
              type: DirectoryOrCreate
          - name: k8s-logs
            hostPath:
              path: /var/log/k8s-logs
              type: DirectoryOrCreate
    EOF

    # Replace the image registry address
    sed -i 's/INTERNAL_REGISTRY/registry.internal.company.com/g' fluent-bit-daemonset.yaml
    kubectl apply -f fluent-bit-daemonset.yaml

3.5 Configure Log Storage on the Nodes

3.5.1 Run the Log Directory Setup on Every Node

    # Create the setup script
    cat > setup-node-logs.sh << 'EOF'
    #!/bin/bash
    # Create the log storage directories
    LOG_DIR="/var/log/k8s-logs"
    FLB_STORAGE_DIR="/var/log/flb-storage"
    mkdir -p "$LOG_DIR"
    mkdir -p "$FLB_STORAGE_DIR"
    chmod 755 "$LOG_DIR"
    chmod 755 "$FLB_STORAGE_DIR"

    # Create the logrotate configuration
    cat > /etc/logrotate.d/k8s-pod-logs << 'LOGROTATE_EOF'
    /var/log/k8s-logs/*.log {
        daily
        rotate 30
        compress
        delaycompress
        missingok
        notifempty
        create 0644 root root
        dateext
        dateformat -%Y%m%d
        sharedscripts
        postrotate
            # With dateext, rotated files are named <name>.log-YYYYMMDD.gz
            find /var/log/k8s-logs/ -name "*.log-*.gz" -mtime +60 -delete
        endscript
    }
    LOGROTATE_EOF

    echo "Node log configuration complete"
    echo "Log directory: $LOG_DIR"
    echo "Fluent Bit storage directory: $FLB_STORAGE_DIR"
    EOF

    # Make the script executable
    chmod +x setup-node-logs.sh

    # Copy the script to every Node and run it
    # Note: SSH access to all Nodes is required
    # Example with an assumed node list
    NODES="node1 node2 node3"
    for NODE in $NODES; do
      scp setup-node-logs.sh "$NODE:/tmp/"
      ssh "$NODE" "sudo /tmp/setup-node-logs.sh"
    done

3.6 Verify the Log Collection Deployment

No Service named fluent-bit is created in this guide, so the health check forwards directly to one of the DaemonSet Pods.

    # Check DaemonSet status
    kubectl get daemonset -n logging
    # Check Pod status
    kubectl get pods -n logging -o wide
    # View Fluent Bit logs
    kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=20
    # Check the Fluent Bit configuration
    # (exec-based checks need an image that ships these utilities; the standard
    #  fluent-bit image is distroless, so a -debug tag may be required)
    FLB_POD=$(kubectl get pod -n logging -l k8s-app=fluent-bit-logging -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n logging -it "$FLB_POD" -- cat /fluent-bit/etc/fluent-bit.conf
    # Test the Fluent Bit health check
    kubectl port-forward -n logging "pod/$FLB_POD" 2020:2020 &
    sleep 2
    curl http://localhost:2020/api/v1/health
    kill %1

4. Functional Verification Tests

4.1 Create a Test Application

    # Create the test namespace
    kubectl create namespace test-monitoring

    # Deploy the test application
    cat > test-deployment.yaml << 'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-app
      namespace: test-monitoring
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: test-app
      template:
        metadata:
          labels:
            app: test-app
        spec:
          containers:
          - name: nginx
            image: nginx:alpine
            ports:
            - containerPort: 80
            resources:
              requests:
                memory: 128Mi
                cpu: 100m
              limits:
                memory: 256Mi
                cpu: 200m
          - name: log-generator
            image: busybox
            command: ["sh", "-c"]
            args:
            - |
              counter=0
              while true; do
                echo "Test log message $counter at $(date)" > /proc/1/fd/1
                counter=$((counter+1))
                sleep 10
              done
            resources:
              requests:
                memory: 64Mi
                cpu: 50m
              limits:
                memory: 128Mi
                cpu: 100m
    EOF

    kubectl apply -f test-deployment.yaml

    # Create the test service
    kubectl expose deployment test-app -n test-monitoring --port=80

4.2 Verify Monitoring

    # Wait for the Pods to start (press Ctrl+C to stop watching)
    kubectl get pods -n test-monitoring -w

    # View monitoring metrics
    # 1. Access Prometheus
    kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
    # Run these queries in the Prometheus UI:
    # container_cpu_usage_seconds_total{namespace="test-monitoring"}
    # container_memory_working_set_bytes{namespace="test-monitoring"}

    # 2. Access Grafana
    kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80 &
    # Log in to Grafana and open the preinstalled Kubernetes dashboards

    # Stop both port forwards
    kill %1 %2

4.3 Verify Log Collection

    # Find the nodes running the test Pods
    kubectl get pods -n test-monitoring -o wide
    # Log in to one of those nodes and inspect the log files
    # (assuming a Pod runs on node1)
    ssh node1 "ls -la /var/log/k8s-logs/ | head -10"
    ssh node1 "tail -f /var/log/k8s-logs/*test-app*.log"
    # Alternatively, check the Fluent Bit collection status
    kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=50 | grep -i test-app

4.4 Verify Dynamic Scaling

    # Scale the test application up
    kubectl scale deployment test-app -n test-monitoring --replicas=5
    # Check whether logs from the new Pods are collected
    kubectl get pods -n test-monitoring -o wide
    # Log in to the nodes running the new Pods and inspect the log files
    # Scale the test application down
    kubectl scale deployment test-app -n test-monitoring --replicas=2

5. Clean Up Test Resources

    # Remove the test application
    kubectl delete namespace test-monitoring

    # Optional: remove the monitoring and logging systems
    # kubectl delete namespace monitoring
    # kubectl delete namespace logging

    # Remove the log directories on the Nodes if needed
    # Run on every Node:
    # sudo rm -rf /var/log/k8s-logs/*
    # sudo rm -rf /var/log/flb-storage/*

6. Maintenance Command Reference

6.1 Monitoring System Maintenance

    # View monitoring component status
    kubectl get all -n monitoring
    # View Prometheus storage usage
    # (the Pod name depends on the release name; confirm it with kubectl get pods -n monitoring)
    kubectl exec -n monitoring -it prometheus-prometheus-stack-prometheus-0 -- df -h
    # Restart Prometheus if needed
    kubectl delete pod -n monitoring -l app.kubernetes.io/name=prometheus
    # Apply updated monitoring configuration
    # (uses the locally extracted chart, since the chart repository is not reachable from the internal network)
    helm upgrade prometheus-stack ./kube-prometheus-stack -n monitoring -f prometheus-values.yaml

6.2 Logging System Maintenance

    # View Fluent Bit status
    kubectl get daemonset -n logging
    kubectl get pods -n logging -o wide
    # Restart Fluent Bit (rolling restart)
    kubectl rollout restart daemonset fluent-bit -n logging
    # View log collection statistics
    FLB_POD=$(kubectl get pod -n logging -l k8s-app=fluent-bit-logging -o jsonpath='{.items[0].metadata.name}')
    kubectl port-forward -n logging "pod/$FLB_POD" 2020:2020 &
    sleep 2
    curl http://localhost:2020/api/v1/metrics
    kill %1
    # Check disk space (on every Node)
    df -h /var/log
    du -sh /var/log/k8s-logs/

6.3 Log Rotation Management

    # Trigger log rotation manually (on every Node)
    sudo logrotate -f /etc/logrotate.d/k8s-pod-logs
    # Inspect what logrotate would do (-d is a dry run and changes nothing)
    sudo logrotate -d /etc/logrotate.d/k8s-pod-logs
    # Remove old logs (uncompressed files older than 30 days, archives older than 60 days)
    find /var/log/k8s-logs/ -name "*.log" -mtime +30 -delete
    find /var/log/k8s-logs/ -name "*.log-*.gz" -mtime +60 -delete

7. Troubleshooting

7.1 Common Checks

Replace the names in angle brackets with actual values.

    # 1. Pod fails to start
    kubectl describe pod <pod-name> -n <namespace>
    kubectl logs <pod-name> -n <namespace> --previous
    # 2. Image pull failure
    kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
    # 3. Storage volume problems
    kubectl describe pvc <pvc-name> -n <namespace>
    # 4. Permission problems
    kubectl auth can-i get pods --as=system:serviceaccount:logging:fluent-bit
    # 5. Network connectivity problems (requires curl in the container image)
    kubectl exec -n logging <fluent-bit-pod> -- curl -k https://kubernetes.default.svc:443/healthz

7.2 Troubleshooting Log Collection

    # Check the Fluent Bit configuration
    kubectl exec -n logging <fluent-bit-pod> -- cat /fluent-bit/etc/fluent-bit.conf
    # Check log file permissions
    kubectl exec -n logging <fluent-bit-pod> -- ls -la /var/log/containers/
    # Enable debug mode:
    # edit the ConfigMap, change Log_Level to debug, then restart the DaemonSet

8. Upgrade and Extension

8.1 Upgrade the Monitoring System

    # View the current version
    helm list -n monitoring
    # Upgrade to a new version
    # (these two commands need access to the chart repository; on the internal
    #  network, pull the new chart as in section 2.3 and upgrade from the local directory)
    helm repo update
    helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f prometheus-values.yaml
    # Roll back if needed
    helm rollback prometheus-stack 1 -n monitoring

8.2 Extend Log Collection

    # Increase the Fluent Bit resource limits:
    # edit fluent-bit-daemonset.yaml, change the resources section, then apply it
    # Add new log filtering rules:
    # edit fluent-bit-configmap.yaml and add new filters to filter-kubernetes.conf

Deployment Confirmation Checklist

- All Prometheus Stack Pods are running normally
- Grafana is reachable and login works
- Prometheus returns monitoring metrics for queries
- Fluent Bit is running on every Node
- The /var/log/k8s-logs directory exists on every Node
- The logrotate configuration is installed
- Test application logs are collected
- Monitoring metrics display correctly
- The dynamic scaling test passes

Notes

- All images must be imported into the internal image registry in advance
- The StorageClass must be configured to match the actual environment
- In production, always change the Grafana administrator password
- Adjust resource requests and limits to the size of the cluster
- Check disk space regularly so that logs do not fill the disk
- Back up important configuration, such as Grafana dashboards, on a regular schedule

This deployment guide gives the complete steps for deploying the monitoring and log collection systems from scratch. Replace the placeholders with values for your environment and run the commands in order.
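
Because the guide depends on every placeholder being replaced before the manifests are applied, that step can be checked mechanically. The following is a minimal sketch and not part of the original procedure: `check_placeholders` is a hypothetical helper name, and the token list assumes the three placeholders used in this document (INTERNAL_REGISTRY, STORAGE_CLASS, GRAFANA_PASSWORD).

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the original guide).
# Reports placeholder tokens that are still present in the given files.
# Returns 0 when every file exists and contains no placeholder, 1 otherwise.
check_placeholders() {
    cp_status=0
    for cp_file in "$@"; do
        # A missing file is treated as a failure rather than silently skipped.
        if [ ! -f "$cp_file" ]; then
            echo "missing file: $cp_file" >&2
            cp_status=1
            continue
        fi
        # grep -n prints each offending line with its line number.
        if grep -nE 'INTERNAL_REGISTRY|STORAGE_CLASS|GRAFANA_PASSWORD' "$cp_file"; then
            echo "unreplaced placeholder in $cp_file" >&2
            cp_status=1
        fi
    done
    return "$cp_status"
}
```

One way to use it is to call `check_placeholders prometheus-values.yaml fluent-bit-daemonset.yaml` after the sed commands and to continue with helm install or kubectl apply only when it returns 0. Note that the check also flags placeholder names that remain in YAML comments, which is intentional here since the sed commands replace those occurrences as well.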