【原创】企业级监控prometheus监控你的鸡群（一）

Prometheus是一个流行的开源监控解决方案，用于记录和查询系统的指标数据。它提供了多种方式来监控目标的状态、警报并进行查询分析。

安装前准备

安装前先完成NetBird组网，如果你不想把监控指标端口暴露在公网上的话。请安装好docker和docker-compose，如何安装见官方文档。

NODE客户端安装

docker-compose.yml

  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - '/:/host:ro,rslave'

启动

docker-compose up -d
#确认安装是否成功，如果输出一大堆指标数据说明成功
curl localhost:9100/metrics

服务端安装

密码生成器

apt install apache2-utils
htpasswd -nBC 12 '' | tr -d ':\n'
#输入密码和确认密码，并将得到的加密后的密码牢记

docker-compose.yml

  prometheus:
    container_name: prometheus
    image: prom/prometheus:latest
    volumes:
      - /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/prometheus/htpasswd.yml:/etc/prometheus/htpasswd.yml
      - /data/prometheus/alert/:/etc/prometheus/alert/
      - /data/prometheus/data/:/prometheus
    ports:
      - '9090:9090'
    restart: always
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.config.file=/etc/prometheus/htpasswd.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention=30d'
      - '--web.enable-admin-api'
      - '--web.enable-lifecycle'
      - '--web.external-url=https://prometheus.xxx.com/'
  alertmanager:
    container_name: alertmanager
    image: prom/alertmanager:latest
    volumes:
      - /data/alertmanager/:/alertmanager/
      - /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /data/prometheus/htpasswd.yml:/etc/alertmanager/htpasswd.yml
    ports:
      - '9093:9093'
    restart: always
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.config.file=/etc/alertmanager/htpasswd.yml'
      - '--web.external-url=https://alertmanager.xxx.com/'
  grafana:
    container_name: grafana
    image: grafana/grafana:latest
    volumes:
      - /data/grafana:/var/lib/grafana
    ports:
      - '3000:3000'
    restart: always

参数说明 --storage.tsdb.retention=30d 数据保留30天 --web.enable-admin-api 控制对admin HTTP API的访问，其中包括删除时间序列等功能例子：curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="xxx.xxx.xxx.xxx:9100"}' --web.enable-lifecycle 支持热更新例子：curl -X POST localhost:9090/-/reload 注意：在容器外修改yaml配置文件是徒劳的，因为容器外修改文件容器内是不会生效的，具体原因就不讲了，感兴趣的同学可以自行搜索了解。所以在容器外修改配置文件只能restart容器。

创建挂载目录

cd /data
mkdir prometheus alertmanager grafana
chmod a+rw grafana/ alertmanager/
cd /data/prometheus/
mkdir data
chmod a+rw data/

prometheus配置

密码文件路径：/data/prometheus/htpasswd.yml，内容如下

basic_auth_users:
  用户名: 之前生成的加密密码

配置文件路径：/data/prometheus/prometheus.yml

global:
  scrape_interval:     60s
  evaluation_interval: 60s
 
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 服务端IP:9093
    basic_auth:
      username: 用户名
      password: 明文密码

rule_files:
  - "alert/*rules.yml"
   
scrape_configs:
  - job_name: 'prometheus'
    basic_auth:
      username: 用户名
      password: 明文密码
    static_configs:
    - targets: ['服务端IP:9090']
 
  - job_name: 'node'
    static_configs:
    - targets: ['客户端IP:9100']
      labels:
        name: '机器名'
        vendor: '鸡场名称'
        group: '分类，比如国家' 
    - targets: ['其他节点客户端IP:9100']
      labels:
        name: '机器名'
        vendor: '鸡场名称'
        group: '分类'

  - job_name: 'alertmanager'
    basic_auth:
      username: 用户名
      password: 明文密码
    static_configs:
    - targets: ['服务端IP:9093']

IP查看方式：

netbird status

NODE鸡告警规则配置

配置文件路径：/data/prometheus/alert/alert-node-rules.yml

groups:
  - name: node_usage_record_rules
    interval: 1m
    rules:
    - record: cpu:usage:rate1m
      expr: (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[3m])) by (instance,vendor,account,group,name)) * 100
    - record: mem:usage:rate1m
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
  - name: node-alert
    rules:
    - alert: NodeDown
      expr: up{job="node"} == 0
      for: 5m
      labels:
        severity: critical
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} down"
        description: "Instance: {{ $labels.instance }} 已经宕机 5分钟"
        value: "{{ $value }}"
          
    - alert: NodeCpuHigh
      expr: (1-sum(increase(node_cpu_seconds_total{job="node",mode="idle"} [5m]) ) by(instance) / sum(increase(node_cpu_seconds_total{job="node"} [5m]) ) by(instance) )*100 > 80
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} cpu使用率过高"
        description: "CPU 使用率超过 80%"
        value: "{{ $value }}"
  
    - alert: NodeCpuIowaitHigh
      expr: avg by (instance) (rate(node_cpu_seconds_total{job="node",mode="iowait"}[5m])) * 100 > 50
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} cpu iowait 使用率过高"
        description: "CPU iowait 使用率超过 50%"
        value: "{{ $value }}"
  
    - alert: NodeLoad5High
      expr: node_load5 > (count by (instance) (node_cpu_seconds_total{job="node",mode='system'})) * 1.2
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} load(5m) 过高"
        description: "Load(5m) 过高，超出cpu核数 1.2倍"
        value: "{{ $value }}"
  
    - alert: NodeMemoryHigh
      expr: (1 - node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"}) * 100 > 90
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} memory 使用率过高"
        description: "Memory 使用率超过 90%"
        value: "{{ $value }}"
  
    - alert: NodeDiskRootHigh
      expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs"}) * 100 > 90
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} disk 分区使用率过高"
        description: "Disk 分区使用率超过 90%"
        value: "{{ $value }}"
  
    - alert: NodeDiskBootHigh
      expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/boot"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/boot"}) * 100 > 80
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} disk(/boot 分区) 使用率过高"
        description: "Disk(/boot 分区) 使用率超过 80%"
        value: "{{ $value }}"
  
    - alert: NodeDiskReadHigh
      expr: irate(node_disk_read_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} disk 读取字节数 速率过高"
        description: "Disk 读取字节数 速率超过 20 MB/s"
        value: "{{ $value }}"
  
    - alert: NodeDiskWriteHigh
      expr: irate(node_disk_written_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} disk 写入字节数 速率过高"
        description: "Disk 写入字节数 速率超过 20 MB/s"
        value: "{{ $value }}"
          
    - alert: NodeDiskReadRateCountHigh
      expr: irate(node_disk_reads_completed_total{job="node"}[5m]) > 3000
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} disk iops 每秒读取速率过高"
        description: "Disk iops 每秒读取速率超过 3000 iops"
        value: "{{ $value }}"
  
    - alert: NodeDiskWriteRateCountHigh
      expr: irate(node_disk_writes_completed_total{job="node"}[5m]) > 3000
      for: 5m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} disk iops 每秒写入速率过高"
        description: "Disk iops 每秒写入速率超过 3000 iops"
        value: "{{ $value }}"

    - alert: NodeFilefdAllocatedPercentHigh
      expr: node_filefd_allocated{job="node"} / node_filefd_maximum{job="node"} * 100 > 80
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} filefd 打开百分比过高"
        description: "Filefd 打开百分比 超过 80%"
        value: "{{ $value }}"
  
    - alert: NodeNetworkNetinBitRateHigh
      expr: avg by (instance) (irate(node_network_receive_bytes_total{device=~"eth0|docker0|wt0"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} network 接收比特数 速率过高"
        description: "Network 接收比特数 速率超过 20MB/s"
        value: "{{ $value }}"
  
    - alert: NodeNetworkNetoutBitRateHigh
      expr: avg by (instance) (irate(node_network_transmit_bytes_total{device=~"eth0|docker0|wt0"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} network 发送比特数 速率过高"
        description: "Network 发送比特数 速率超过 20MB/s"
        value: "{{ $value }}"
          
    - alert: NodeNetworkNetinPacketErrorRateHigh
      expr: avg by (instance) (irate(node_network_receive_errs_total{device=~"eth0|docker0|wt0"}[1m])) > 15
      for: 3m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} 接收错误包 速率过高"
        description: "Network 接收错误包 速率超过 15个/秒"
        value: "{{ $value }}"
  
    - alert: NodeProcessBlockedHigh
      expr: node_procs_blocked{job="node"} > 10
      for: 10m
      labels:
        severity: warning
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} 当前被阻塞的任务的数量过多"
        description: "Process 当前被阻塞的任务的数量超过 10个"
        value: "{{ $value }}"
  
    - alert: NodeTimeOffsetHigh
      expr: abs(node_timex_offset_seconds{job="node"}) > 3 * 60
      for: 2m
      labels:
        severity: info
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} 时间偏差过大"
        description: "Time 节点的时间偏差超过 3m"
        value: "{{ $value }}"

相关的监控规则可参考：https://samber.github.io/awesome-prometheus-alerts/

alertmanager配置

配置文件路径：/data/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'             #邮箱smtp服务器代理，启用SSL发信, 端口一般是465
  smtp_from: 'xxx@qq.com'              #发送邮箱名称
  smtp_auth_username: 'xxx@qq.com'              #邮箱名称
  smtp_auth_password: 'xxxxxxxxxx'                #邮箱密码或授权码
  smtp_require_tls: false
 
route:
  receiver: 'default'
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
  group_by: ['alertname']
 
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
   
receivers:
- name: 'default'
  email_configs:
  - to: 'xxx@qq.com'
    send_resolved: true
  webhook_configs:
  - url: 'http://IP:端口/robot/send'
    send_resolved: true

启动

docker-compose up -d
#查看日志是否有报错，未见报错信息说明启动成功
docker logs -f grafana
docker logs -f prometheus
docker logs -f alertmanager

nginx配置

    server {
        listen       443 ssl;
        server_name  prometheus.xxx.com;

        ssl_certificate      /etc/nginx/cert/xxx.com.pem;
        ssl_certificate_key  /etc/nginx/cert/xxx.com.key;

        access_log logs/prometheus.access.log main;
        error_log logs/prometheus.error.log  notice;

        ssl_session_timeout  1d;
        ssl_session_cache shared:MozSSL:10m;
        ssl_session_tickets off;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:HIGH:!aNULL:!MD5:!RC4:!DHE;
        ssl_prefer_server_ciphers   off;

        location / {
            proxy_set_header Host $host;
            proxy_pass http://服务端IP:9090/;
        }
    }

    server {
        listen 443 ssl;
        server_name grafana.xxx.com;

        ssl_certificate      /etc/nginx/cert/xxx.com.pem;
        ssl_certificate_key  /etc/nginx/cert/xxx.com.key;

        access_log  logs/grafana.access.log  main;
        error_log   logs/grafana.error.log notice;

        ssl_session_timeout  1d;
        ssl_session_cache shared:MozSSL:10m;
        ssl_session_tickets off;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:HIGH:!aNULL:!MD5:!RC4:!DHE;
        ssl_prefer_server_ciphers   off;

        location / {
            add_header 'Access-Control-Allow-Origin' '*';
            proxy_pass http://服务端IP:3000/;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "Upgrade";
            proxy_set_header Host $http_host;
            proxy_http_version 1.1;
        }
    }

    server {
        listen       443 ssl;
        server_name  alertmanager.xxx.com;

        ssl_certificate      /etc/nginx/cert/xxx.com.pem;
        ssl_certificate_key  /etc/nginx/cert/xxx.com.key;

        access_log  logs/alertmanager.access.log  main;
        error_log   logs/alertmanager.error.log notice;

        ssl_session_timeout  1d;
        ssl_session_cache shared:MozSSL:10m;
        ssl_session_tickets off;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:HIGH:!aNULL:!MD5:!RC4:!DHE;
        ssl_prefer_server_ciphers   off;

        location / {
            proxy_set_header Host $host;
            proxy_pass http://服务端IP:9093/;
        }
    }

重启nginx后接入cf

初始化grafana

在浏览器访问https://grafana.xxx.com 在登录框输入用户名密码：admin/admin 登录后完成密码修改完成后进入系统，点击右上角头像，选择Profile，完成用户名、名称、邮箱修改，下次你就可以用修改后的用户名或邮箱登录了。

查看prometheus

新建浏览器窗口访问：https://prometheus.xxx.com/ 输入prometheus.yml配置文件中的用户名密码，进入系统。点击菜单栏的Status，选择Targets 在界面看到UP表示健康指标正常采集，DOWN表示客户端连接失败

配置grafana

配置数据源

回到grafana 点击左上角菜单图标，选择Add new connection 在搜索栏中输入prom,选择prometheus 输入以下内容,prometheus的IP端口，开启认证，prometheus配置的用户名密码。然后拉倒页面最底下，点击Save & Test，提示“Successfully queried the Prometheus API.”说明完成。

导入node监控面板

点击左上角菜单图标，选择Dashboards 在搜索框右侧点击NEW，选择Import 输入模版编号8919，点击Load 8919模版介绍点击导入，由于我这边模板已经存在，被提示覆盖导入返回Dashboards，点击新导入的面板名称你就可以看到漂亮的监控页面啦！

其他实例监控

prometheus非常强大，除了监控你的鸡群，还可以监控redis，mysql等等，这里是java监控的一个例子。

即时通讯告警接入

基于Bark,请根据这篇文档部署。 prometheus-webhook-bark 源码链接: https://pan.baidu.com/s/1OxCFdtyOGTFGJCBALl8ByA 提取码: 14n9 如果你懂JAVA开发,请下载prometheus-webhook-bark.tar.gz编译如果你懂JAVA运行环境部署，请下载prometheus-webhook-bark.jar自行部署如果你只会docker，请下载prometheus-webhook-bark.latest.tar部署以下是docker的部署方式: 加载镜像

docker load < prometheus-webhook-bark.latest.tar

docker-compose

  prometheus-webhook-bark:
    image: prometheus-webhook-bark:latest
    container_name: prometheus-webhook-bark
    restart: always
    volumes:
      - /data/app/prometheus-webhook-bark/:/opt/resources/
    ports:
      - '8084:8084'

配置文件路径 /data/app/prometheus-webhook-bark/application.properties

server.port=8084
spring.application.name=prometheuswebhookbark
bark.dk=xxx  修改成你的bark key
bark.openapi=https://bark.xxx.com/push 修改成你的bark域名

启动后，请手工关闭一只鸡的node客户端,如果部署无问题的话，你的iphone手机、邮箱5分钟后将收到告警信息【设置了延迟5分钟】同时，你的prometheus和alertmanager界面上可看到告警信息如果你4小时内不去修复，还会继续收到告警信息，如果你修复了故障，将收到恢复的信息。

【原创】企业级监控prometheus监控你的鸡群（一）

安装前准备

NODE客户端安装

docker-compose.yml

服务端安装

密码生成器

docker-compose.yml

创建挂载目录

prometheus配置

NODE鸡告警规则配置

alertmanager配置

nginx配置

初始化grafana

查看prometheus

配置grafana

配置数据源

导入node监控面板

其他实例监控

即时通讯告警接入

推荐阅读

Ubuntu 扩容LVM存储

CVE-2021-4034漏洞复现

评论