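I. Introduction

At its core, Prometheus is a single standalone binary. Its main features are a pull-based collection model, a built-in time series database (TSDB), a powerful query language (PromQL), visualization, and an open ecosystem.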

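1. Storage and compute layer

Prometheus Server contains both the storage engine and the compute engine.

Retrieval is the data collection component; it actively pulls data from Pushgateway or from exporters.

Service discovery dynamically discovers the targets to monitor.

TSDB is the core data store and serves queries.

HTTP server exposes the HTTP service to external clients.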

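2. Collection layer

The collection layer handles two kinds of jobs: short-lived jobs and long-lived jobs.

Short-lived jobs: push their metrics to Pushgateway through its API when they exit.

Long-lived jobs: the Retrieval component pulls data directly from the job or from an exporter.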
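As a rough illustration of the short-lived job path, the sketch below pushes one gauge to Pushgateway with curl. The address localhost:9091 (Pushgateway's default port), the helper names, and the job and metric names are all assumptions made for the example; none of them come from the setup described here.

```shell
#!/usr/bin/env bash
# Assumption: Pushgateway is reachable at this address (9091 is its default port).
PUSHGATEWAY_ADDR="${PUSHGATEWAY_ADDR:-localhost:9091}"

# Render one gauge sample in the Prometheus text exposition format.
render_gauge() {
  printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# Build the push URL for a job name (hypothetical helper).
push_url() {
  printf 'http://%s/metrics/job/%s\n' "$PUSHGATEWAY_ADDR" "$1"
}

# Push a single gauge for the given job; needs curl and a running Pushgateway.
# Arguments: job name, metric name, value.
push_gauge() {
  render_gauge "$2" "$3" | curl --silent --fail --data-binary @- "$(push_url "$1")"
}
```

A batch script could call `push_gauge nightly_backup backup_last_success_timestamp_seconds "$(date +%s)"` as its last step, after which Prometheus picks the value up by scraping Pushgateway.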

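3. Application layer

The application layer has two main parts: AlertManager and data visualization.

AlertManager integrates with PagerDuty, a paid monitoring and alerting service, and delivers notifications by SMS, phone call, and email.

Data visualization: Prometheus's built-in web UI, Grafana, and other clients built on the API.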

II. Hands-on: installing Prometheus and Grafana with Docker

1. Prepare a consistent environment

Install Docker, then disable the firewall and SELinux.

2. Pull the images

docker pull prom/prometheus
docker pull prom/alertmanager
docker pull grafana/grafana

3. Start the components

Start prometheus-webhook-dingtalk

docker run -d -p 8060:8060 -v /data/prom/config.yml:/etc/prometheus-webhook-dingtalk/config.yml --name alertdingtalk timonwong/prometheus-webhook-dingtalk

Start alertmanager

docker run -d -p 9093:9093 -p 9094:9094 -v /data/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager 

Start prometheus

docker run -d -p 9090:9090 \
-v /data/prom/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /data/prom/alert-rules.yml:/etc/prometheus/alert-rules.yml \
-v /data/prom/data:/prometheus --name prometheus prom/prometheus:latest

Start grafana

docker run -d -p 3000:3000 -v /data/prom/grafana:/var/lib/grafana --name=grafana grafana/grafana:latest
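Once the containers are started, a quick health check can confirm that the ports are answering. The sketch below assumes the services are reachable on localhost and uses the health endpoints that Prometheus and Alertmanager (`/-/healthy`) and Grafana (`/api/health`) expose; the function names are made up for this example.

```shell
#!/usr/bin/env bash
# Print "name url" pairs for the health endpoints.
# Assumption: the host defaults to localhost; pass another host as $1.
health_endpoints() {
  local host="${1:-localhost}"
  printf '%s %s\n' \
    prometheus   "http://${host}:9090/-/healthy" \
    alertmanager "http://${host}:9093/-/healthy" \
    grafana      "http://${host}:3000/api/health"
}

# Request each endpoint and report OK or FAIL; needs curl and running containers.
check_stack() {
  local name url
  health_endpoints "$1" | while read -r name url; do
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
    fi
  done
}
```

Running `check_stack` (or `check_stack 192.168.188.2` from another machine) prints one line per component; a FAIL line is a hint to look at `docker logs` for that container.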

Reference YAML configuration files:

alertmanager.yml

global:
  resolve_timeout: 5m
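  smtp_smarthost: 'smtp.exmail.qq.com:465'                # SMTP relay server; mail is sent over SSL, and the port is usually 465
  smtp_from: 'test@qq.com'              # sender address
  smtp_auth_username: 'test@qq.com'              # mailbox account name
  smtp_auth_password: 'passwd'                # mailbox password or authorization code
  smtp_require_tls: false
route:
  receiver: 'default-receiver'    # alerts that match none of the child routes below stay at the root node and go to "default-receiver"
  group_wait: 30s                 # initial wait before sending the first notification for a group; default 30s
  group_interval: 5m              # wait before notifying about new alerts added to a group that was already notified; usually 5m or more
  repeat_interval: 1h             # how long to wait before re-sending a notification that has already been sent
  group_by: [alertname]  # labels used to group alerts

  routes:
  - receiver: 'bigdata-pager'    # every alert labeled team=bigdata matches this child route; add the label under labels in alert-rules.yml
    group_wait: 10s
    match:
      team: bigdata
receivers:                        # receivers: who the alerts are sent to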
- name: 'default-receiver'
  email_configs:
  - to: 'xx@qq.com,xx@qq.com'

- name: 'bigdata-pager'
  email_configs:
  - to: 'xxx@qq.com,xx@qq.com'

prometheus.yml

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:       # address of the Alertmanager component
  alertmanagers:
  - static_configs:
    - targets: [ '192.168.188.2:9093']
 
rule_files:  # alert rule files to load
  - "*rules.yml"
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.188.2:9090']
       
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.188.3:9100']
 
  - job_name: 'alertmanager'
    static_configs:
      - targets: [ '192.168.188.2:9093']

alert-rules.yml

groups:
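- name: Host status monitoring alerts
  rules:
  - alert: Host status
    # Do not join with node_uname_info here: that series disappears when the
    # target is down, so the join would return nothing and the alert would never fire.
    expr: up == 0
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "{{$labels.instance}}: server is down"
      description: "{{$labels.instance}}: server has been unreachable for more than 5 minutes"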
  - alert: Host CPU usage
    expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: CPU usage is too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): CPU usage is above 85% (current: {{ $value }}%)"
  - alert: Host memory usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 * on(instance) group_left(nodename) (node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}({{$labels.nodename}}): memory usage is above 90% (current: {{ $value }}%)"
  - alert: Host disk usage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk space usage is too high!"
      description: "{{ $labels.instance }}({{$labels.nodename}}): disk space usage is above 85% (current: {{$value}}%)"
  - alert: Disk IO performance
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 60
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk IO utilization is too high!"
      description: "{{ $labels.instance }}({{$labels.nodename}}): disk IO utilization is above 60% (current: {{$value}}%)"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab * on(instance) group_left(nodename) (node_uname_info) > 10000
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: too many TCP connections in ESTABLISHED state!"
      description: "{{ $labels.instance }}({{$labels.nodename}}): more than 10000 established TCP connections (current: {{$value}})"
  - alert: Inbound network
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 819200
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth is too high!"
      description: "{{ $labels.instance }}({{$labels.nodename}}): inbound network bandwidth has stayed above 800M for 3 minutes (current: {{$value}})"
  - alert: Outbound network
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 819200
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: outbound network bandwidth is too high!"
      description: "{{ $labels.instance }}({{$labels.nodename}}): outbound network bandwidth has stayed above 800M for 3 minutes (current: {{$value}})"

  - alert: node_exporter monitoring
    expr: up{job="node"} == 0
    for: 15s
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: node_exporter has been down for more than 15s!"
      description: "{{ $labels.instance }}({{$labels.job}}) has been down for more than 15s!"

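III. Installing node-exporter and related exporters

1. Installing node-exporter

node-exporter has to read the real hardware information of the host, so installing it with Docker is not recommended; install it from the binary package instead.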

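Docker installation (run only one of the two commands; both use the container name node-exporter)

# Basic form
docker run -d -p 9100:9100 --name node-exporter prom/node-exporter:latest
# Host network plus read-only host mounts, with the exporter pointed at the mounted paths
docker run -d --net=host -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro" --name node-exporter prom/node-exporter:latest --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs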

Binary installation

Download page: https://github.com/prometheus/node_exporter/releases

Pick the linux-amd64 build, then download and extract it.

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
# Rename
mv node_exporter-1.5.0.linux-amd64 node_exporter

Starting it (run from inside the node_exporter directory):
# Without keeping logs
nohup ./node_exporter >/dev/null 2>&1 &
# Keep logs in /var/log/node_exporter.log
nohup ./node_exporter >/var/log/node_exporter.log 2>&1 &



Starting it with systemd (the unit assumes the node_exporter directory has been placed under /usr/local/)

# The quoted 'EOF' stops the shell from expanding $MAINPID while writing the file
cat >/usr/lib/systemd/system/node_exporter.service  <<'EOF'
[Unit]
Description=node_exporter
[Service]
ExecStart=/usr/local/node_exporter/node_exporter   --collector.systemd --collector.systemd.unit-include=(docker|portal|sshd).service
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

systemctl enable node_exporter.service

systemctl start node_exporter.service

2. Installing mysqld_exporter

First create a dedicated MySQL user named exporter for monitoring the database:

mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'see_teampass' WITH MAX_USER_CONNECTIONS 5;

mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';

Note: MAX_USER_CONNECTIONS caps the number of connections the exporter user can open, so that monitoring cannot overload the database. Not every MySQL/MariaDB version supports this option.

mysql> FLUSH PRIVILEGES;

Download mysqld_exporter from the releases page and install it

https://github.com/prometheus/mysqld_exporter/releases

tar xvf mysqld_exporter-0.14.1.linux-amd64.tar.gz  -C /usr/local/

cd /usr/local/

mv mysqld_exporter-0.14.1.linux-amd64 mysqld_exporter

cd /usr/local/mysqld_exporter


Create the connection file

cat > .my.cnf <<EOF
[client]
user=exporter
password=see_teampass
EOF

Starting it with systemd

cat >/usr/lib/systemd/system/mysqld_exporter.service  <<EOF
[Unit]
Description=mysqld_exporter
[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

systemctl enable mysqld_exporter

systemctl start mysqld_exporter

3. Installing blackbox_exporter

Binary installation

wget  https://github.com/prometheus/blackbox_exporter/releases/download/v0.23.0/blackbox_exporter-0.23.0.linux-amd64.tar.gz

tar -zxvf  blackbox_exporter-0.23.0.linux-amd64.tar.gz -C /usr/local

mv /usr/local/blackbox_exporter-0.23.0.linux-amd64   /usr/local/blackbox_exporter

Configure the probes. Running cat /usr/local/blackbox_exporter/blackbox.yml shows the modules, each with its probe type (prober):

modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

Register it as a startup service. Contents of /usr/lib/systemd/system/blackbox_exporter.service:

[Unit]
Description=blackbox_exporter

[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter  --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Check that it is running correctly

You can also probe baidu.com by visiting http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com

In this URL the module parameter selects the probe to use and the target parameter names the probe target. The probe result comes back as metrics:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004366919
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.09053371
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 81
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_http_duration_seconds{phase="processing"} 0.04430544
probe_http_duration_seconds{phase="resolve"} 0.004366919
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.00019256
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.26330408e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 81
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.6694721e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

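From the returned samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics tied to the quality of site access, which helps administrators find faults and problems proactively.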
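To pull the few values that matter out of a probe response, a small filter is enough. The sketch below is only an illustration: the function names are invented here, and it handles only unlabeled metrics such as probe_success, not labeled ones like probe_http_duration_seconds{phase="connect"}.

```shell
#!/usr/bin/env bash
# Print the value of an unlabeled metric read from stdin (text exposition format).
# Comment lines start with "#", so their first field never equals the metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Summarize one /probe response read from stdin.
probe_summary() {
  local body
  body="$(cat)"
  printf 'success=%s status=%s duration=%ss\n' \
    "$(printf '%s\n' "$body" | metric_value probe_success)" \
    "$(printf '%s\n' "$body" | metric_value probe_http_status_code)" \
    "$(printf '%s\n' "$body" | metric_value probe_duration_seconds)"
}
```

With blackbox_exporter running, `curl -s 'http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com' | probe_summary` prints the three fields on one line.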

Append to prometheus.yml

# Website monitoring
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://192.168.xx.xx:3928/anywhere/#/before          # some web page
        - https://www.baidu.com
        labels:
          instance: http_status
          group: web
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

# Ping check
  - job_name: 'ping_status'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - 192.168.188.200
        labels:
          instance: 'ping_status'
          group: icmp
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:9115

# Port monitoring
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - 192.168.188.10:3928
        - 192.168.188.22:3306
        - 192.168.188.200:8090
        labels:
          instance: 'port_status'
          group: port
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.188.68:xx
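The relabel_configs in these jobs can be read as three steps: the listed target is copied into the target URL parameter, that parameter is copied into the instance label, and the scrape address is replaced with the blackbox_exporter address. The sketch below prints the request this roughly amounts to for two of the targets above. The function name is invented for the example, and Prometheus also URL-encodes the parameter values, which this sketch does not do.

```shell
#!/usr/bin/env bash
# Mimic the relabel steps for one target:
#   1. __address__    -> __param_target  (the listed target becomes ?target=...)
#   2. __param_target -> instance        (label only; not part of the URL)
#   3. __address__    := exporter address (the scrape goes to blackbox_exporter)
scrape_url() {
  local exporter="$1" module="$2" target="$3"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

# Values taken from the http_status and ping_status jobs.
scrape_url 192.168.188.68:9115 http_2xx https://www.baidu.com
scrape_url 192.168.188.68:9115 icmp 192.168.188.200
```

Requesting one of the printed URLs with curl is a convenient way to check a module and target by hand before adding them to prometheus.yml.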

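Grafana dashboard template ID: 9965

Alert rules can watch the probe_success metric.

Whether an icmp, tcp, http, or post probe is healthy can be read from probe_success:

probe_success == 0 ## connectivity failure

probe_success == 1 ## connectivity OK

Alerting checks the same metric: when it equals 0, an alert is triggered.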
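A minimal sketch of such a rule, written to a file with a heredoc, could look like this. The file name, group name, alert name, and the 1m duration are assumptions chosen for the example. The file name ends in rules.yml so that a rule_files glob such as "*rules.yml" would match it once the file sits next to prometheus.yml.

```shell
#!/usr/bin/env bash
# Assumption: hypothetical file name; override with RULES_FILE if needed.
RULES_FILE="${RULES_FILE:-blackbox-rules.yml}"

# The quoted 'EOF' keeps the shell from expanding {{ $labels... }} in the templates.
cat > "$RULES_FILE" <<'EOF'
groups:
- name: blackbox-probes
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }} probe failed"
      description: "{{ $labels.instance }} (job {{ $labels.job }}) has failed its probe for more than 1 minute"
EOF

echo "wrote $RULES_FILE"
```

In the Docker setup, the file would also need to be mounted into /etc/prometheus in the container, and `promtool check rules` can validate it before Prometheus is restarted.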


For setting up monitoring of Hadoop big-data components, see:

https://github.com/tamtran96/hadoop-jmx-exporter/tree/master/dashboards

Reporting data through pushgateway: https://cloud.tencent.com/developer/article/1531821

For more Prometheus exporters, see:

https://blog.51cto.com/u_14065119/4166081

Nginx reverse proxy for Grafana and Prometheus:

https://blog.csdn.net/Rambo_Yang/article/details/108061345

https://grafana.com/tutorials/run-grafana-behind-a-proxy/#1

Other references:

https://it.cha138.com/mysql/show-99068.html

https://www.infoq.cn/article/sxextntuttxduedeagiq

https://www.prometheus.wang/exporter/install_blackbox_exporter.html