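I. Introduction
At its core, Prometheus is a single standalone binary. Its main features are a pull model, a built-in time series database (TSDB), the powerful query language PromQL, visualization, and an open ecosystem.
1. Storage and compute layer
Prometheus Server contains both the storage engine and the compute engine.
Retrieval is the data-fetching component. It actively pulls data from Pushgateway or from exporters.
Service discovery dynamically discovers the targets to monitor.
TSDB is the core data store and query engine.
HTTP server exposes the HTTP service to the outside.
2. Collection layer
The collection layer handles two kinds of jobs: short-lived jobs and long-lived jobs.
Short-lived jobs: push their metrics to Pushgateway through its API when they exit.
Long-lived jobs: the Retrieval component pulls data directly from the job or from an exporter.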
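The push path for short-lived jobs can be sketched as follows. This is a minimal illustration, assuming a Pushgateway listening on its default port 9091; the gateway address, the job name nightly_backup, and the metric backup_duration_seconds are hypothetical. The sketch builds the request body in the text exposition format and targets the /metrics/job/<job> endpoint; PUT replaces every metric previously pushed for that job.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway, job, metrics):
    """Build a PUT request that replaces every metric in the group of `job`."""
    # Text exposition format: one "name value" line per sample, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{gateway.rstrip('/')}/metrics/job/{urllib.parse.quote(job, safe='')}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")


def push(gateway, job, metrics, timeout=5):
    """Send the request. Raises urllib.error.URLError if the gateway is unreachable."""
    request = build_push_request(gateway, job, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical gateway address, job name, and metric, for illustration only.
request = build_push_request(
    "http://127.0.0.1:9091", "nightly_backup", {"backup_duration_seconds": 42.5}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"), end="")
```

A batch job would call push(...) as its last step before exiting; Prometheus then scrapes the Pushgateway like any other target.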
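3. Application layer
The application layer has two main parts: Alertmanager and data visualization.
Alertmanager: integrates with PagerDuty (a paid monitoring and alerting system) and can send notifications by SMS, phone call, or email.
Data visualization: the Prometheus built-in web UI, Grafana, and other clients built on the API.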
II. Hands-on: installing Prometheus and Grafana with Docker
1. Prepare the environment
Install Docker, then disable the firewall and SELinux.
2. Pull the images
docker pull prom/prometheus
docker pull prom/alertmanager
docker pull grafana/grafana
3. Start the components
Start prometheus-webhook-dingtalk
docker run -d -p 8060:8060 -v /data/prom/config.yml:/etc/prometheus-webhook-dingtalk/config.yml --name alertdingtalk timonwong/prometheus-webhook-dingtalk
Start Alertmanager
docker run -d -p 9093:9093 -p 9094:9094 -v /data/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager
Start Prometheus
docker run -d -p 9090:9090 \
-v /data/prom/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /data/prom/alert-rules.yml:/etc/prometheus/alert-rules.yml \
-v /data/prom/data:/prometheus --name prometheus prom/prometheus:latest
Start Grafana
docker run -d -p 3000:3000 -v /data/prom/grafana:/var/lib/grafana --name=grafana grafana/grafana:latest
Reference versions of the YAML configuration files:
alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465' # SMTP relay server; 465 is the usual port when sending over SSL
  smtp_from: 'test@qq.com' # sender address
  smtp_auth_username: 'test@qq.com' # mailbox account
  smtp_auth_password: 'passwd' # mailbox password or authorization code
  smtp_require_tls: false
route:
  receiver: 'default-receiver' # alerts that match none of the child routes below stay at the root and go to "default-receiver"
  group_wait: 30s # how long to wait before sending the first notification for a new group (default 30s)
  group_interval: 5m # how long to wait before notifying about new alerts in an existing group, usually 5m or more
  repeat_interval: 1h # how long to wait before re-sending a notification that has already been sent
  group_by: [alertname] # labels used to group alerts
  routes:
  - receiver: 'bigdata-pager' # matches every alert labeled team=bigdata; add that label under labels in alert-rules.yml
    group_wait: 10s
    match:
      team: bigdata
receivers: # who receives the alerts
- name: 'default-receiver'
  email_configs:
  - to: 'xx@qq.com,xx@qq.com'
- name: 'bigdata-pager'
  email_configs:
  - to: 'xxx@qq.com,xx@qq.com'
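How the route tree above picks a receiver can be illustrated with a small model. This is a deliberately simplified sketch, not Alertmanager's implementation: it assumes a single level of child routes with exact-match labels only, and it ignores regex matchers, nested routes, and the continue option. The function name pick_receiver is hypothetical.

```python
def pick_receiver(route, labels):
    """Return the receiver of the first child route whose `match` labels all
    equal the alert's labels; fall back to the root receiver otherwise."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            return child["receiver"]
    return route["receiver"]


# Mirrors the route section of alertmanager.yml above.
ROUTE = {
    "receiver": "default-receiver",
    "routes": [{"receiver": "bigdata-pager", "match": {"team": "bigdata"}}],
}

print(pick_receiver(ROUTE, {"alertname": "Host CPU usage", "team": "bigdata"}))
print(pick_receiver(ROUTE, {"alertname": "Host CPU usage"}))
```

An alert only reaches bigdata-pager if its rule sets team: bigdata under labels; every other alert falls through to default-receiver.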
prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting: # address of the Alertmanager instance
  alertmanagers:
  - static_configs:
    - targets: ['192.168.188.2:9093']
rule_files: # alerting rule files
  - "*rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.188.2:9090']
  - job_name: 'node'
    static_configs:
    - targets: ['192.168.188.3:9100']
  - job_name: 'alertmanager'
    static_configs:
    - targets: ['192.168.188.2:9093']
alert-rules.yml
groups:
- name: Host status alerts
  rules:
  - alert: Host status
    # Joining up with node_uname_info returns nothing once the host is down
    # (node_uname_info is no longer scraped), so up is compared directly.
    expr: up == 0
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "{{$labels.instance}}: host is down"
      description: "{{$labels.instance}}: host has been unreachable for more than 5 minutes"
  - alert: Host CPU usage
    expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: CPU usage is too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): CPU usage is above 85% (current: {{ $value }}%)"
  - alert: Host memory usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 * on(instance) group_left(nodename) (node_uname_info) > 90
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}({{$labels.nodename}}): memory usage is above 90% (current: {{ $value }}%)"
  - alert: Host disk usage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk space usage is too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): disk space usage is above 85% (current: {{$value}}%)"
  - alert: Disk IO performance
    # Fires when the share of time the disks are busy exceeds 60%, as the description states.
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 * on(instance) group_left(nodename) (node_uname_info) > 60
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: disk IO utilization is too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): disk IO utilization is above 60% (current: {{$value}}%)"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab * on(instance) group_left(nodename) (node_uname_info) > 10000
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: too many established TCP connections"
      description: "{{ $labels.instance }}({{$labels.nodename}}): more than 10000 established TCP connections (current: {{$value}})"
  - alert: Inbound network
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 819200
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth is too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): inbound bandwidth has stayed above the configured threshold for 3 minutes (current: {{$value}})"
  - alert: Outbound network
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) * on(instance) group_left(nodename) (node_uname_info) > 819200
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: outbound network bandwidth is too high"
      description: "{{ $labels.instance }}({{$labels.nodename}}): outbound bandwidth has stayed above the configured threshold for 3 minutes (current: {{$value}})"
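  - alert: node_exporter monitoring
    # The job label must match a job_name in prometheus.yml; the config above names it 'node'.
    expr: up{job="node"} == 0
    for: 15s
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: node_exporter has been down for more than 15s"
      description: "{{ $labels.instance }}({{$labels.job}}) has been down for more than 15s"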
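The arithmetic behind the CPU rule can be worked through by hand. The sketch below is an illustration with made-up sample values, not Prometheus code: it assumes two cores and two scrapes 15 seconds apart, computes the per-second idle rate from the last two samples of each core the way irate does (ignoring counter resets), averages across cores, and converts the result to a usage percentage.

```python
def irate(previous, last):
    """Per-second increase between the last two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = previous, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_per_core):
    """100 - avg(irate(idle)) * 100 across cores, mirroring the rule expression."""
    rates = [irate(previous, last) for previous, last in idle_samples_per_core]
    return 100 - sum(rates) / len(rates) * 100


# Hypothetical idle-counter samples: (timestamp in seconds, idle seconds so far).
samples = [
    ((0, 1000.0), (15, 1003.75)),   # core 0 was idle 3.75 s out of 15 s
    ((0, 2000.0), (15, 2001.875)),  # core 1 was idle 1.875 s out of 15 s
]
usage = cpu_usage_percent(samples)
print(f"cpu usage: {usage:.2f}%, above threshold: {usage > 85}")
```

In the real rule the condition additionally has to hold for the whole for: 3m window before the alert fires.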
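III. Installing the exporters
1. Installing node-exporter
node-exporter has to read hardware and OS information from the real host, so running it in Docker is not recommended; install it from the binary release instead.
Docker install (use one of the two commands below; the second gives the container access to the host's /proc, /sys, and root filesystem)
docker run -d -p 9100:9100 --name node-exporter prom/node-exporter:latest
docker run -d --net=host -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro" --name node-exporter prom/node-exporter:latest --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs
Binary install
Download page: https://github.com/prometheus/node_exporter/releases
Pick the linux-amd64 build, then download and extract it.
# download
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# extract
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
# move into place (the systemd unit below expects /usr/local/node_exporter)
mv node_exporter-1.5.0.linux-amd64 /usr/local/node_exporter
Ways to start it:
# without keeping logs
nohup /usr/local/node_exporter/node_exporter >/dev/null 2>&1 &
# write logs to /var/log/node_exporter.log
nohup /usr/local/node_exporter/node_exporter >/var/log/node_exporter.log 2>&1 &
Start with systemd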
# quote the heredoc delimiter so the shell does not expand $MAINPID
cat >/usr/lib/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=node_exporter
[Service]
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-include=(docker|portal|sshd).service
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
---
systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service
2. Installing mysqld_exporter
First create a dedicated MySQL user named exporter for monitoring the database.
mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'see_teampass' WITH MAX_USER_CONNECTIONS 5;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Note: MAX_USER_CONNECTIONS caps the number of connections the exporter user can open, so that monitoring cannot overload the database. Not every MySQL/MariaDB version supports this option.
mysql> FLUSH PRIVILEGES;
Download and install mysqld_exporter
https://github.com/prometheus/mysqld_exporter/releases
tar xvf mysqld_exporter-0.14.1.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
mv mysqld_exporter-0.14.1.linux-amd64 mysqld_exporter
cd /usr/local/mysqld_exporter
Create the connection file
----------
cat > .my.cnf <<EOF
[client]
user=exporter
password=see_teampass
EOF
---------
Start with systemd
cat >/usr/lib/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=mysqld_exporter
[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
---------
systemctl daemon-reload
systemctl enable mysqld_exporter
systemctl start mysqld_exporter
3. Installing blackbox exporter
Binary install
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.23.0/blackbox_exporter-0.23.0.linux-amd64.tar.gz
tar -zxvf blackbox_exporter-0.23.0.linux-amd64.tar.gz -C /usr/local
mv /usr/local/blackbox_exporter-0.23.0.linux-amd64 /usr/local/blackbox_exporter
Configure the probes in /usr/local/blackbox_exporter/blackbox.yml; each module names its probe type (prober):
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
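Register it as a systemd service in /usr/lib/systemd/system/blackbox_exporter.service:
[Unit]
Description=blackbox_exporter
[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Check that the service is running, for example with systemctl status blackbox_exporter.
You can also probe baidu.com by requesting http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com
The module parameter in the URL selects the probe to use and the target parameter names the probe target. The probe result is returned as metrics: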
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004366919
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.09053371
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 81
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_http_duration_seconds{phase="processing"} 0.04430544
probe_http_duration_seconds{phase="resolve"} 0.004366919
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.00019256
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.26330408e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 81
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.6694721e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
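From the returned samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics related to access quality, which helps administrators find faults and problems proactively.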
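Reading those values programmatically takes only a few lines. The sketch below is a minimal parser for output like the sample above, assuming each sample line has the form "series value" with no trailing timestamp; it is not a full implementation of the exposition format, and the function name parse_samples is hypothetical.

```python
def parse_samples(text):
    """Map each series (metric name plus optional labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# A short excerpt of the probe output shown above.
PROBE_OUTPUT = """\
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
probe_http_duration_seconds{phase="connect"} 0.040772637
probe_dns_lookup_time_seconds 0.004366919
probe_success 1
"""

samples = parse_samples(PROBE_OUTPUT)
print("probe succeeded:", samples["probe_success"] == 1)
print("status code:", int(samples["probe_http_status_code"]))
```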
Append to scrape_configs in prometheus.yml
  # website monitoring
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
    - targets:
      - http://192.168.xx.xx:3928/anywhere/#/before # a web page
      - https://www.baidu.com
      labels:
        instance: http_status
        group: web
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.188.68:9115
  # ping check
  - job_name: 'ping_status'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
    - targets:
      - 192.168.188.200
      labels:
        instance: 'ping_status'
        group: icmp
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.188.68:9115
  # port monitoring
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
    - targets:
      - 192.168.188.10:3928
      - 192.168.188.22:3306
      - 192.168.188.200:8090
      labels:
        instance: 'port_status'
        group: port
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.188.68:xx
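The three relabel_configs steps are what turn each listed target into a request against the blackbox exporter. The following sketch models them with a plain dictionary, as an illustration only; the function names are hypothetical and the sketch ignores everything else Prometheus does during relabeling.

```python
import urllib.parse


def relabel_for_blackbox(target, exporter_address):
    """Apply the three relabel steps from the scrape jobs above to one target."""
    labels = {"__address__": target}
    # Step 1: copy the target into __param_target, which becomes ?target=...
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed target visible as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the blackbox exporter.
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, module, metrics_path="/probe", scheme="http"):
    """Build the URL Prometheus would request for the relabeled target."""
    query = urllib.parse.urlencode(
        {"module": module, "target": labels["__param_target"]}
    )
    return f"{scheme}://{labels['__address__']}{metrics_path}?{query}"


labels = relabel_for_blackbox("https://www.baidu.com", "192.168.188.68:9115")
print(labels["instance"])
print(scrape_url(labels, "http_2xx"))
```

The resulting URL has the same shape as the manual probe request shown earlier, with the exporter address taken from the replacement field.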
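Grafana dashboard ID: 9965
Alerting rules can watch the probe_success metric.
For the icmp, tcp, http, and post probes, probe_success shows whether the check passed:
probe_success == 0 ## connectivity failure
probe_success == 1 ## connectivity OK
An alert likewise checks whether this metric equals 0 and fires when it does.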
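The same check can be run ad hoc against the Prometheus HTTP API. The sketch below assumes the Prometheus address from the configuration above (192.168.188.2:9090); the function names and the sample response are made up for illustration. It builds an instant-query URL for /api/v1/query and extracts the instance labels from a vector result.

```python
import json
import urllib.parse


def instant_query_url(prometheus, expr):
    """URL of an instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{prometheus.rstrip('/')}/api/v1/query?{query}"


def failed_instances(response_body):
    """Return the instance label of every series in a vector query result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [series["metric"].get("instance", "") for series in payload["data"]["result"]]


# Hypothetical response body, shaped like an instant-query vector result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "probe_success", "instance": "192.168.188.22:3306",
                        "job": "port_status"}, "value": [1700000000, "0"]},
        ],
    },
})

print(instant_query_url("http://192.168.188.2:9090", "probe_success == 0"))
print(failed_instances(SAMPLE_RESPONSE))
```

Fetching the URL with any HTTP client and passing the body to failed_instances lists the targets whose probes are currently failing.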
For monitoring Hadoop and other big data components, see:
https://github.com/tamtran96/hadoop-jmx-exporter/tree/master/dashboards
Reporting data through Pushgateway: https://cloud.tencent.com/developer/article/1531821
For more Prometheus exporters, see:
https://blog.51cto.com/u_14065119/4166081
Nginx reverse proxy for Grafana and Prometheus:
https://blog.csdn.net/Rambo_Yang/article/details/108061345
https://grafana.com/tutorials/run-grafana-behind-a-proxy/#1
Other references:
https://it.cha138.com/mysql/show-99068.html
https://www.infoq.cn/article/sxextntuttxduedeagiq
https://www.prometheus.wang/exporter/install_blackbox_exporter.html