Monitoring Proxmox VE with Grafana Stack on LXC
Complete guide to set up a comprehensive monitoring solution for Proxmox using Grafana, Prometheus, Loki, and AlertManager in an LXC container
Garage985
DevOps Engineer & Writer
A complete guide to monitoring your Proxmox VE infrastructure using a dedicated LXC container running the full Grafana stack with Telegram and Slack alerting.
Why do we need monitoring on Proxmox?
Whether your Proxmox cluster is used for testing, learning, or full production, you need a clear picture of its current state and the ability to react quickly to critical signals. Why CPU usage is spiking, how much RAM is being consumed, and how much storage remains are exactly the questions you need to answer to keep your systems running smoothly and reliably.
Actively monitoring your Proxmox resources ensures you’re the first to know when something goes wrong. It gives you the earliest possible warning and the time needed to take corrective action before a small issue turns into a major outage or data loss.
Here are some of the issues that can happen on a Proxmox host:
- Disk failure / ZFS degraded
- Root filesystem full
- RAM/swap exhausted
- Rogue VM eating CPU
- Node overheating
- Backups silently failing
- Node down after update
- Crypto-miner in container
- Storage disconnected/full
- etc
Design the monitoring system
graph TB
subgraph proxmox["Proxmox VE Host<br/>192.168.100.4:8006"]
vm1[VM/LXC]
vm2[VM/LXC]
vm3[VM/LXC]
vmore[...]
end
subgraph grafana_stack["Grafana-Stack LXC<br/>192.168.100.40"]
pve[PVE Exporter :9221<br/>Pulls metrics from Proxmox API]
prometheus[Prometheus :9090<br/>- Scrapes metrics 15s interval<br/>- Stores data 15-day retention<br/>- Evaluates alert rules]
grafana[Grafana :3000<br/>Dashboards]
alertmanager[AlertManager :9093<br/>Notifications]
end
slack[Slack<br/>Critical]
telegram[Telegram<br/>Operational]
proxmox -->|HTTPS API<br/>read-only| pve
pve -->|metrics| prometheus
prometheus -->|queries| grafana
prometheus -->|alerts| alertmanager
alertmanager -->|critical alerts| slack
alertmanager -->|operational alerts| telegram
style proxmox fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
style grafana_stack fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
style slack fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
style telegram fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style pve fill:#fff9c4,stroke:#f9a825
style prometheus fill:#ffebee,stroke:#c62828
style grafana fill:#e3f2fd,stroke:#1565c0
style alertmanager fill:#fce4ec,stroke:#c2185b
Design notes
- Proxmox node (or entire cluster) exposes its metrics via the standard HTTPS API (port 8006) using a read-only API token – no agents on the host and minimal security exposure.
- A lightweight LXC container (192.168.100.40) hosts the full observability stack:
- pve-exporter (prometheus-pve-exporter) queries the Proxmox API on each Prometheus scrape (every 15 seconds here) and translates nodes, VMs, containers, ZFS pools, replication state, backups, Ceph, etc. into Prometheus-compatible metrics.
- Prometheus scrapes the exporter, stores 15 days of history and evaluates your alerting rules.
- Grafana pulls data from Prometheus and displays beautiful, ready-made Proxmox dashboards (Glance, Node Overview, ZFS detail, etc.).
- Alertmanager receives firing alerts from Prometheus, groups them, silences noise, and routes them:
- Critical problems (disk failure, node down, pool degraded) → Slack
- Operational warnings (high load, backup failed, replication lag) → Telegram
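Since everything hinges on that read-only API token, a quick manual call against the Proxmox API is a useful sanity test once the token exists (it is created in Step 4). A minimal sketch, using the user/token names from this guide and a placeholder secret:
# Hypothetical check from the grafana-stack LXC: list nodes via the Proxmox API
# using the read-only token (<TOKEN_SECRET> is a placeholder for the value from Step 4)
curl -sk \
  -H "Authorization: PVEAPIToken=grafana-user@pve!grafana-token=<TOKEN_SECRET>" \
  https://192.168.100.4:8006/api2/json/nodes
A JSON list of nodes confirms the token can read the API; anything else means the exporter will fail in the same way.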
Key Advantages & Improvements
Key advantages:
| Advantage | Details |
|---|---|
| Simple Architecture | Easy to deploy and maintain (Prometheus + Grafana + AlertManager) |
| Lightweight | Runs efficiently inside one LXC with very low resource usage |
| Safe Integration | Uses read-only Proxmox API tokens — no agents required on host nodes |
| Centralized Monitoring | Automatically collects metrics from all nodes, VMs, and LXCs |
| Clear Alert Routing | Critical alerts → Slack; operational alerts → Telegram |
| Highly Extensible | Easy to add Node Exporter, Blackbox, SNMP, MySQL, Postgres exporters, etc. |
| Zero Cost | 100% open-source — no licensing fees ever |
| Stunning Dashboards | Grafana delivers modern, interactive, and community-supported visualizations |
| Simple Backup & Restore | Single LXC = one-click snapshot and restore with Proxmox Backup Server |
Room for improvement:
| Improvement | Details |
|---|---|
| No Built-in HA | Single LXC = single point of failure (mitigate with HA-enabled container or second node) |
| Persistent Storage | Mount a dedicated ZFS dataset or external volume for Prometheus TSDB |
| Configuration Backups | Put Grafana dashboards + Prometheus rules/alerts in Git |
| Additional Exporters | Add node_exporter, blackbox_exporter, and snmp_exporter for complete visibility |
| TLS & Authentication | Front with Traefik/Nginx + Authelia or OAuth2 for secure external access |
| Long-term Scaling | For >25 nodes or >1 year retention → switch to Thanos, Mimir, or VictoriaMetrics |
With the design phase complete, we can move on to setup.
Prerequisites
This guide is based on my local infrastructure:
- Proxmox VE 9.1.1 installed and running
- Debian 13 LXC template downloaded in Proxmox
- Basic understanding of Linux commands
- Telegram account (for alerts)
- Slack workspace (optional, for critical alerts)
Step 1: Create LXC Container
Create an unprivileged Debian 13 LXC container for the Grafana stack.
Specifications:
- VMID: 140
- Hostname: grafana-stack
- Template: Debian 13 standard
- CPU: 4 cores
- RAM: 8GB
- Disk: 50GB (local-zfs)
- Network: Static IP 192.168.100.40/24
# On Proxmox host
# Option notes:
#   --unprivileged 1      safer unprivileged container
#   --features nesting=0  nesting off (no Docker will run inside this container)
#   --sshkeys             injects your workstation's public key
#   --start 1             start the container once created
pct create 140 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
  --hostname grafana-stack \
  --cores 4 \
  --memory 8192 \
  --swap 2048 \
  --rootfs local-zfs:50 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.100.40/24,gw=192.168.100.1 \
  --nameserver 8.8.8.8 \
  --unprivileged 1 \
  --features nesting=0 \
  --sshkeys /root/.ssh/your-key \
  --start 1
Here is what you will see when you check the Proxmox UI:

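If you prefer the CLI over the UI, the same information is available from the Proxmox shell:
# Quick check on the Proxmox host: container state and applied configuration
pct status 140
pct config 140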
Step 2: Configure LXC Network
LXC containers don’t use cloud-init. Proxmox applies the --net0 settings from Step 1 when the container starts, but it is worth verifying the configuration inside the container and adjusting it manually if needed.
# Access container console via Proxmox UI or:
pct enter 140
# Configure network
cat > /etc/network/interfaces << 'EOF'
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 192.168.100.40
netmask 255.255.255.0
gateway 192.168.100.1
dns-nameservers 8.8.8.8 8.8.4.4
EOF
# Restart networking
systemctl restart networking
# Verify
ip addr show eth0
ping -c 3 8.8.8.8
Step 3: Install Grafana Stack
Install all monitoring components using the automated bash script. You can download the grafana-stack-setup.sh script.
# Update system
apt update && apt upgrade -y
# Install dependencies
apt install -y apt-transport-https wget curl gnupg2 ca-certificates \
python3 python3-pip unzip
# Download installation script
wget https://gist.githubusercontent.com/sule9985/fabf9e4ebcd9bd93019bd0a5ada5d827/raw/8c7c3f8bf5aa28bba4585142ec876a001b18f63a/grafana-stack-setup.sh
chmod +x grafana-stack-setup.sh
# Run installation
./grafana-stack-setup.sh
The script installs:
- Grafana 12.3.0 - Visualization platform
- Prometheus 3.7.3 - Metrics collection and storage
- Loki 3.6.0 - Log aggregation
- AlertManager 0.29.0 - Alert routing and notifications
- Proxmox PVE Exporter 3.5.5 - Proxmox metrics collector
Installation takes about 5-10 minutes, and a successful run ends with output like this in the terminal:
=============================================
VERIFYING INSTALLATION
=============================================
[STEP] Checking service status...
✓ grafana-server: running
✓ prometheus: running
✓ loki: running
✓ alertmanager: running
! prometheus-pve-exporter: not configured
[STEP] Checking network connectivity...
✓ Port 3000 (Grafana): listening
✓ Port 9090 (Prometheus): listening
✓ Port 3100 (Loki): listening
✓ Port 9093 (AlertManager): listening
[SUCCESS] All services verified successfully!
[SUCCESS] Installation completed successfully in 53 seconds!
Step 4: Create Proxmox Monitoring User
Create a read-only user on Proxmox for the PVE Exporter to collect metrics.
# SSH to Proxmox host
ssh root@192.168.100.4
# Create monitoring user
pveum user add grafana-user@pve --comment "Grafana monitoring user"
# Assign read-only permissions
pveum acl modify / --user grafana-user@pve --role PVEAuditor
# Create API token
pveum user token add grafana-user@pve grafana-token --privsep 0
# Save the token output!
# Example: 8a7b6c5d-1234-5678-90ab-cdef12345678
Important: Save the full token value - it’s only shown once!
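It also doesn’t hurt to confirm the token and ACL exist before moving on (still on the Proxmox host; names match the commands above):
# The token should be listed for the monitoring user...
pveum user token list grafana-user@pve
# ...and the PVEAuditor ACL should be in place
pveum acl list | grep grafana-user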
Step 5: Configure PVE Exporter
Configure the Proxmox PVE Exporter with the API token.
# On grafana-stack LXC
ssh -i PATH_TO_YOUR_KEY root@192.168.100.40
# Edit PVE exporter configuration
nano /etc/prometheus-pve-exporter/pve.yml
Configuration:
default:
  # Read-only Proxmox monitoring user and API token created in Step 4
  user: grafana-user@pve
  token_name: "grafana-token"
  token_value: "TOKEN_VALUE"
  # OR use password authentication instead of a token:
  # password: "CHANGE_ME"
  verify_ssl: false

# Target Proxmox hosts
pve1:
  user: grafana-user@pve
  token_name: "grafana-token"
  token_value: "TOKEN_VALUE"
  verify_ssl: false
  target: https://192.168.100.4:8006
Start the exporter:
# Start service
systemctl start prometheus-pve-exporter
# Verify it's working
root@grafana-stack:~# systemctl status prometheus-pve-exporter.service
● prometheus-pve-exporter.service - Prometheus Proxmox VE Exporter
Loaded: loaded (/etc/systemd/system/prometheus-pve-exporter.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-11-23 11:22:06 +07; 4 days ago
Invocation: 1c35a29336b346e8b553b74a4d8fc533
Docs: https://github.com/prometheus-pve/prometheus-pve-exporter
Main PID: 10509 (pve_exporter)
Tasks: 4 (limit: 75893)
Memory: 44.4M (peak: 45.2M)
CPU: 27min 52.526s
CGroup: /system.slice/prometheus-pve-exporter.service
├─10509 /usr/bin/python3 /usr/local/bin/pve_exporter --config.file=/etc/prometheus-pve-exporter/pve.yml --web.listen-address=0.0.0.0:9221
└─10550 /usr/bin/python3 /usr/local/bin/pve_exporter --config.file=/etc/prometheus-pve-exporter/pve.yml --web.listen-address=0.0.0.0:9221
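With the service up, you can query the exporter directly to make sure it returns Proxmox metrics. This sketch uses the same target value that Prometheus will send in the next step:
# A healthy response includes pve_up ... 1 for the node and running guests
curl -s "http://localhost:9221/pve?target=192.168.100.4:8006" | grep pve_up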
Step 6: Configure Prometheus Scraping
Update Prometheus to scrape the PVE Exporter correctly.
# Edit Prometheus config
nano /etc/prometheus/prometheus.yml
Add/update the Proxmox job:
scrape_configs:
  # ──────────────────────────────────────────────────────────────
  # Proxmox VE monitoring via pve-exporter (runs inside the LXC)
  # ──────────────────────────────────────────────────────────────
  - job_name: 'proxmox'                # Friendly name shown in Prometheus/Grafana
    metrics_path: '/pve'               # Endpoint where pve-exporter serves Proxmox metrics
    params:
      target: ['192.168.100.4:8006']   # Your Proxmox node (or cluster) + GUI port
                                       # Supports multiple nodes: ['node1:8006','node2:8006']
    static_configs:
      - targets: ['localhost:9221']    # Where pve-exporter is listening inside this LXC
        labels:
          service: 'proxmox-pve'       # Custom label – helps filtering in Grafana
          instance: 'pve-host'         # Logical name for your cluster/node
Reload Prometheus:
# Restart Prometheus
systemctl restart prometheus
Check the Prometheus targets page at http://192.168.100.40:9090/targets; the proxmox job should show the UP state.

For another check in the same Prometheus UI, open the Query page, type pve_cpu_usage_limit, and press Execute; it should return the CPU usage limit metrics:

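The same checks can be run from the command line via the Prometheus HTTP API, which is convenient when working over SSH:
# Target health - look for "health":"up" on the proxmox job
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
# Run the same PromQL query as the UI check above
curl -s --get http://localhost:9090/api/v1/query --data-urlencode 'query=pve_cpu_usage_limit'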
This check is important because it confirms your setup is working correctly. A quick recap of what we have so far:
- Proxmox server (v9.1.1)
- PVE user: grafana-user (Role as PVEAuditor)
- API Token: grafana-token
- Grafana-LXC (container)
- Grafana (v12.3.0)
- AlertManager (v0.29.0)
- Loki (v3.6.0)
- Prometheus (v3.7.3)
- Prometheus PVE Exporter (v3.5.5)
- Configurations:
/etc/prometheus-pve-exporter/pve.yml
/etc/prometheus/prometheus.yml
Note: There is no agent/bot installed on Proxmox server.
Step 7: Import Grafana Dashboard
Access Grafana and import the official Proxmox dashboard:
- Log in to Grafana at http://192.168.100.40:3000 (default credentials: admin / admin)
- Click Dashboards → New → Import
- Enter Dashboard ID: 10347
- Click Load
- Select Prometheus as the datasource
- Click Import

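If you want to confirm the datasource from the CLI as well, the Grafana HTTP API will list it (this assumes the default admin/admin credentials; adjust if you have changed them):
# The Prometheus datasource should appear in this list
curl -s -u admin:admin http://192.168.100.40:3000/api/datasources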
Step 8: Set Up Alerting
In this demo, we use both Slack and Telegram for notifications.
Create Telegram Bot
- Open Telegram and search for @BotFather
- Send /newbot
- Follow the prompts to create the bot
- Save the Bot Token
- Start a chat with your bot
- Get the Chat ID from https://api.telegram.org/bot<TOKEN>/getUpdates (see the example below)
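A quick way to find the chat ID and verify the bot works, where <TOKEN> and <CHAT_ID> are placeholders for your own values:
# 1) Send any message to your bot in Telegram first, then fetch updates;
#    the chat ID appears under "chat":{"id": ...}
curl -s "https://api.telegram.org/bot<TOKEN>/getUpdates" | python3 -m json.tool
# 2) Send a test message through the bot to confirm the chat ID is correct
curl -s "https://api.telegram.org/bot<TOKEN>/sendMessage" \
  -d chat_id=<CHAT_ID> -d text="Test message from grafana-stack"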
Create Slack Webhook (Optional)
- Go to https://api.slack.com/apps
- Create New App → From scratch
- Enable Incoming Webhooks
- Add the webhook to a channel (e.g., #infrastructure-alerts)
- Save the Webhook URL (you can test it with the example below)
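To confirm the webhook works before wiring it into AlertManager (the URL below is a placeholder for your real webhook):
# Post a test message to the Slack incoming webhook
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text":"Test alert from grafana-stack"}' \
  https://hooks.slack.com/services/XXX/YYY/ZZZ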
Configure Alert Rules
Understanding the Alert Pipeline
Before configuring alerts, it’s important to understand the two-stage alerting architecture:
Prometheus Alert Rules (What to monitor):
- Define WHEN alerts should fire based on metric conditions
- Evaluate expressions like “CPU > 85%” or “Disk > 90%”
- Add labels to categorize alerts (severity, notification_channel)
- Run continuously at specified intervals
AlertManager Configuration (How to notify):
- Defines WHERE to send alerts (Slack, Telegram, email)
- Routes alerts based on labels (e.g., notification_channel: slack)
- Groups similar alerts together to reduce noise
- Handles deduplication, silencing, and inhibition rules
Why this separation?
- Prometheus focuses on metric evaluation and detection
- AlertManager handles the complex logic of notification routing, grouping, and delivery
- Allows multiple Prometheus instances to share one AlertManager
- Provides flexibility to change notification channels without modifying detection rules
Our setup overview:
- Slack receives critical infrastructure alerts (host CPU, memory, disk)
- Telegram receives operational alerts (storage, VMs, containers)
- Alert rules include both warning (85%) and critical (95%) thresholds
- Inhibition rules suppress warning alerts when critical alerts are firing
flowchart TD
subgraph Prometheus["📊 Prometheus"]
A[Metrics Collection<br/>PVE Exporter] --> B[Alert Rules Evaluation<br/>Every 30s-1m]
B --> C{Condition<br/>Met?}
end
C -->|"Yes"| D[Send Alert to AlertManager]
C -->|"No"| E[Continue Monitoring]
E --> A
subgraph AlertManager["🔔 AlertManager"]
D --> F[Receive Alerts]
F --> G[Grouping & Deduplication<br/>group_wait: 10s-30s]
G --> H{Route by<br/>Label}
H --> I[Apply Inhibition Rules<br/>Suppress warnings if critical firing]
end
subgraph Examples["Example Alert Conditions"]
J1["CPU > 85% (warning)<br/>label: notification_channel=slack"]
J2["Storage > 80% (warning)<br/>label: notification_channel=telegram"]
J3["VM Down<br/>label: notification_channel=telegram"]
end
subgraph Notifications["📬 Notification Channels"]
K["🔵 Slack<br/>Critical Infrastructure<br/>• Host CPU/Memory/Disk<br/>• Repeat every 1h"]
L["💬 Telegram<br/>Operational Alerts<br/>• Storage Usage<br/>• VM/LXC Status<br/>• Repeat every 2h"]
end
I -->|"notification_channel:<br/>slack"| K
I -->|"notification_channel:<br/>telegram"| L
style Prometheus fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
style AlertManager fill:#fff4e1,stroke:#ff9900,stroke-width:2px
style Examples fill:#f0f0f0,stroke:#666,stroke-width:1px,stroke-dasharray: 5 5
style Notifications fill:#e8f5e9,stroke:#00aa00,stroke-width:2px
style K fill:#4a90e2,color:#fff,stroke:#2563eb,stroke-width:2px
style L fill:#0088cc,color:#fff,stroke:#0066aa,stroke-width:2px
style C fill:#ffd700,stroke:#ff8800
style H fill:#ffd700,stroke:#ff8800
Create Prometheus alert rules:
# Create alert rules file
nano /etc/prometheus/rules/proxmox.yml
groups:
  # Group 1: Host Alerts (Slack - Critical Infrastructure)
  - name: proxmox_host_alerts
    interval: 30s
    rules:
      - alert: ProxmoxHostDown               # Node unreachable
        expr: pve_up{id="node/pve"} == 0
        for: 1m
        labels: { severity: critical, notification_channel: slack }
        annotations:
          summary: "Proxmox host is down"
          description: "Proxmox host 'pve' is unreachable or down for >1min."

      - alert: ProxmoxHighCPU                # Warning at 85%
        expr: pve_cpu_usage_ratio{id="node/pve"} > 0.85
        for: 5m
        labels: { severity: warning, notification_channel: slack }
        annotations:
          summary: "High CPU usage on Proxmox host"
          description: "CPU usage is {{ $value | humanizePercentage }} (>85%)."

      - alert: ProxmoxCriticalCPU            # Critical at 95%
        expr: pve_cpu_usage_ratio{id="node/pve"} > 0.95
        for: 2m
        labels: { severity: critical, notification_channel: slack }
        annotations:
          summary: "CRITICAL CPU usage on Proxmox host"
          description: "CPU usage is {{ $value | humanizePercentage }} (>95%)."

      - alert: ProxmoxHighMemory             # Warning at 85%
        expr: (pve_memory_usage_bytes{id="node/pve"} / pve_memory_size_bytes{id="node/pve"}) > 0.85
        for: 5m
        labels: { severity: warning, notification_channel: slack }

      - alert: ProxmoxCriticalMemory         # Critical at 95%
        expr: (pve_memory_usage_bytes{id="node/pve"} / pve_memory_size_bytes{id="node/pve"}) > 0.95
        for: 2m
        labels: { severity: critical, notification_channel: slack }

      - alert: ProxmoxHighDiskUsage          # Root disk warning at 80%
        expr: (pve_disk_usage_bytes{id="node/pve"} / pve_disk_size_bytes{id="node/pve"}) > 0.80
        for: 10m
        labels: { severity: warning, notification_channel: slack }

      - alert: ProxmoxCriticalDiskUsage      # Critical at 90%
        expr: (pve_disk_usage_bytes{id="node/pve"} / pve_disk_size_bytes{id="node/pve"}) > 0.90
        for: 5m
        labels: { severity: critical, notification_channel: slack }

  # Group 2: Storage Alerts (Telegram - Operational)
  - name: proxmox_storage_alerts
    interval: 1m
    rules:
      - alert: ProxmoxStorageHighUsage       # Any ZFS/NFS storage >80%
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) > 0.80
        for: 10m
        labels: { severity: warning, notification_channel: telegram }
        annotations:
          summary: "High usage on storage {{ $labels.storage }}"

      - alert: ProxmoxStorageCriticalUsage   # >90%
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) > 0.90
        for: 5m
        labels: { severity: critical, notification_channel: telegram }

  # Group 3: VM/LXC Alerts (Telegram - Operational)
  - name: proxmox_vm_alerts
    interval: 1m
    rules:
      - alert: ProxmoxVMDown                 # Any VM or LXC down
        expr: pve_up{id=~"(qemu|lxc)/.*", template="0"} == 0
        for: 5m
        labels: { severity: warning, notification_channel: telegram }
        annotations:
          summary: "VM/Container {{ $labels.name }} is down"

      - alert: ProxmoxVMHighCPU              # Guest CPU >90%
        expr: pve_cpu_usage_ratio{id=~"(qemu|lxc)/.*", template="0"} > 0.90
        for: 10m
        labels: { severity: warning, notification_channel: telegram }

      - alert: ProxmoxVMHighMemory           # Guest memory >90%
        expr: (pve_memory_usage_bytes{id=~"(qemu|lxc)/.*", template="0"} / pve_memory_size_bytes{id=~"(qemu|lxc)/.*", template="0"}) > 0.90
        for: 10m
        labels: { severity: warning, notification_channel: telegram }
Configure AlertManager
nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

# Routing tree - directs alerts to receivers based on labels
route:
  # Default grouping and timing
  group_by: ['alertname', 'severity', 'alert_group']
  group_wait: 10s        # Wait before sending first notification
  group_interval: 10s    # Wait before sending notifications for new alerts in group
  repeat_interval: 12h   # Resend notification every 12 hours if still firing

  # Default receiver for unmatched alerts
  receiver: 'telegram-default'

  # Child routes - matched in order, first match wins
  routes:
    # Route 1: Slack for host_alerts (critical infrastructure)
    - match:
        notification_channel: slack
      receiver: 'slack-critical'
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h   # Repeat every hour for critical infrastructure
      continue: false       # Stop matching after this route

    # Route 2: Telegram for telegram channel alerts (storage & VMs)
    - match:
        notification_channel: telegram
      receiver: 'telegram-operational'
      group_wait: 30s
      group_interval: 30s
      repeat_interval: 2h   # Repeat every 2 hours for operational alerts
      continue: false

# Notification receivers
receivers:
  # Slack receiver for critical infrastructure (host alerts)
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'SLACK_WEBHOOK'
        channel: '#alerts-test'
        username: 'Prometheus AlertManager'
        icon_emoji: ':warning:'
        title: '{{ .GroupLabels.alertname }} - {{ .GroupLabels.severity | toUpper }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Component:* {{ .Labels.component }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
        send_resolved: true
        # Optional: color the message by alert status
        # color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  # Telegram receiver for operational alerts (storage & VM)
  - name: 'telegram-operational'
    telegram_configs:
      - bot_token: 'BOT_TOKEN'
        chat_id: CHAT_ID_NUMBERS
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.severity | toUpper }}: {{ .Labels.alertname }}</b>
          {{ .Annotations.summary }}
          <b>Details:</b>
          {{ .Annotations.description }}
          <b>Component:</b> {{ .Labels.component }}
          <b>Group:</b> {{ .Labels.alert_group }}
          <b>Status:</b> {{ .Status }}
          {{ end }}
        send_resolved: true

  # Default Telegram receiver (fallback)
  - name: 'telegram-default'
    telegram_configs:
      - bot_token: 'BOT_TOKEN'
        chat_id: CHAT_ID_NUMBERS
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.severity | toUpper }}: {{ .Labels.alertname }}</b>
          {{ .Annotations.summary }}
          {{ .Annotations.description }}
          <b>Component:</b> {{ .Labels.component }}
          {{ end }}
        send_resolved: true

# Inhibition rules - suppress alerts based on other alerts
inhibit_rules:
  # If a critical alert is firing, suppress warning alerts for the same component
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component', 'alertname']
Reload services:
# Validate configs
promtool check rules /etc/prometheus/rules/proxmox.yml
amtool check-config /etc/alertmanager/alertmanager.yml
# Reload (the curl reload only works if Prometheus runs with --web.enable-lifecycle;
# otherwise use: systemctl restart prometheus)
curl -X POST http://localhost:9090/-/reload
systemctl restart alertmanager
Image placeholder: Screenshot of Telegram showing test alert notification
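Once the configuration is loaded, amtool can also show how a given label set would be routed, which is a quick sanity check of the routing tree:
# Which receiver handles a critical, Slack-labelled alert?
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  notification_channel=slack severity=critical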
Step 9: Test Alerts
Send test alerts to verify routing works.
# Test Slack alert (AlertManager 0.27+ only serves the v2 API)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
  "labels": {
    "alertname": "TestSlack",
    "notification_channel": "slack",
    "severity": "warning"
  },
  "annotations": {
    "summary": "Test Slack Alert",
    "description": "This is a test"
  }
}]'
# Test Telegram alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
  "labels": {
    "alertname": "TestTelegram",
    "notification_channel": "telegram",
    "severity": "warning"
  },
  "annotations": {
    "summary": "Test Telegram Alert",
    "description": "This is a test"
  }
}]'
Verify alerts appear in:
- Prometheus: http://192.168.100.40:9090/alerts
- AlertManager: http://192.168.100.40:9093
- Slack/Telegram channels
Image placeholder: Screenshot showing alerts in Prometheus UI
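You can also list the currently active alerts from the CLI instead of the web UIs:
# Active alerts as seen by AlertManager
amtool alert query --alertmanager.url=http://localhost:9093
# Or via the HTTP API
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool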
Monitoring Metrics
Key metrics available:
Host Metrics:
- pve_cpu_usage_ratio - CPU usage (0-1)
- pve_memory_usage_bytes / pve_memory_size_bytes - Memory usage
- pve_disk_usage_bytes / pve_disk_size_bytes - Disk usage
- pve_up{id="node/pve"} - Host availability
VM/Container Metrics:
- pve_up{id=~"(qemu|lxc)/.*"} - VM/Container status
- pve_cpu_usage_ratio{id=~"qemu/.*"} - Per-VM CPU
- pve_memory_usage_bytes{id=~"qemu/.*"} - Per-VM memory
- pve_guest_info - VM/Container metadata
Storage Metrics:
- pve_disk_usage_bytes{id=~"storage/.*"} - Storage pool usage
- pve_storage_info - Storage pool information
Image placeholder: Screenshot of Grafana showing multiple metric panels
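Any of these metrics can be queried ad hoc through the Prometheus HTTP API before building a panel; the same expression can then be pasted into a Grafana query. For example, storage usage as a percentage:
# Storage usage in percent, per storage pool
curl -s --get http://192.168.100.40:9090/api/v1/query \
  --data-urlencode 'query=100 * pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}'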
Alert Rules Summary
| Alert | Threshold | Duration | Channel | Severity |
|---|---|---|---|---|
| ProxmoxHostDown | == 0 | 1 min | Slack | Critical |
| ProxmoxHighCPU | >85% | 5 min | Slack | Warning |
| ProxmoxCriticalCPU | >95% | 2 min | Slack | Critical |
| ProxmoxHighMemory | >85% | 5 min | Slack | Warning |
| ProxmoxCriticalMemory | >95% | 2 min | Slack | Critical |
| ProxmoxHighDiskUsage | >80% | 10 min | Slack | Warning |
| ProxmoxCriticalDiskUsage | >90% | 5 min | Slack | Critical |
| ProxmoxStorageHighUsage | >80% | 10 min | Telegram | Warning |
| ProxmoxStorageCriticalUsage | >90% | 5 min | Telegram | Critical |
| ProxmoxVMDown | == 0 | 5 min | Telegram | Warning |
| ProxmoxVMHighCPU | >90% | 10 min | Telegram | Warning |
| ProxmoxVMHighMemory | >90% | 10 min | Telegram | Warning |
Access URLs
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://192.168.100.40:3000 | admin / admin |
| Prometheus | http://192.168.100.40:9090 | - |
| AlertManager | http://192.168.100.40:9093 | - |
| PVE Exporter | http://192.168.100.40:9221/metrics | - |
Maintenance
Check service status:
systemctl status grafana-server prometheus loki alertmanager prometheus-pve-exporter
View logs:
journalctl -u prometheus -f
journalctl -u alertmanager -f
journalctl -u prometheus-pve-exporter -f
Backup configurations:
tar -czf grafana-stack-backup.tar.gz \
/etc/prometheus \
/etc/loki \
/etc/alertmanager \
/etc/grafana \
/etc/prometheus-pve-exporter
Update retention:
# Prometheus (default: 15 days)
nano /etc/systemd/system/prometheus.service
# Edit: --storage.tsdb.retention.time=30d
systemctl daemon-reload
systemctl restart prometheus
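After the restart, confirm the retention flag the service is actually configured with:
# Show the unit file systemd is using and check the retention flag
systemctl cat prometheus | grep retention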
Troubleshooting
No metrics in Grafana:
- Check Prometheus targets: http://192.168.100.40:9090/targets
- Verify PVE Exporter: curl http://localhost:9221/metrics | grep pve_up
- Check Proxmox permissions: pveum acl list | grep grafana-user
Alerts not firing:
- Check alert rules: http://192.168.100.40:9090/alerts
- Verify AlertManager: http://192.168.100.40:9093
- Test notification channels with curl commands above
PVE Exporter shows pve_up = 0:
- Verify Proxmox is accessible: curl -k https://192.168.100.4:8006
- Check the API token is correct in /etc/prometheus-pve-exporter/pve.yml
- Verify the user has the PVEAuditor role: pveum acl list
Recommended Dashboards
Import these additional Grafana dashboards:
- 10347 - Proxmox VE (official, comprehensive)
- 13865 - JMeter Load Testing (if using JMeter)
- 1860 - Node Exporter (Linux host metrics)
- 15356 - Proxmox Multi-Server (for multiple hosts)
Conclusion
You now have a complete monitoring solution for Proxmox VE with:
- ✅ Real-time metrics visualization in Grafana
- ✅ 15 days of metrics history in Prometheus
- ✅ 7 days of log storage in Loki
- ✅ Smart alert routing (Slack for critical, Telegram for operational)
- ✅ Comprehensive dashboards for host, VM, and storage monitoring
This setup provides enterprise-grade monitoring for your Proxmox infrastructure with minimal resource overhead (4 CPU, 8GB RAM for the entire stack).