A complete guide to monitoring your Proxmox VE infrastructure using a dedicated LXC container running the full Grafana stack with Telegram and Slack alerting.

Why do we need monitoring on Proxmox?

Whether your Proxmox cluster is used for testing, learning, or full production, you need to clearly understand its current state and be able to react quickly to critical signals. Knowing why CPU usage is spiking, how much RAM is being consumed, and how much storage remains are essential metrics to keep your systems running smoothly and reliably.

Actively monitoring your Proxmox resources ensures you're the first to know when something goes wrong. It gives you the earliest possible warning and the time you need to take corrective action before a small issue turns into a major outage or data loss.

Here are some issues that can happen on a Proxmox host:

  • Disk failure / ZFS degraded
  • Root filesystem full
  • RAM/swap exhausted
  • Rogue VM eating CPU
  • Node overheating
  • Backups silently failing
  • Node down after update
  • Crypto-miner in container
  • Storage disconnected/full
  • etc

Design the monitoring system

graph TB
  subgraph proxmox["Proxmox VE Host<br/>192.168.100.4:8006"]
      vm1[VM/LXC]
      vm2[VM/LXC]
      vm3[VM/LXC]
      vmore[...]
  end

  subgraph grafana_stack["Grafana-Stack LXC<br/>192.168.100.40"]
      pve[PVE Exporter :9221<br/>Pulls metrics from Proxmox API]

      prometheus[Prometheus :9090<br/>- Scrapes metrics 15s interval<br/>- Stores data 15-day retention<br/>- Evaluates alert rules]

      grafana[Grafana :3000<br/>Dashboards]

      alertmanager[AlertManager :9093<br/>Notifications]
  end

  slack[Slack<br/>Critical]
  telegram[Telegram<br/>Operational]

  proxmox -->|HTTPS API<br/>read-only| pve
  pve -->|metrics| prometheus
  prometheus -->|queries| grafana
  prometheus -->|alerts| alertmanager
  alertmanager -->|critical alerts| slack
  alertmanager -->|operational alerts| telegram

  style proxmox fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
  style grafana_stack fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
  style slack fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
  style telegram fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
  style pve fill:#fff9c4,stroke:#f9a825
  style prometheus fill:#ffebee,stroke:#c62828
  style grafana fill:#e3f2fd,stroke:#1565c0
  style alertmanager fill:#fce4ec,stroke:#c2185b

Design notes

  • Proxmox node (or entire cluster) exposes its metrics via the standard HTTPS API (port 8006) using a read-only API token – minimal security exposure and nothing extra installed on the host.
  • A lightweight LXC container (192.168.100.40) hosts the full observability stack:
    • pve-exporter (or prometheus-pve-exporter) queries the Proxmox API every 15 seconds and translates nodes, VMs, containers, ZFS pools, replication state, backups, Ceph, etc. into Prometheus-compatible metrics.
    • Prometheus scrapes the exporter, stores 15 days of history and evaluates your alerting rules.
    • Grafana pulls data from Prometheus and displays beautiful, ready-made Proxmox dashboards (Glance, Node Overview, ZFS detail, etc.).
    • Alertmanager receives firing alerts from Prometheus, groups them, silences noise, and routes them:
      • Critical problems (disk failure, node down, pool degraded) → Slack
      • Operational warnings (high load, backup failed, replication lag) → Telegram

Key Advantages & Improvements

Key Advantages

  • Simple Architecture: easy to deploy and maintain (Prometheus + Grafana + AlertManager)
  • Lightweight: runs efficiently inside one LXC with very low resource usage
  • Safe Integration: uses read-only Proxmox API tokens, no agents required on host nodes
  • Centralized Monitoring: automatically collects metrics from all nodes, VMs, and LXCs
  • Clear Alert Routing: critical alerts → Slack, operational alerts → Telegram
  • Highly Extensible: easy to add Node Exporter, Blackbox, SNMP, MySQL, Postgres exporters, etc.
  • Zero Cost: 100% open-source, no licensing fees ever
  • Stunning Dashboards: Grafana delivers modern, interactive, and community-supported visualizations
  • Simple Backup & Restore: single LXC = one-click snapshot and restore with Proxmox Backup Server

Room for Improvement

  • No Built-in HA: single LXC = single point of failure (mitigate with an HA-enabled container or a second node)
  • Persistent Storage: mount a dedicated ZFS dataset or external volume for the Prometheus TSDB
  • Configuration Backups: keep Grafana dashboards + Prometheus rules/alerts in Git
  • Additional Exporters: add node_exporter, blackbox_exporter, and snmp_exporter for complete visibility
  • TLS & Authentication: front with Traefik/Nginx + Authelia or OAuth2 for secure external access
  • Long-term Scaling: for >25 nodes or >1 year retention, switch to Thanos, Mimir, or VictoriaMetrics

Now that the 'design' phase is complete, we can move on to the 'setup' phase.

Prerequisites

This guide is based on my local infrastructure:

  • Proxmox VE 9.1.1 installed and running
  • Debian 13 LXC template downloaded in Proxmox
  • Basic understanding of Linux commands
  • Telegram account (for alerts)
  • Slack workspace (optional, for critical alerts)

Step 1: Create LXC Container

Create an unprivileged Debian 13 LXC container for the Grafana stack.

Specifications:

  • VMID: 140
  • Hostname: grafana-stack
  • Template: Debian 13 standard
  • CPU: 4 cores
  • RAM: 8GB
  • Disk: 50GB (local-zfs)
  • Network: Static IP 192.168.100.40/24
# On Proxmox host
# - unprivileged container for safety
# - nesting off: no Docker (container-in-container) will run inside
# - your workstation SSH keys are injected via --sshkeys
# - --start 1 boots the container right after creation
# Note: inline comments cannot follow the trailing '\' line continuations, so they live up here.
pct create 140 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
  --hostname grafana-stack \
  --cores 4 \
  --memory 8192 \
  --swap 2048 \
  --rootfs local-zfs:50 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.100.40/24,gw=192.168.100.1 \
  --nameserver 8.8.8.8 \
  --unprivileged 1 \
  --features nesting=0 \
  --sshkeys /root/.ssh/your-key \
  --start 1
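
Optionally, you can confirm the container from the Proxmox shell before opening the UI; pct status and pct config are standard Proxmox CLI commands:

# Quick sanity check on the Proxmox host
pct status 140     # should report: status: running
pct config 140     # prints the resolved container configuration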

Here is what you will see when you check the Proxmox UI:

grafana-lxc on proxmox

Step 2: Configure LXC Network

LXC containers don't use cloud-init. Proxmox normally applies the static IP from the pct create command, but if the container did not come up with the expected address, configure the network manually.

# Access container console via Proxmox UI or:
pct enter 140

# Configure network
cat > /etc/network/interfaces << 'EOF'
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.100.40
    netmask 255.255.255.0
    gateway 192.168.100.1
    dns-nameservers 8.8.8.8 8.8.4.4
EOF

# Restart networking
systemctl restart networking

# Verify
ip addr show eth0
ping -c 3 8.8.8.8

Step 3: Install Grafana Stack

Install all monitoring components using the automated grafana-stack-setup.sh bash script:

# Update system
apt update && apt upgrade -y

# Install dependencies
apt install -y apt-transport-https wget curl gnupg2 ca-certificates \
  python3 python3-pip unzip

# Download installation script
wget https://gist.githubusercontent.com/sule9985/fabf9e4ebcd9bd93019bd0a5ada5d827/raw/8c7c3f8bf5aa28bba4585142ec876a001b18f63a/grafana-stack-setup.sh
chmod +x grafana-stack-setup.sh

# Run installation
./grafana-stack-setup.sh

The script installs:

  • Grafana 12.3.0 - Visualization platform
  • Prometheus 3.7.3 - Metrics collection and storage
  • Loki 3.6.0 - Log aggregation
  • AlertManager 0.29.0 - Alert routing and notifications
  • Proxmox PVE Exporter 3.5.5 - Proxmox metrics collector

Installation takes roughly 5-10 minutes, and on success the terminal output looks like this:

=============================================
VERIFYING INSTALLATION
=============================================


[STEP] Checking service status...

 grafana-server: running
 prometheus: running
 loki: running
 alertmanager: running
  ! prometheus-pve-exporter: not configured

[STEP] Checking network connectivity...

 Port 3000 (Grafana): listening
 Port 9090 (Prometheus): listening
 Port 3100 (Loki): listening
 Port 9093 (AlertManager): listening

[SUCCESS] All services verified successfully!

[SUCCESS] Installation completed successfully in 53 seconds!

Step 4: Create Proxmox Monitoring User

Create a read-only user on Proxmox for the PVE Exporter to collect metrics.

# SSH to Proxmox host
ssh root@192.168.100.4

# Create monitoring user
pveum user add grafana-user@pve --comment "Grafana monitoring user"

# Assign read-only permissions
pveum acl modify / --user grafana-user@pve --role PVEAuditor

# Create API token
pveum user token add grafana-user@pve grafana-token --privsep 0

# Save the token output!
# Example: 8a7b6c5d-1234-5678-90ab-cdef12345678

Important: Save the full token value - it’s only shown once!
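
As an optional sanity check, the token can be exercised directly against the Proxmox API before wiring it into the exporter. A minimal sketch, assuming the token value you just saved (the header format is PVEAPIToken=<user>@<realm>!<token-id>=<value>):

# From any machine that can reach the Proxmox host
curl -k -H "Authorization: PVEAPIToken=grafana-user@pve!grafana-token=TOKEN_VALUE" \
  https://192.168.100.4:8006/api2/json/nodes
# A JSON list of nodes confirms the token and the PVEAuditor role work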

Step 5: Configure PVE Exporter

Configure the Proxmox PVE Exporter with the API token.

# On grafana-stack LXC
ssh -i PATH_TO_YOUR_KEY root@192.168.100.40

# Edit PVE exporter configuration
nano /etc/prometheus-pve-exporter/pve.yml

Configuration:

default:
  user: grafana-user@pve
  # Read-only monitoring user and API token created in Step 4 on the Proxmox host
  token_name: "grafana-token"
  token_value: "TOKEN_VALUE"
  # OR use password:
  # password: "CHANGE_ME"
  verify_ssl: false

# Target Proxmox hosts
pve1:
  user: grafana-user@pve
  token_name: "grafana-token"
  token_value: "TOKEN_VALUE"
  verify_ssl: false
  target: https://192.168.100.4:8006

Start the exporter:

# Start service
systemctl start prometheus-pve-exporter

# Verify it's working
root@grafana-stack:~# systemctl status prometheus-pve-exporter.service 
 prometheus-pve-exporter.service - Prometheus Proxmox VE Exporter
     Loaded: loaded (/etc/systemd/system/prometheus-pve-exporter.service; enabled; preset: enabled)
     Active: active (running) since Sun 2025-11-23 11:22:06 +07; 4 days ago
 Invocation: 1c35a29336b346e8b553b74a4d8fc533
       Docs: https://github.com/prometheus-pve/prometheus-pve-exporter
   Main PID: 10509 (pve_exporter)
      Tasks: 4 (limit: 75893)
     Memory: 44.4M (peak: 45.2M)
        CPU: 27min 52.526s
     CGroup: /system.slice/prometheus-pve-exporter.service
             ├─10509 /usr/bin/python3 /usr/local/bin/pve_exporter --config.file=/etc/prometheus-pve-exporter/pve.yml --web.listen-address=0.0.0.0:9221
             └─10550 /usr/bin/python3 /usr/local/bin/pve_exporter --config.file=/etc/prometheus-pve-exporter/pve.yml --web.listen-address=0.0.0.0:9221
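
Before moving on to Prometheus, you can query the exporter directly. A minimal check, using the same target host:port that the Prometheus scrape config in the next step will use:

# On the grafana-stack LXC
curl -s "http://localhost:9221/pve?target=192.168.100.4:8006" | grep pve_up
# Non-empty pve_up lines mean the exporter can reach the Proxmox API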

Step 6: Configure Prometheus Scraping

Update Prometheus to scrape the PVE Exporter correctly.

# Edit Prometheus config
nano /etc/prometheus/prometheus.yml

Add/update the Proxmox job:

scrape_configs:
  # ──────────────────────────────────────────────────────────────
  # Proxmox VE monitoring via pve-exporter (runs inside the LXC)
  # ──────────────────────────────────────────────────────────────
  - job_name: 'proxmox'                  # Friendly name shown in Prometheus/Grafana
    metrics_path: '/pve'                 # Endpoint where pve-exporter serves Proxmox metrics
    params:
      target: ['192.168.100.4:8006']     # Your Proxmox node (or cluster) + GUI port
                                         # Supports multiple nodes: ['node1:8006','node2:8006']
    static_configs:
      - targets: ['localhost:9221']      # Where pve-exporter is listening inside this LXC
        labels:
          service: 'proxmox-pve'         # Custom label – helps filtering in Grafana
          instance: 'pve-host'           # Logical name for your cluster/node

Reload Prometheus:

# Restart Prometheus
systemctl restart prometheus
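
If promtool is available (it ships alongside Prometheus), you can also validate the configuration file; this catches YAML mistakes that would otherwise only surface in the service logs:

# Validate the main Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml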

Check the Prometheus targets page at http://192.168.100.40:9090/targets; the proxmox job should show the UP state.

prometheus targets

For another check in the same Prometheus UI, open the Query tab, type pve_cpu_usage_limit, and press Execute; it should return the CPU usage limit metrics:

prometheus query pve cpu usage limit

This check is important because it confirms your setup is working correctly. A quick recap:

  • Proxmox server (v9.1.1)
    • PVE user: grafana-user (Role as PVEAuditor)
    • API Token: grafana-token
  • Grafana-LXC (container)
    • Grafana (v12.3.0)
    • AlertManager (v0.29.0)
    • Loki (v3.6.0)
    • Prometheus (v3.7.3)
    • Prometheus PVE Exporter (v3.5.5)
    • Configurations:
      • /etc/prometheus-pve-exporter/pve.yml
      • /etc/prometheus/prometheus.yml

Note: There is no agent installed on the Proxmox server itself.

Step 7: Import Grafana Dashboard

Access Grafana and import the official Proxmox dashboard:

  • Login to Grafana via the URL: http://192.168.100.40:3000, default credentials: admin/admin
  • Click DashboardsNewImport
  • Enter Dashboard ID: 10347
  • Click Load
  • Select Prometheus as the datasource
  • Click Import
grafana dashboard pve host
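
The import step above assumes a Prometheus datasource already exists. If the install script did not provision one, you can add it in the UI (Connections → Data sources) or drop a provisioning file; the sketch below assumes the default Grafana provisioning path and a hypothetical file name:

# On the grafana-stack LXC
cat > /etc/grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF

systemctl restart grafana-server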

Step 8: Set Up Alerting

In this demo, we use both Slack and Telegram to send notifications.

Create Telegram Bot

  • Open Telegram, search for @BotFather
  • Send /newbot
  • Follow prompts to create bot
  • Save the Bot Token
  • Start chat with your bot
  • Get Chat ID from: https://api.telegram.org/bot<TOKEN>/getUpdates
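
To confirm the bot token and chat ID before handing them to AlertManager, you can send a test message straight through the Telegram Bot API (same placeholders as above):

# Replace <TOKEN> and <CHAT_ID> with your values
curl -s "https://api.telegram.org/bot<TOKEN>/sendMessage" \
  -d chat_id=<CHAT_ID> \
  -d text="Test message from grafana-stack"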

Create Slack Webhook (Optional)

  • Go to https://api.slack.com/apps
  • Create New App → From scratch
  • Enable Incoming Webhooks
  • Add webhook to channel (e.g., #infrastructure-alerts)
  • Save the Webhook URL
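
The webhook can be tested the same way with a single curl (replace the placeholder URL with the one Slack generated for your channel):

curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test message from grafana-stack"}' \
  https://hooks.slack.com/services/XXX/YYY/ZZZ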

Configure Alert Rules

Understanding the Alert Pipeline

Before configuring alerts, it’s important to understand the two-stage alerting architecture:

Prometheus Alert Rules (What to monitor):

  • Define WHEN alerts should fire based on metric conditions
  • Evaluate expressions like “CPU > 85%” or “Disk > 90%”
  • Add labels to categorize alerts (severity, notification_channel)
  • Run continuously at specified intervals

AlertManager Configuration (How to notify):

  • Defines WHERE to send alerts (Slack, Telegram, email)
  • Routes alerts based on labels (e.g., notification_channel: slack)
  • Groups similar alerts together to reduce noise
  • Handles deduplication, silencing, and inhibition rules

Why this separation?

  • Prometheus focuses on metric evaluation and detection
  • AlertManager handles the complex logic of notification routing, grouping, and delivery
  • Allows multiple Prometheus instances to share one AlertManager
  • Provides flexibility to change notification channels without modifying detection rules

Our setup overview:

  • Slack receives critical infrastructure alerts (host CPU, memory, disk)
  • Telegram receives operational alerts (storage, VMs, containers)
  • Alert rules include both warning (85%) and critical (95%) thresholds
  • Inhibition rules suppress warning alerts when critical alerts are firing
flowchart TD
  subgraph Prometheus["📊 Prometheus"]
      A[Metrics Collection<br/>PVE Exporter] --> B[Alert Rules Evaluation<br/>Every 30s-1m]
      B --> C{Condition<br/>Met?}
  end

  C -->|"Yes"| D[Send Alert to AlertManager]
  C -->|"No"| E[Continue Monitoring]
  E --> A

  subgraph AlertManager["🔔 AlertManager"]
      D --> F[Receive Alerts]
      F --> G[Grouping & Deduplication<br/>group_wait: 10s-30s]
      G --> H{Route by<br/>Label}
      H --> I[Apply Inhibition Rules<br/>Suppress warnings if critical firing]
  end

  subgraph Examples["Example Alert Conditions"]
      J1["CPU > 85% (warning)<br/>label: notification_channel=slack"]
      J2["Storage > 80% (warning)<br/>label: notification_channel=telegram"]
      J3["VM Down<br/>label: notification_channel=telegram"]
  end

  subgraph Notifications["📬 Notification Channels"]
      K["🔵 Slack<br/>Critical Infrastructure<br/>• Host CPU/Memory/Disk<br/>• Repeat every 1h"]
      L["💬 Telegram<br/>Operational Alerts<br/>• Storage Usage<br/>• VM/LXC Status<br/>• Repeat every 2h"]
  end

  I -->|"notification_channel:<br/>slack"| K
  I -->|"notification_channel:<br/>telegram"| L

  style Prometheus fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
  style AlertManager fill:#fff4e1,stroke:#ff9900,stroke-width:2px
  style Examples fill:#f0f0f0,stroke:#666,stroke-width:1px,stroke-dasharray: 5 5
  style Notifications fill:#e8f5e9,stroke:#00aa00,stroke-width:2px
  style K fill:#4a90e2,color:#fff,stroke:#2563eb,stroke-width:2px
  style L fill:#0088cc,color:#fff,stroke:#0066aa,stroke-width:2px
  style C fill:#ffd700,stroke:#ff8800
  style H fill:#ffd700,stroke:#ff8800

Create Prometheus alert rules:

# Create alert rules file
nano /etc/prometheus/rules/proxmox.yml
groups:
  # Group 1: Host Alerts (Slack - Critical Infrastructure)
  - name: proxmox_host_alerts
    interval: 30s
    rules:
      - alert: ProxmoxHostDown                # Node unreachable
        expr: pve_up{id="node/pve"} == 0
        for: 1m
        labels: { severity: critical, notification_channel: slack }
        annotations:
          summary: "Proxmox host is down"
          description: "Proxmox host 'pve' is unreachable or down for >1min."

      - alert: ProxmoxHighCPU                  # Warning at 85%
        expr: pve_cpu_usage_ratio{id="node/pve"} > 0.85
        for: 5m
        labels: { severity: warning, notification_channel: slack }
        annotations:
          summary: "High CPU usage on Proxmox host"
          description: "CPU usage is {{ $value | humanizePercentage }} (>85%)."

      - alert: ProxmoxCriticalCPU              # Critical at 95%
        expr: pve_cpu_usage_ratio{id="node/pve"} > 0.95
        for: 2m
        labels: { severity: critical, notification_channel: slack }
        annotations:
          summary: "CRITICAL CPU usage on Proxmox host"
          description: "CPU usage is {{ $value | humanizePercentage }} (>95%)."

      - alert: ProxmoxHighMemory               # Warning at 85%
        expr: (pve_memory_usage_bytes{id="node/pve"} / pve_memory_size_bytes{id="node/pve"}) > 0.85
        for: 5m
        labels: { severity: warning, notification_channel: slack }

      - alert: ProxmoxCriticalMemory           # Critical at 95%
        expr: (pve_memory_usage_bytes{id="node/pve"} / pve_memory_size_bytes{id="node/pve"}) > 0.95
        for: 2m
        labels: { severity: critical, notification_channel: slack }

      - alert: ProxmoxHighDiskUsage            # Root disk warning at 80%
        expr: (pve_disk_usage_bytes{id="node/pve"} / pve_disk_size_bytes{id="node/pve"}) > 0.80
        for: 10m
        labels: { severity: warning, notification_channel: slack }

      - alert: ProxmoxCriticalDiskUsage       # Critical at 90%
        expr: (pve_disk_usage_bytes{id="node/pve"} / pve_disk_size_bytes{id="node/pve"}) > 0.90
        for: 5m
        labels: { severity: critical, notification_channel: slack }

  # Group 2: Storage Alerts (Telegram - Operational)
  - name: proxmox_storage_alerts
    interval: 1m
    rules:
      - alert: ProxmoxStorageHighUsage         # Any ZFS/NFS storage >80%
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) > 0.80
        for: 10m
        labels: { severity: warning, notification_channel: telegram }
        annotations:
          summary: "High usage on storage {{ $labels.storage }}"

      - alert: ProxmoxStorageCriticalUsage     # >90%
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) > 0.90
        for: 5m
        labels: { severity: critical, notification_channel: telegram }

  # Group 3: VM/LXC Alerts (Telegram - Operational)
  - name: proxmox_vm_alerts
    interval: 1m
    rules:
      - alert: ProxmoxVMDown                   # Any VM or LXC down
        expr: pve_up{id=~"(qemu|lxc)/.*", template="0"} == 0
        for: 5m
        labels: { severity: warning, notification_channel: telegram }
        annotations:
          summary: "VM/Container {{ $labels.name }} is down"

      - alert: ProxmoxVMHighCPU                # Guest CPU >90%
        expr: pve_cpu_usage_ratio{id=~"(qemu|lxc)/.*", template="0"} > 0.90
        for: 10m
        labels: { severity: warning, notification_channel: telegram }

      - alert: ProxmoxVMHighMemory             # Guest memory >90%
        expr: (pve_memory_usage_bytes{id=~"(qemu|lxc)/.*", template="0"} / pve_memory_size_bytes{id=~"(qemu|lxc)/.*", template="0"}) > 0.90
        for: 10m
        labels: { severity: warning, notification_channel: telegram }

Configure AlertManager

nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

# Routing tree - directs alerts to receivers based on labels
route:
  # Default grouping and timing
  group_by: ['alertname', 'severity', 'alert_group']
  group_wait: 10s        # Wait before sending first notification
  group_interval: 10s    # Wait before sending notifications for new alerts in group
  repeat_interval: 12h   # Resend notification every 12 hours if still firing

  # Default receiver for unmatched alerts
  receiver: 'telegram-default'

  # Child routes - matched in order, first match wins
  routes:
    # Route 1: Slack for host_alerts (critical infrastructure)
    - match:
        notification_channel: slack
      receiver: 'slack-critical'
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h  # Repeat every hour for critical infrastructure
      continue: false      # Stop matching after this route

    # Route 2: Telegram for telegram channel alerts (storage & VMs)
    - match:
        notification_channel: telegram
      receiver: 'telegram-operational'
      group_wait: 30s
      group_interval: 30s
      repeat_interval: 2h  # Repeat every 2 hours for operational alerts
      continue: false

# Notification receivers
receivers:
  # Slack receiver for critical infrastructure (host alerts)
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'SLACK_WEBHOOK'
        channel: '#alerts-test'
        username: 'Prometheus AlertManager'
        icon_emoji: ':warning:'
        title: '{{ .GroupLabels.alertname }} - {{ .GroupLabels.severity | toUpper }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Component:* {{ .Labels.component }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
        send_resolved: true
        # Optional: color the message by alert status
        # color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  # Telegram receiver for operational alerts (storage & VM)
  - name: 'telegram-operational'
    telegram_configs:
      - bot_token: 'BOT_TOKEN'
        chat_id: CHAT_ID_NUMBERS
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.severity | toUpper }}: {{ .Labels.alertname }}</b>

          {{ .Annotations.summary }}

          <b>Details:</b>
          {{ .Annotations.description }}

          <b>Component:</b> {{ .Labels.component }}
          <b>Group:</b> {{ .Labels.alert_group }}
          <b>Status:</b> {{ .Status }}
          {{ end }}
        send_resolved: true

  # Default Telegram receiver (fallback)
  - name: 'telegram-default'
    telegram_configs:
      - bot_token: 'BOT_TOKEN'
        chat_id: CHAT_ID_NUMBERS
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.severity | toUpper }}: {{ .Labels.alertname }}</b>

          {{ .Annotations.summary }}
          {{ .Annotations.description }}

          <b>Component:</b> {{ .Labels.component }}
          {{ end }}
        send_resolved: true

# Inhibition rules - suppress alerts based on other alerts
inhibit_rules:
  # If critical alert is firing, suppress warning alerts for same component
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component', 'alertname']

Reload services:

# Validate configs
promtool check rules /etc/prometheus/rules/proxmox.yml
amtool check-config /etc/alertmanager/alertmanager.yml

# Reload
curl -X POST http://localhost:9090/-/reload
systemctl restart alertmanager

Image placeholder: Screenshot of Telegram showing test alert notification

Step 9: Test Alerts

Send test alerts to verify routing works. AlertManager 0.27 and later only expose the v2 API, so post to /api/v2/alerts with a JSON content type.

# Test Slack alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
  "labels": {
    "alertname": "TestSlack",
    "notification_channel": "slack",
    "severity": "warning"
  },
  "annotations": {
    "summary": "Test Slack Alert",
    "description": "This is a test"
  }
}]'

# Test Telegram alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
  "labels": {
    "alertname": "TestTelegram",
    "notification_channel": "telegram",
    "severity": "warning"
  },
  "annotations": {
    "summary": "Test Telegram Alert",
    "description": "This is a test"
  }
}]'
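
You can also inspect AlertManager from the shell with amtool, which is installed alongside it. The routes test subcommand dry-runs the routing tree against a label set; a small sketch:

# List alerts currently held by AlertManager
amtool alert query --alertmanager.url=http://localhost:9093

# Which receiver would this label set hit?
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  notification_channel=slack severity=warning
# Expected: slack-critical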

Verify that the test notifications arrive in your Slack channel and Telegram chat, and that the alerts show up in the AlertManager UI:

Image placeholder: Screenshot showing alerts in Prometheus UI

Monitoring Metrics

Key metrics available:

Host Metrics:

  • pve_cpu_usage_ratio - CPU usage (0-1)
  • pve_memory_usage_bytes / pve_memory_size_bytes - Memory usage
  • pve_disk_usage_bytes / pve_disk_size_bytes - Disk usage
  • pve_up{id="node/pve"} - Host availability

VM/Container Metrics:

  • pve_up{id=~"(qemu|lxc)/.*"} - VM/Container status
  • pve_cpu_usage_ratio{id=~"qemu/.*"} - Per-VM CPU
  • pve_memory_usage_bytes{id=~"qemu/.*"} - Per-VM memory
  • pve_guest_info - VM/Container metadata

Storage Metrics:

  • pve_disk_usage_bytes{id=~"storage/.*"} - Storage pool usage
  • pve_storage_info - Storage pool information

Image placeholder: Screenshot of Grafana showing multiple metric panels
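
Any of these metrics can also be spot-checked from the shell via the Prometheus HTTP API; a minimal example using the node CPU metric referenced in the alert rules:

# Query Prometheus directly (curl handles the URL encoding)
curl -s -G 'http://192.168.100.40:9090/api/v1/query' \
  --data-urlencode 'query=pve_cpu_usage_ratio{id="node/pve"}'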

Alert Rules Summary

Alert                   | Threshold | Duration | Channel  | Severity
ProxmoxHighCPU          | >85%      | 5 min    | Slack    | Warning
ProxmoxCriticalCPU      | >95%      | 2 min    | Slack    | Critical
ProxmoxHighMemory       | >85%      | 5 min    | Slack    | Warning
ProxmoxCriticalMemory   | >95%      | 2 min    | Slack    | Critical
ProxmoxHighDiskUsage    | >80%      | 10 min   | Slack    | Warning
ProxmoxStorageHighUsage | >80%      | 10 min   | Telegram | Warning
ProxmoxVMDown           | == 0      | 5 min    | Telegram | Warning
ProxmoxVMHighCPU        | >90%      | 10 min   | Telegram | Warning

Access URLs

Service      | URL                                | Credentials
Grafana      | http://192.168.100.40:3000         | admin / admin
Prometheus   | http://192.168.100.40:9090         | -
AlertManager | http://192.168.100.40:9093         | -
PVE Exporter | http://192.168.100.40:9221/metrics | -

Maintenance

Check service status:

systemctl status grafana-server prometheus loki alertmanager prometheus-pve-exporter

View logs:

journalctl -u prometheus -f
journalctl -u alertmanager -f
journalctl -u prometheus-pve-exporter -f

Backup configurations:

tar -czf grafana-stack-backup.tar.gz \
  /etc/prometheus \
  /etc/loki \
  /etc/alertmanager \
  /etc/grafana \
  /etc/prometheus-pve-exporter
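
Restoring is the reverse operation; a minimal sketch that assumes the archive was created with the command above (tar stores the paths relative to /):

# Unpack on a fresh container, then restart the services
tar -xzf grafana-stack-backup.tar.gz -C /
systemctl restart prometheus loki alertmanager grafana-server prometheus-pve-exporter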

Update retention:

# Prometheus (default: 15 days)
nano /etc/systemd/system/prometheus.service
# Edit: --storage.tsdb.retention.time=30d
systemctl daemon-reload
systemctl restart prometheus
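
To confirm the new retention flag is in effect after the restart, one quick check is the running process arguments:

# Show the retention flag of the running Prometheus process
ps -o args= -C prometheus | tr ' ' '\n' | grep retention
# Expected: --storage.tsdb.retention.time=30d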

Troubleshooting

No metrics in Grafana:

  • Check Prometheus targets: http://192.168.100.40:9090/targets
  • Verify PVE Exporter: curl -s "http://localhost:9221/pve?target=192.168.100.4:8006" | grep pve_up
  • Check Proxmox permissions: pveum acl list | grep grafana-user

Alerts not firing:

  • Check the rules are loaded: http://192.168.100.40:9090/rules and http://192.168.100.40:9090/alerts
  • Validate configs: promtool check rules /etc/prometheus/rules/proxmox.yml and amtool check-config /etc/alertmanager/alertmanager.yml
  • Check AlertManager received them: http://192.168.100.40:9093/#/alerts

PVE Exporter shows pve_up = 0:

  • Verify Proxmox is accessible: curl -k https://192.168.100.4:8006
  • Check API token is correct in /etc/prometheus-pve-exporter/pve.yml
  • Verify user has PVEAuditor role: pveum acl list

Import these additional Grafana dashboards:

  • 10347 - Proxmox VE (official, comprehensive)
  • 13865 - JMeter Load Testing (if using JMeter)
  • 1860 - Node Exporter (Linux host metrics)
  • 15356 - Proxmox Multi-Server (for multiple hosts)

Conclusion

You now have a complete monitoring solution for Proxmox VE with:

  • ✅ Real-time metrics visualization in Grafana
  • ✅ 15 days of metrics history in Prometheus
  • ✅ 7 days of log storage in Loki
  • ✅ Smart alert routing (Slack for critical, Telegram for operational)
  • ✅ Comprehensive dashboards for host, VM, and storage monitoring

This setup provides enterprise-grade monitoring for your Proxmox infrastructure with minimal resource overhead (4 CPU, 8GB RAM for the entire stack).

Resources