.. _minio-metrics-collect-using-prometheus:

========================================
Monitoring and Alerting using Prometheus
========================================

.. default-domain:: minio

.. contents:: Table of Contents
   :local:
   :depth: 1

.. container:: extlinks-video

   - `Monitoring with MinIO and Prometheus: Overview `__
   - `Monitoring with MinIO and Prometheus: Lab `__

MinIO publishes cluster, node, bucket, and resource metrics using the :prometheus-docs:`Prometheus Data Model `.
The procedure on this page documents the following:

- Configuring a Prometheus service to scrape and display metrics from a MinIO deployment
- Configuring an Alert Rule on a MinIO Metric to trigger an Alertmanager action

These instructions use :ref:`version 2 metrics. `
For more about metrics API versions, see :ref:`Metrics and alerts. `

.. admonition:: Prerequisites
   :class: note

   This procedure requires the following:

   - An existing :prometheus-docs:`Prometheus deployment ` with backing :prometheus-docs:`Alertmanager `
   - An existing MinIO deployment with network access to the Prometheus deployment
   - An :mc:`mc` installation on your local host configured to :ref:`access ` the MinIO deployment

Configure Prometheus to Collect and Alert using MinIO Metrics
-------------------------------------------------------------

1) Generate the Scrape Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the :mc:`mc admin prometheus generate` command to generate the scrape configuration that Prometheus uses when making scraping requests:

.. tab-set::

   .. tab-item:: MinIO Server

      The following command generates a scrape configuration for MinIO cluster metrics.

      .. code-block:: shell
         :class: copyable

         mc admin prometheus generate ALIAS

      Replace :mc-cmd:`ALIAS ` with the :mc:`alias ` of the MinIO deployment.

      The command returns output similar to the following:

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/cluster
           scheme: https
           static_configs:
           - targets: [minio.example.net]

   .. tab-item:: Nodes

      The following command generates a scrape configuration for metrics from each node of the MinIO Server.

      .. code-block:: shell
         :class: copyable

         mc admin prometheus generate ALIAS node

      Replace :mc-cmd:`ALIAS ` with the :mc:`alias ` of the MinIO deployment.

      The command returns output similar to the following:

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job-node
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/node
           scheme: https
           static_configs:
           - targets: [minio-1.example.net, minio-2.example.net, minio-N.example.net]

   .. tab-item:: Buckets

      The following command generates a scrape configuration for bucket metrics on the MinIO Server.

      .. code-block:: shell
         :class: copyable

         mc admin prometheus generate ALIAS bucket

      Replace :mc-cmd:`ALIAS ` with the :mc:`alias ` of the MinIO deployment.

      The command returns output similar to the following:

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job-bucket
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/bucket
           scheme: https
           static_configs:
           - targets: [minio.example.net]

   .. tab-item:: Resources

      .. versionadded:: RELEASE.2023-10-07T15-07-38Z

      The following command generates a scrape configuration for resource metrics on the MinIO Server.

      .. code-block:: shell
         :class: copyable

         mc admin prometheus generate ALIAS resource

      Replace :mc-cmd:`ALIAS ` with the :mc:`alias ` of the MinIO deployment.

      The command returns output similar to the following:

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job-resource
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/resource
           scheme: https
           static_configs:
           - targets: [minio.example.net]

- Set an appropriate ``scrape_interval`` value to ensure each scraping operation completes before the next one begins.
  The recommended value is 60 seconds.

  Some deployments require a longer scrape interval due to the number of metrics being scraped.
  To reduce the load on your MinIO and Prometheus servers, choose the longest interval that meets your monitoring requirements.

- Set the ``job_name`` to a value associated with the MinIO deployment.

  Use a unique value to ensure isolation of the deployment metrics from any others collected by that Prometheus service.

- MinIO deployments started with :envvar:`MINIO_PROMETHEUS_AUTH_TYPE` set to ``"public"`` can omit the ``bearer_token`` field.

- Set the ``scheme`` to ``http`` for MinIO deployments not using TLS.

- Set the ``targets`` array with a hostname that resolves to the MinIO deployment.

  This can be any single node, or a load balancer/proxy which handles connections to the MinIO nodes.

  For MinIO Tenants on Kubernetes infrastructure, when using a Prometheus deployment in that same cluster you can specify the service DNS name for the ``minio`` service.

  You can otherwise specify the ingress or load balancer endpoint configured to route connections to and from the MinIO Tenant.

2) Restart Prometheus with the Updated Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Append the desired ``scrape_configs`` job generated in the previous step to the Prometheus configuration file:

.. tab-set::

   .. tab-item:: Cluster

      Cluster metrics aggregate node-level metrics and, where appropriate, attach labels to metrics for the originating node.

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/cluster
           scheme: https
           static_configs:
           - targets: [minio.example.net]

   .. tab-item:: Nodes

      Node metrics are specific to node-level monitoring.
      You need to list all MinIO nodes for this configuration.

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job-node
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/node
           scheme: https
           static_configs:
           - targets: [minio-1.example.net, minio-2.example.net, minio-N.example.net]

   .. tab-item:: Bucket

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job-bucket
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/bucket
           scheme: https
           static_configs:
           - targets: [minio.example.net]

   .. tab-item:: Resource

      .. code-block:: yaml
         :class: copyable

         global:
           scrape_interval: 60s

         scrape_configs:
         - job_name: minio-job-resource
           bearer_token: TOKEN
           metrics_path: /minio/v2/metrics/resource
           scheme: https
           static_configs:
           - targets: [minio.example.net]

Start Prometheus using the configuration file:

.. code-block:: shell
   :class: copyable

   prometheus --config.file=prometheus.yaml

3) Analyze Collected Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Prometheus includes an :prometheus-docs:`expression browser `.
You can execute queries here to analyze the collected metrics.

.. tab-set::

   .. tab-item:: Examples

      The following query examples return the samples that Prometheus collected over the last five minutes for a scrape job named ``minio-job``:

      .. code-block:: shell
         :class: copyable

         minio_node_drive_free_bytes{job="minio-job"}[5m]
         minio_node_drive_free_inodes{job="minio-job"}[5m]
         minio_node_drive_latency_us{job="minio-job"}[5m]
         minio_node_drive_offline_total{job="minio-job"}[5m]
         minio_node_drive_online_total{job="minio-job"}[5m]
         minio_node_drive_total{job="minio-job"}[5m]
         minio_node_drive_total_bytes{job="minio-job"}[5m]
         minio_node_drive_used_bytes{job="minio-job"}[5m]
         minio_node_drive_errors_timeout{job="minio-job"}[5m]
         minio_node_drive_errors_availability{job="minio-job"}[5m]
         minio_node_drive_io_waiting{job="minio-job"}[5m]

   .. tab-item:: Recommended Metrics

      MinIO recommends the following as a basic set of metrics to monitor.

      See :ref:`minio-metrics-and-alerts` for information about all available metrics.

      .. list-table::
         :header-rows: 1
         :widths: 40 60
         :width: 100%

         * - Metric
           - Description

         * - ``minio_node_drive_free_bytes``
           - Total storage available on a drive.

         * - ``minio_node_drive_free_inodes``
           - Total free inodes.

         * - ``minio_node_drive_latency_us``
           - Average last minute latency in µs for drive API storage operations.

         * - ``minio_node_drive_offline_total``
           - Total drives offline in this node.

         * - ``minio_node_drive_online_total``
           - Total drives online in this node.

         * - ``minio_node_drive_total``
           - Total drives in this node.

         * - ``minio_node_drive_total_bytes``
           - Total storage on a drive.

         * - ``minio_node_drive_used_bytes``
           - Total storage used on a drive.

         * - ``minio_node_drive_errors_timeout``
           - Total number of drive timeout errors since server start.

         * - ``minio_node_drive_errors_availability``
           - Total number of drive I/O errors, permission denied and timeouts since server start.

         * - ``minio_node_drive_io_waiting``
           - Total number of I/O operations waiting on drive.

4) Configure an Alert Rule using MinIO Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You must configure :prometheus-docs:`Alert Rules ` on the Prometheus deployment to trigger alerts based on collected MinIO metrics.

The following example alert rule file provides a baseline of alerts for a MinIO deployment.
You can modify these examples or use them as guidance in building your own alerts.

.. code-block:: yaml
   :class: copyable

   groups:
   - name: minio-alerts
     rules:
     - alert: NodesOffline
       expr: avg_over_time(minio_cluster_nodes_offline_total{job="minio-job"}[5m]) > 0
       for: 10m
       labels:
         severity: warn
       annotations:
         summary: "Node down in MinIO deployment"
         description: "Node(s) in cluster {{ $labels.instance }} offline for more than 10 minutes"

     - alert: DisksOffline
       expr: avg_over_time(minio_cluster_drive_offline_total{job="minio-job"}[5m]) > 0
       for: 10m
       labels:
         severity: warn
       annotations:
         summary: "Disks down in MinIO deployment"
         description: "Disk(s) in cluster {{ $labels.instance }} offline for more than 10 minutes"

In the Prometheus configuration, specify the path to the alert file in the ``rule_files`` key:

.. code-block:: yaml

   rule_files:
   - minio-alerting.yml

Once triggered, Prometheus sends the alert to the configured Alertmanager service.

Dashboards
----------

MinIO provides Grafana Dashboards to display metrics collected by Prometheus.
For more information, see :ref:`minio-grafana`.
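
Grafana needs a Prometheus data source before any dashboard can display the collected metrics.
The following is a minimal sketch of a Grafana data source provisioning file, not a MinIO-provided configuration.
The data source name and the ``prometheus.example.net:9090`` address are placeholders assumed for illustration; replace them with the values for your own Prometheus deployment.

.. code-block:: yaml
   :class: copyable

   # Hypothetical file placed in the Grafana provisioning/datasources directory
   apiVersion: 1

   datasources:
   - name: Prometheus        # assumed display name, choose any
     type: prometheus
     access: proxy           # Grafana proxies queries to Prometheus
     url: http://prometheus.example.net:9090   # placeholder address
     isDefault: true

You can alternatively add the data source through the Grafana user interface.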