diff --git a/source/administration/minio-console.rst b/source/administration/minio-console.rst index 0021f092..32ca114e 100644 --- a/source/administration/minio-console.rst +++ b/source/administration/minio-console.rst @@ -283,6 +283,8 @@ Some subsections may not be visible if the authenticated user does not have the Use the :guilabel:`Users` and :guilabel:`Groups` views to assign a created policy to users and groups, respectively. +.. _minio-console-monitoring: + Monitoring ---------- @@ -295,25 +297,23 @@ Some subsections may not be visible if the authenticated user does not have the .. tab-item:: Metrics - .. image:: /images/minio-console/console-metrics.png + .. image:: /images/minio-console/console-metrics-simple.png :width: 600px - :alt: MinIO Console Metrics displaying detailed data using Prometheus + :alt: MinIO Console Metrics displaying point-in-time data :align: center The Console :guilabel:`Dashboard` section displays metrics for the MinIO deployment. - - The Console depends on a :ref:`configured Prometheus service ` to generate the detailed metrics shown above. + The default view provides a high-level overview of the deployment status, including the uptime and availability of individual servers and drives. - The default metrics view provides a high-level overview of the deployment status, including the uptime and availability of individual servers and drives. + The Console also supports displaying time-series and historical data by querying a :prometheus-docs:`Prometheus ` service configured to scrape data from the MinIO deployment. + Specifically, the MinIO Console uses the :prometheus-docs:`Prometheus query API ` to retrieve stored metrics data and display historical metrics: - .. image:: /images/minio-console/console-metrics-simple.png + .. image:: /images/minio-console/console-metrics.png :width: 600px - :alt: MinIO Console Metrics displaying simplified data + :alt: MinIO Console Metrics displaying historical data :align: center - This view requires configuring a Prometheus service to scrape the deployment metrics. - You can download these metrics as a ``.png`` image or ``.csv`` file. - See :ref:`minio-metrics-collect-using-prometheus` for complete instructions. + See :ref:`minio-console-metrics` for more information on the historical metric visualization. ..
tab-item:: Logs diff --git a/source/default-conf.py b/source/default-conf.py index e5ec7d5f..d5b3a3be 100644 --- a/source/default-conf.py +++ b/source/default-conf.py @@ -79,6 +79,7 @@ extlinks = { 'podman-git' : ('https://github.com/containers/podman/%s',''), 'docker-docs' : ('https://docs.docker.com/%s', ''), 'openshift-docs' : ('https://docs.openshift.com/container-platform/4.11/%s', ''), + 'influxdb-docs' : ('https://docs.influxdata.com/influxdb/v2.4/%s',''), } diff --git a/source/images/minio-console/console-metrics-simple.png b/source/images/minio-console/console-metrics-simple.png index c6df840f..9f58d46e 100644 Binary files a/source/images/minio-console/console-metrics-simple.png and b/source/images/minio-console/console-metrics-simple.png differ diff --git a/source/images/minio-console/console-metrics.png b/source/images/minio-console/console-metrics.png index 1a7fa412..996621bb 100644 Binary files a/source/images/minio-console/console-metrics.png and b/source/images/minio-console/console-metrics.png differ diff --git a/source/operations/monitoring.rst b/source/operations/monitoring.rst index 139baf22..d7e436da 100644 --- a/source/operations/monitoring.rst +++ b/source/operations/monitoring.rst @@ -1,5 +1,5 @@ ===================== -Prometheus Monitoring +Monitoring and Alerts ===================== .. default-domain:: minio @@ -12,22 +12,27 @@ Metrics and Alerts ------------------ MinIO provides point-in-time metrics on cluster status and operations. -MinIO publishes collected metrics data using Prometheus-compatible data structures. +The :ref:`MinIO Console ` provides a graphical display of these metrics. -For alerts, time-series metric data, or additional metrics, MinIO can leverage `Prometheus `__. -Prometheus is an Open Source systems and service monitoring system which supports analyzing and alerting based on collected metrics. -The Prometheus ecosystem includes multiple :prometheus-docs:`integrations `, allowing wide latitude in processing and storing collected metrics. +For historical metrics and analytics, MinIO publishes cluster and node metrics using the :prometheus-docs:`Prometheus Data Model `. +You can use any scraping tool which supports that data model to pull metrics data from MinIO for further analysis and alerting. -- MinIO publishes Prometheus-compatible scraping endpoints for cluster and node-level metrics. - Any Prometheus-compatible scraping software can ingest and process MinIO metrics for analysis, visualization, and alerting. - See :ref:`minio-metrics-and-alerts-endpoints` for more information. +The following table lists tutorials for integrating MinIO metrics with select third-party monitoring software. -- For alerts, use Prometheus :prometheus-docs:`Alerting Rules ` and the - :prometheus-docs:`Alert Manager ` to trigger alerts based on collected metrics. - See :ref:`minio-metrics-and-alerts-alerting` for more information. +.. list-table:: + :stub-columns: 1 + :widths: 30 70 + :width: 100% -When configured, the :ref:`MinIO Console ` shows some metrics in the :guilabel:`Monitoring > Metrics` page. -You can download these metrics as either ``.png`` images or ``.csv`` files. + * - :ref:`minio-metrics-collect-using-prometheus` + - Configure Prometheus to Monitor and Alert for a MinIO deployment + + Configure MinIO to query the Prometheus deployment to enable historical metrics via the MinIO Console + + * - :ref:`minio-metrics-influxdb` + - Configure InfluxDB to Monitor and Alert for a MinIO deployment. 
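+
+Whichever tool you choose, you can confirm that the metrics endpoint is reachable before configuring a scraper.
+The following sketch assumes a deployment reachable at ``https://minio.example.net:9000`` that was started with :envvar:`MINIO_PROMETHEUS_AUTH_TYPE` set to ``"public"``; the hostname and port are placeholders:
+
+.. code-block:: shell
+   :class: copyable
+
+   # Request the cluster-level metrics endpoint directly.
+   curl https://minio.example.net:9000/minio/v2/metrics/cluster
+
+   # For deployments requiring authentication, pass the bearer token
+   # generated by `mc admin prometheus generate` instead:
+   # curl -H "Authorization: Bearer TOKEN" https://minio.example.net:9000/minio/v2/metrics/cluster
+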
+ +Other metrics and analytics software suites which support the Prometheus data model may work regardless of their inclusion on the above list. Logging ------- @@ -58,6 +63,6 @@ See :ref:`minio-healthcheck-api` for more information. :titlesonly: :hidden: - /operations/monitoring/collect-minio-metrics-using-prometheus + /operations/monitoring/metrics-and-alerts /operations/monitoring/minio-logging /operations/monitoring/healthcheck-probe \ No newline at end of file diff --git a/source/operations/monitoring/collect-minio-metrics-using-prometheus.rst b/source/operations/monitoring/collect-minio-metrics-using-prometheus.rst index c68122b3..19793daa 100644 --- a/source/operations/monitoring/collect-minio-metrics-using-prometheus.rst +++ b/source/operations/monitoring/collect-minio-metrics-using-prometheus.rst @@ -1,9 +1,8 @@ .. _minio-metrics-collect-using-prometheus: -.. _minio-metrics-and-alerts: -====================================== -Collect MinIO Metrics Using Prometheus -====================================== +======================================== +Monitoring and Alerting using Prometheus +======================================== .. default-domain:: minio @@ -11,60 +10,46 @@ Collect MinIO Metrics Using Prometheus :local: :depth: 1 -MinIO leverages `Prometheus `__ for metrics and alerts. -MinIO publishes Prometheus-compatible scraping endpoints for cluster and -node-level metrics. See :ref:`minio-metrics-and-alerts-endpoints` for more -information. +MinIO publishes cluster and node metrics using the :prometheus-docs:`Prometheus Data Model `. +The procedure on this page documents the following: -The procedure on this page documents scraping the MinIO metrics -endpoints using a Prometheus instance, including deploying and configuring -a simple Prometheus server for collecting metrics. +- Configuring a Prometheus service to scrape and display metrics from a MinIO deployment +- Configuring an Alert Rule on a MinIO Metric to trigger an AlertManager action -This procedure is not a replacement for the official -:prometheus-docs:`Prometheus Documentation <>`. Any specific guidance -related to configuring, deploying, and using Prometheus is made on a best-effort -basis. +.. admonition:: Prerequisites + :class: note -Requirements ------------- + This procedure requires the following: -Install and Configure ``mc`` with Access to the MinIO Cluster -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + - An existing Prometheus deployment with backing :prometheus-docs:`Alert Manager ` -This procedure uses :mc:`mc` for performing operations on the MinIO -deployment. Install ``mc`` on a machine with network access to the -deployment. See the ``mc`` :ref:`Installation Quickstart ` for -more complete instructions. + - An existing MinIO deployment with network access to the Prometheus deployment -Prometheus Service -~~~~~~~~~~~~~~~~~~ + - An :mc:`mc` installation on your local host configured to :ref:`access ` the MinIO deployment -This procedure provides instruction for deploying Prometheus for rapid local -evaluation and development. All other environments should have an existing -Prometheus or Prometheus-compatible service with access to the MinIO cluster. +.. cond:: k8s -Procedure ---------- + The MinIO Operator supports deploying a :ref:`per-tenant Prometheus instance ` configured to support metrics and visualizations. + This includes automatically configuring the Tenant to enable the :ref:`Tenant Console historical metric view `. 
-1) Generate the Bearer Token -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + You can still use this procedure to configure an external Prometheus service for supporting monitoring and alerting for a MinIO Tenant. + You must configure all necessary network control components, such as Ingress or a Load Balancer, to facilitate access between the Tenant and the Prometheus service. + This procedure assumes your local host machine can access the Tenant via :mc:`mc`. -MinIO by default requires authentication for requests made to the metrics -endpoints. While this step is not required for MinIO deployments started with -:envvar:`MINIO_PROMETHEUS_AUTH_TYPE` set to ``"public"``, you can still use the -command output for retrieving a Prometheus ``scrape_configs`` entry. +Configure Prometheus to Collect and Alert using MinIO Metrics +-------------------------------------------------------------- -Use the :mc-cmd:`mc admin prometheus generate` command to generate a -JWT bearer token for use by Prometheus in making authenticated scraping -requests: +1) Generate the Scrape Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use the :mc-cmd:`mc admin prometheus generate` command to generate the scrape configuration for use by Prometheus in making scraping requests: .. code-block:: shell :class: copyable mc admin prometheus generate ALIAS -Replace :mc-cmd:`ALIAS ` with the -:mc:`alias ` of the MinIO deployment. +Replace :mc-cmd:`ALIAS ` with the :mc:`alias ` of the MinIO deployment. The command returns output similar to the following: @@ -72,31 +57,29 @@ The ``targets`` array can contain the hostname for any node in the deployment. :class: copyable scrape_configs: - - job_name: minio-job + - job_name: minio-job bearer_token: TOKEN metrics_path: /minio/v2/metrics/cluster scheme: https static_configs: - targets: [minio.example.net] -The ``targets`` array can contain the hostname for any node in the deployment. -For clusters with a load balancer managing connections between MinIO nodes, -specify the address of the load balancer. +- Set the ``job_name`` to a value associated with the MinIO deployment. -Specify the output block to the -:prometheus-docs:`scrape_config -` section of -the Prometheus configuration. + Use a unique value to ensure isolation of the deployment metrics from any others collected by that Prometheus service. -2) Configure and Run Prometheus -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +- MinIO deployments started with :envvar:`MINIO_PROMETHEUS_AUTH_TYPE` set to ``"public"`` can omit the ``bearer_token`` field. -Follow the Prometheus :prometheus-docs:`Getting Started -` guide -to download and run Prometheus locally. +- Set the ``scheme`` to ``http`` for MinIO deployments not using TLS. -Append the ``scrape_configs`` job generated in the previous step to the -configuration file: +- Set the ``targets`` array with a hostname that resolves to the MinIO deployment. + + This can be any single node, or a load balancer/proxy which handles connections to the MinIO nodes. + +2) Restart Prometheus with the Updated Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Append the ``scrape_configs`` job generated in the previous step to the configuration file: .. code-block:: yaml :class: copyable @@ -122,10 +105,8 @@ Start the Prometheus cluster using the configuration file: 3) Analyze Collected Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Prometheus includes a -:prometheus-docs:`expression browser -`. You can -execute queries here to analyze the collected metrics. +Prometheus includes an :prometheus-docs:`expression browser `.
+You can execute queries here to analyze the collected metrics. The following query examples return metrics collected by Prometheus: @@ -139,386 +120,65 @@ The following query examples return metrics collected by Prometheus: minio_cluster_capacity_usable_free_bytes{job="minio-job"}[5m] -See :ref:`minio-metrics-and-alerts-available-metrics` for a complete -list of published metrics. +See :ref:`minio-metrics-and-alerts-available-metrics` for a complete list of published metrics. -.. _minio-console-metrics: +4) Configure an Alert Rule using MinIO Metrics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -4) Visualize Collected Metrics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +You must configure :prometheus-docs:`Alert Rules ` on the Prometheus deployment to trigger alerts based on collected MinIO metrics. -The :minio-git:`MinIO Console ` supports visualizing collected metrics from Prometheus. -Specify the URL of the Prometheus service to the :envvar:`MINIO_PROMETHEUS_URL` environment variable to each MinIO server in the deployment: - -.. code-block:: shell - :class: copyable - - export MINIO_PROMETHEUS_URL="https://prometheus.example.net" - -If you set a custom ``job_name`` for the Prometheus scraping job, you must also set :envvar:`MINIO_PROMETHEUS_JOB_ID` to match that job name. - -Restart the deployment using :mc-cmd:`mc admin service restart` to apply the changes. - -The MinIO Console uses the metrics collected by the ``minio-job`` scraping job to populate the Dashboard metrics available from :guilabel:`Monitoring > Metrics`. -You can download the metrics from the MinIO Console as either a ``.png`` image or a ``.csv`` file. - -.. image:: /images/minio-console/console-metrics.png - :width: 600px - :alt: MinIO Console Dashboard displaying Monitoring Data - :align: center - -MinIO also publishes a `Grafana Dashboard `_ for visualizing collected metrics. -For more complete documentation on configuring a Prometheus data source for Grafana, see :prometheus-docs:`Grafana Support for Prometheus `. - -Prometheus includes a :prometheus-docs:`graphing interface ` for visualizing collected metrics. - -.. _minio-metrics-and-alerts-endpoints: - -Metrics -------- - -MinIO provides a scraping endpoint for cluster-level metrics: - -.. code-block:: shell - :class: copyable - - http://minio.example.net:9000/minio/v2/metrics/cluster - -Replace ``http://minio.example.net`` with the hostname of any node in the MinIO -deployment. For deployments with a load balancer managing connections between -MinIO nodes, specify the address of the load balancer. - -Create a new :prometheus-docs:`scraping configuration -` to begin -collecting metrics from the MinIO deployment. See -:ref:`minio-metrics-collect-using-prometheus` for a complete tutorial. - -The following example describes a ``scrape_configs`` entry for collecting -cluster metrics. +The following example alert rule files provide a baseline of alerts for a MinIO deployment. +You can modify or otherwise use these examples as guidance in building your own alerts. .. 
code-block:: yaml :class: copyable - scrape_configs: - - job_name: minio-job - bearer_token: - metrics_path: /minio/v2/metrics/cluster - scheme: https - static_configs: - - targets: ['minio.example.net:9000'] + groups: + - name: minio-alerts + rules: + - alert: NodesOffline + expr: avg_over_time(minio_cluster_nodes_offline_total{job="minio-job"}[5m]) > 0 + for: 10m + labels: + severity: warn + annotations: + summary: "Node down in MinIO deployment" + description: "Node(s) in cluster {{ $labels.instance }} offline for more than 5 minutes" -.. list-table:: - :stub-columns: 1 - :widths: 20 80 - :width: 100% + - alert: DisksOffline + expr: avg_over_time(minio_cluster_disk_offline_total{job="minio-job"}[5m]) > 0 + for: 10m + labels: + severity: warn + annotations: + summary: "Disks down in MinIO deployment" + description: "Disk(s) in cluster {{ $labels.instance }} offline for more than 5 minutes" - * - ``job_name`` - - The name of the scraping job. +Specify the path to the alert file in the Prometheus configuration as part of the ``rule_files`` key: - * - ``bearer_token`` - - The JWT token generated by :mc-cmd:`mc admin prometheus generate`. +.. code-block:: yaml - Omit this field if the MinIO deployment was started with - :envvar:`MINIO_PROMETHEUS_AUTH_TYPE` set to ``public``. + global: + scrape_interval: 5s - * - ``targets`` - - The endpoint for the MinIO deployment. You can specify any node in the - deployment for collecting cluster metrics. For clusters with a load - balancer managing connections between MinIO nodes, specify the - address of the load balancer. + rule_files: + - minio-alerting.yml -MinIO by default requires authentication for scraping the metrics endpoints. -Use the :mc-cmd:`mc admin prometheus generate` command to generate the -necessary bearer tokens for use with configuring the -``scrape_configs.bearer_token`` field. You can alternatively disable -metrics endpoint authentication by setting -:envvar:`MINIO_PROMETHEUS_AUTH_TYPE` to ``public``. +Once triggered, Prometheus sends the alert to the configured AlertManager service. -Visualizing Metrics -~~~~~~~~~~~~~~~~~~~ +5) (Optional) Configure MinIO Console to Query Prometheus +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The MinIO Console uses the metrics collected by Prometheus to populate the -Dashboard metrics: +The Console also supports displaying time-series and historical data by querying a :prometheus-docs:`Prometheus ` service configured to scrape data from the MinIO deployment. .. image:: /images/minio-console/console-metrics.png :width: 600px :alt: MinIO Console displaying Prometheus-backed Monitoring Data :align: center -Set the :envvar:`MINIO_PROMETHEUS_URL` environment variable to the URL of the -Prometheus service to allow the Console to retrieve and display collected -metrics. See :ref:`minio-metrics-collect-using-prometheus` for a complete -example. +To enable historical data visualization in MinIO Console, set the following environment variables on each node in the MinIO deployment: -MinIO also publishes a `Grafana Dashboard -`_ for visualizing collected -metrics. For more complete documentation on configuring a Prometheus data source -for Grafana, see :prometheus-docs:`Grafana Support for Prometheus -`. +- Set :envvar:`MINIO_PROMETHEUS_URL` to the URL of the Prometheus service +- Set :envvar:`MINIO_PROMETHEUS_JOB_ID` to the unique job ID assigned to the collected metrics -..
_minio-metrics-and-alerts-available-metrics: - -Available Metrics -~~~~~~~~~~~~~~~~~ - -MinIO publishes the following metrics, where each metric includes a label for -the MinIO server which generated that metric. - -Object Metrics -++++++++++++++ - -.. metric:: minio_bucket_objects_size_distribution - - Distribution of object sizes in the bucket, includes label for the bucket - name. - -Replication Metrics -+++++++++++++++++++ - -These metrics are only populated for MinIO clusters with -:ref:`minio-bucket-replication-serverside` enabled. - -.. metric:: minio_bucket_replication_failed_bytes - - Total number of bytes failed at least once to replicate. - -.. metric:: minio_bucket_replication_pending_bytes - - Total bytes pending to replicate. - -.. metric:: minio_bucket_replication_received_bytes - - Total number of bytes replicated to this bucket from another source bucket. - -.. metric:: minio_bucket_replication_sent_bytes - - Total number of bytes replicated to the target bucket. - -.. metric:: minio_bucket_replication_pending_count - - Total number of replication operations pending for this bucket. - -.. metric:: minio_bucket_replication_failed_count - - Total number of replication operations failed for this bucket. - -Bucket Metrics -++++++++++++++ - -.. metric:: minio_bucket_usage_object_total - - Total number of objects - -.. metric:: minio_bucket_usage_total_bytes - - Total bucket size in bytes - -Cache Metrics -+++++++++++++ - -.. metric:: minio_cache_hits_total - - Total number of disk cache hits - -.. metric:: minio_cache_missed_total - - Total number of disk cache misses - -.. metric:: minio_cache_sent_bytes - - Total number of bytes served from cache - -.. metric:: minio_cache_total_bytes - - Total size of cache disk in bytes - -.. metric:: minio_cache_usage_info - - Total percentage cache usage, value of 1 indicates high and 0 low, label - level is set as well - -.. metric:: minio_cache_used_bytes - - Current cache usage in bytes - -Cluster Metrics -+++++++++++++++ - -.. metric:: minio_cluster_capacity_raw_free_bytes - - Total free capacity online in the cluster. - -.. metric:: minio_cluster_capacity_raw_total_bytes - - Total capacity online in the cluster. - -.. metric:: minio_cluster_capacity_usable_free_bytes - - Total free usable capacity online in the cluster. - -.. metric:: minio_cluster_capacity_usable_total_bytes - - Total usable capacity online in the cluster. - -Node Metrics -++++++++++++ - -.. metric:: minio_cluster_nodes_offline_total - - Total number of MinIO nodes offline. - -.. metric:: minio_cluster_nodes_online_total - - Total number of MinIO nodes online. - -.. metric:: minio_heal_objects_error_total - - Objects for which healing failed in current self healing run - -.. metric:: minio_heal_objects_heal_total - - Objects healed in current self healing run - -.. metric:: minio_heal_objects_total - - Objects scanned in current self healing run - -.. metric:: minio_heal_time_last_activity_nano_seconds - - Time elapsed (in nano seconds) since last self healing activity. This is set - to -1 until initial self heal - -.. metric:: minio_inter_node_traffic_received_bytes - - Total number of bytes received from other peer nodes. - -.. metric:: minio_inter_node_traffic_sent_bytes - - Total number of bytes sent to the other peer nodes. - -.. metric:: minio_node_disk_free_bytes - - Total storage available on a disk. - -.. metric:: minio_node_disk_total_bytes - - Total storage on a disk. - -.. metric:: minio_node_disk_used_bytes - - Total storage used on a disk. - -.. 
metric:: minio_node_file_descriptor_limit_total - - Limit on total number of open file descriptors for the MinIO Server process. - -.. metric:: minio_node_file_descriptor_open_total - - Total number of open file descriptors by the MinIO Server process. - -.. metric:: minio_node_io_rchar_bytes - - Total bytes read by the process from the underlying storage system including - cache, ``/proc/[pid]/io rchar`` - -.. metric:: minio_node_io_read_bytes - - Total bytes read by the process from the underlying storage system, - ``/proc/[pid]/io read_bytes`` - -.. metric:: minio_node_io_wchar_bytes - - Total bytes written by the process to the underlying storage system including - page cache, ``/proc/[pid]/io wchar`` - -.. metric:: minio_node_io_write_bytes - - Total bytes written by the process to the underlying storage system, - ``/proc/[pid]/io write_bytes`` - -.. metric:: minio_node_process_starttime_seconds - - Start time for MinIO process per node, time in seconds since Unix epoch. - -.. metric:: minio_node_process_uptime_seconds - - Uptime for MinIO process per node in seconds. - -.. metric:: minio_node_scanner_bucket_scans_finished - - Total number of bucket scans finished since server start. - -.. metric:: minio_node_scanner_bucket_scans_started - - Total number of bucket scans started since server start. - -.. metric:: minio_node_scanner_directories_scanned - - Total number of directories scanned since server start. - -.. metric:: minio_node_scanner_objects_scanned - - Total number of unique objects scanned since server start. - -.. metric:: minio_node_scanner_versions_scanned - - Total number of object versions scanned since server start. - -.. metric:: minio_node_syscall_read_total - - Total read SysCalls to the kernel. ``/proc/[pid]/io syscr`` - -.. metric:: minio_node_syscall_write_total - - Total write SysCalls to the kernel. ``/proc/[pid]/io syscw`` - -S3 Metrics -++++++++++ - -.. metric:: minio_s3_requests_error_total - - Total number S3 requests with errors - -.. metric:: minio_s3_requests_inflight_total - - Total number of S3 requests currently in flight - -.. metric:: minio_s3_requests_total - - Total number S3 requests - -.. metric:: minio_s3_time_ttbf_seconds_distribution - - Distribution of the time to first byte across API calls. - -.. metric:: minio_s3_traffic_received_bytes - - Total number of s3 bytes received. - -.. metric:: minio_s3_traffic_sent_bytes - - Total number of s3 bytes sent - -Software Metrics -++++++++++++++++ - -.. metric:: minio_software_commit_info - - Git commit hash for the MinIO release. - -.. metric:: minio_software_version_info - - MinIO Release tag for the server - -.. _minio-metrics-and-alerts-alerting: - -Alerts ------- - -You can configure alerts using Prometheus :prometheus-docs:`Alerting Rules -` based on the collected MinIO -metrics. The Prometheus :prometheus-docs:`Alert Manager -` supports managing alerts produced by the configured -alerting rules. Prometheus also supports a :prometheus-docs:`Webhook Receiver -` for publishing alerts -to mechanisms not supported by Prometheus AlertManager. \ No newline at end of file +Restart the MinIO deployment and visit the :ref:`Monitoring ` pane to see the historical data views. diff --git a/source/operations/monitoring/metrics-and-alerts.rst b/source/operations/monitoring/metrics-and-alerts.rst new file mode 100644 index 00000000..a0370c46 --- /dev/null +++ b/source/operations/monitoring/metrics-and-alerts.rst @@ -0,0 +1,377 @@ +.. _minio-metrics-and-alerts-endpoints: +.. 
_minio-metrics-and-alerts-alerting: +.. _minio-metrics-and-alerts: + +================== +Metrics and Alerts +================== + +.. default-domain:: minio + +.. contents:: Table of Contents + :local: + :depth: 2 + +MinIO publishes cluster and node metrics using the :prometheus-docs:`Prometheus Data Model `. +You can use any scraping tool to pull metrics data from MinIO for further analysis and alerting. + +MinIO provides a scraping endpoint for cluster-level metrics: + +.. code-block:: shell + :class: copyable + + http://minio.example.net:9000/minio/v2/metrics/cluster + +Replace ``http://minio.example.net`` with the hostname of any node in the MinIO deployment. +For deployments with a load balancer managing connections between MinIO nodes, specify the address of the load balancer. + +MinIO by default requires authentication for scraping the metrics endpoints. +Use the :mc-cmd:`mc admin prometheus generate` command to generate the necessary bearer tokens. +You can alternatively disable metrics endpoint authentication by setting :envvar:`MINIO_PROMETHEUS_AUTH_TYPE` to ``public``. + +.. _minio-console-metrics: + +MinIO Console Metrics Dashboard +------------------------------- + +The :ref:`MinIO Console ` provides a point-in-time metrics dashboard by default: + +.. image:: /images/minio-console/console-metrics-simple.png + :width: 600px + :alt: MinIO Console with Point-In-Time Metrics + :align: center + +The Console also supports displaying time-series and historical data by querying a :prometheus-docs:`Prometheus ` service configured to scrape data from the MinIO deployment. +Specifically, the MinIO Console uses :prometheus-docs:`Prometheus query API ` to retrieve stored metrics data and display the following visualizations: + +- :guilabel:`Usage` - provides historical and on-demand visualization of overall usage and status +- :guilabel:`Traffic` - provides historical and on-demand visualization of network traffic +- :guilabel:`Resources` - provides historical and on-demand visualization of resources (compute and storage) +- :guilabel:`Info` - provides point-in-time status of the deployment + +.. image:: /images/minio-console/console-metrics.png + :width: 600px + :alt: MinIO Console displaying Prometheus-backed Monitoring Data + :align: center + +.. cond:: k8s + + The MinIO Operator supports deploying a per-tenant Prometheus instance configured to support metrics and visualization. + + If you deploy the Tenant with this feature disabled *but* still want the historical metric views, you can instead configure an external Prometheus service to scrape the Tenant metrics. + Once configured, you can update the Tenant to query that Prometheus service to retrieve metric data: + +.. cond:: linux or container or macos or windows + + To enable historical data visualization in MinIO Console, set the following environment variables on each node in the MinIO deployment: + +- Set :envvar:`MINIO_PROMETHEUS_URL` to the URL of the Prometheus service +- Set :envvar:`MINIO_PROMETHEUS_JOB_ID` to the unique job ID assigned to the collected metrics + +MinIO also publishes a `Grafana Dashboard `_ for visualizing collected metrics. +For more complete documentation on configuring a Prometheus-compatible data source for Grafana, see :prometheus-docs:`Grafana Support for Prometheus `. + +.. _minio-metrics-and-alerts-available-metrics: + +Available Metrics +----------------- + +MinIO publishes the following metrics, where each metric includes a label for +the MinIO server which generated that metric. 
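+
+As an illustration of the exposition format returned by the scraping endpoint, the sample below is hypothetical: the server name, bucket name, and values are placeholders, and the exact label set varies by metric:
+
+.. code-block:: shell
+
+   # Cluster-level metrics carry a server label identifying the reporting node.
+   # HELP minio_cluster_nodes_online_total Total number of MinIO nodes online.
+   minio_cluster_nodes_online_total{server="minio-01.example.net:9000"} 4
+
+   # Bucket-level metrics additionally carry a bucket label.
+   minio_bucket_usage_total_bytes{bucket="test-bucket",server="minio-01.example.net:9000"} 5368709120
+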
+ +Object and Bucket Metrics +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. metric:: minio_bucket_objects_size_distribution + + Distribution of object sizes in a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_bucket_usage_object_total + + Total number of objects in a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_bucket_usage_total_bytes + + Total bucket size in bytes in a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +Replication Metrics +~~~~~~~~~~~~~~~~~~~ + +These metrics are only populated for MinIO clusters with +:ref:`minio-bucket-replication-serverside` enabled. + +.. metric:: minio_bucket_replication_failed_bytes + + Total number of bytes that failed at least once to replicate for a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label + +.. metric:: minio_bucket_replication_pending_bytes + + Total number of bytes pending to replicate for a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label + +.. metric:: minio_bucket_replication_received_bytes + + Total number of bytes replicated to this bucket from another source bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_bucket_replication_sent_bytes + + Total number of bytes replicated to the target bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_bucket_replication_pending_count + + Total number of replication operations pending for a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_bucket_replication_failed_count + + Total number of replication operations failed for a given bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +Capacity Metrics +~~~~~~~~~~~~~~~~ + +.. metric:: minio_cluster_capacity_raw_free_bytes + + Total free capacity online in the cluster. + +.. metric:: minio_cluster_capacity_raw_total_bytes + + Total capacity online in the cluster. + +.. metric:: minio_cluster_capacity_usable_free_bytes + + Total free usable capacity online in the cluster. + +.. metric:: minio_cluster_capacity_usable_total_bytes + + Total usable capacity online in the cluster. + +.. metric:: minio_node_disk_free_bytes + + Total storage available on a specific drive for a node in the MinIO deployment. + You can identify the drive and node using the ``{ disk="/path/to/disk",server="STRING"}`` labels respectively. + +.. metric:: minio_node_disk_total_bytes + + Total storage on a specific drive for a node in the MinIO deployment. + You can identify the drive and node using the ``{ disk="/path/to/disk",server="STRING"}`` labels respectively. + +.. metric:: minio_node_disk_used_bytes + + Total storage used on a specific drive for a node in a MinIO deployment. + You can identify the drive and node using the ``{ disk="/path/to/disk",server="STRING"}`` labels respectively. + +Lifecycle Management Metrics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. metric:: minio_cluster_ilm_transitioned_bytes + + Total number of bytes transitioned using :ref:`tiering/transition lifecycle management rules ` + + +.. metric:: minio_cluster_ilm_transitioned_objects + + Total number of objects transitioned using :ref:`tiering/transition lifecycle management rules ` + +.. 
metric:: minio_cluster_ilm_transitioned_versions + + Total number of non-current object versions transitioned using :ref:`tiering/transition lifecycle management rules ` + +.. metric:: minio_node_ilm_transition_pending_tasks + + Total number of pending :ref:`object transition ` tasks + +.. metric:: minio_node_ilm_expiry_pending_tasks + + Total number of pending :ref:`object expiration ` tasks + +.. metric:: minio_node_ilm_expiry_active_tasks + + Total number of active :ref:`object expiration ` tasks + +Node and Disk Health Metrics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. metric:: minio_cluster_disk_online_total + + The total number of disks online + +.. metric:: minio_cluster_disk_offline_total + + The total number of disks offline + +.. metric:: minio_cluster_disk_total + + The total number of disks + +.. metric:: minio_cluster_nodes_offline_total + + Total number of MinIO nodes offline. + +.. metric:: minio_cluster_nodes_online_total + + Total number of MinIO nodes online. + +.. metric:: minio_heal_objects_error_total + + Objects for which healing failed in current self healing run + +.. metric:: minio_heal_objects_heal_total + + Objects healed in current self healing run + +.. metric:: minio_heal_objects_total + + Objects scanned in current self healing run + +.. metric:: minio_heal_time_last_activity_nano_seconds + + Time elapsed (in nano seconds) since last self healing activity. This is set + to -1 until initial self heal + +Scanner Metrics +~~~~~~~~~~~~~~~ + +.. metric:: minio_node_scanner_bucket_scans_finished + + Total number of bucket scans finished since server start. + +.. metric:: minio_node_scanner_bucket_scans_started + + Total number of bucket scans started since server start. + +.. metric:: minio_node_scanner_directories_scanned + + Total number of directories scanned since server start. + +.. metric:: minio_node_scanner_objects_scanned + + Total number of unique objects scanned since server start. + +.. metric:: minio_node_scanner_versions_scanned + + Total number of object versions scanned since server start. + +.. metric:: minio_node_syscall_read_total + + Total number of read SysCalls to the kernel. ``/proc/[pid]/io syscr`` + +.. metric:: minio_node_syscall_write_total + + Total number of write SysCalls to the kernel. ``/proc/[pid]/io syscw`` + +S3 Metrics +~~~~~~~~~~ + +.. metric:: minio_bucket_traffic_sent_bytes + + Total number of bytes of S3 traffic sent per bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_bucket_traffic_received_bytes + + Total number of bytes of S3 traffic received per bucket. + You can identify the bucket using the ``{ bucket="STRING" }`` label. + +.. metric:: minio_s3_requests_inflight_total + + Total number of S3 requests currently in flight. + +.. metric:: minio_s3_requests_total + + Total number of S3 requests. + +.. metric:: minio_s3_time_ttfb_seconds_distribution + + Distribution of the time to first byte across API calls. + +.. metric:: minio_s3_traffic_received_bytes + + Total number of S3 bytes received. + +.. metric:: minio_s3_traffic_sent_bytes + + Total number of S3 bytes sent. + +.. metric:: minio_s3_requests_errors_total + + Total number of S3 requests with 4xx and 5xx errors. + +.. metric:: minio_s3_requests_4xx_errors_total + + Total number of S3 requests with 4xx errors. + +.. metric:: minio_s3_requests_5xx_errors_total + + Total number of S3 requests with 5xx errors. + +Internal Metrics +~~~~~~~~~~~~~~~~ + +.. 
metric:: minio_inter_node_traffic_received_bytes + + Total number of bytes received from other peer nodes. + +.. metric:: minio_inter_node_traffic_sent_bytes + + Total number of bytes sent to the other peer nodes. + +.. metric:: minio_node_file_descriptor_limit_total + + Limit on total number of open file descriptors for the MinIO Server process. + +.. metric:: minio_node_file_descriptor_open_total + + Total number of open file descriptors by the MinIO Server process. + +.. metric:: minio_node_io_rchar_bytes + + Total bytes read by the process from the underlying storage system including + cache, ``/proc/[pid]/io rchar`` + +.. metric:: minio_node_io_read_bytes + + Total bytes read by the process from the underlying storage system, + ``/proc/[pid]/io read_bytes`` + +.. metric:: minio_node_io_wchar_bytes + + Total bytes written by the process to the underlying storage system including + page cache, ``/proc/[pid]/io wchar`` + +.. metric:: minio_node_io_write_bytes + + Total bytes written by the process to the underlying storage system, + ``/proc/[pid]/io write_bytes`` + +Software and Process Metrics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. metric:: minio_software_commit_info + + Git commit hash for the MinIO release. + +.. metric:: minio_software_version_info + + MinIO Release tag for the server + +.. metric:: minio_node_process_starttime_seconds + + Start time for MinIO process per node, time in seconds since Unix epoch. + +.. metric:: minio_node_process_uptime_seconds + + Uptime for MinIO process per node in seconds. + +.. toctree:: + :titlesonly: + :hidden: + + /operations/monitoring/collect-minio-metrics-using-prometheus + /operations/monitoring/monitor-and-alert-using-influxdb \ No newline at end of file diff --git a/source/operations/monitoring/monitor-and-alert-using-influxdb.rst b/source/operations/monitoring/monitor-and-alert-using-influxdb.rst new file mode 100644 index 00000000..5a5e27c0 --- /dev/null +++ b/source/operations/monitoring/monitor-and-alert-using-influxdb.rst @@ -0,0 +1,121 @@ +.. _minio-metrics-influxdb: + +====================================== +Monitoring and Alerting using InfluxDB +====================================== + +.. default-domain:: minio + +.. contents:: Table of Contents + :local: + :depth: 1 + +MinIO publishes cluster and node metrics using the :prometheus-docs:`Prometheus Data Model `. +`InfluxDB `__ supports scraping MinIO metrics data for monitoring and alerting. + +The procedure on this page documents the following: + +- Configuring an InfluxDB service to scrape and display metrics from a MinIO deployment +- Configuring an Alert on a MinIO metric + +.. admonition:: Prerequisites + :class: note + + This procedure requires the following: + + - An existing InfluxDB deployment configured with one or more :influxdb-docs:`notification endpoints ` + - An existing MinIO deployment with network access to the InfluxDB deployment + - An :mc:`mc` installation on your local host configured to :ref:`access ` the MinIO deployment + +.. cond:: k8s + + This procedure assumes that all necessary network control components, such as Ingress or Load Balancers, are in place to facilitate access between the MinIO Tenant and the InfluxDB service. + +Configure InfluxDB to Collect and Alert using MinIO Metrics +----------------------------------------------------------- + +.. important:: + + This procedure specifically uses the InfluxDB UI to create a scraping endpoint.
+ + The InfluxDB UI does not provide the same level of configuration as using `Telegraf `__ and the corresponding `Prometheus plugin `__. + Specifically: + + - You cannot enable authenticated access to the MinIO metrics endpoint via the InfluxDB UI + - You cannot set a tag for collected metrics (e.g. ``url_tag``) for uniquely identifying the metrics for a given MinIO deployment + + .. cond:: k8s + + The Telegraf Prometheus plugin also supports Kubernetes-specific features, such as scraping the ``minio`` service for a given MinIO Tenant. + + Configuring Telegraf is out of scope for this procedure. + You can use this procedure as general guidance for configuring Telegraf to scrape MinIO metrics. + +.. container:: procedure + + 1. Configure Public Access to MinIO Metrics + + Set the :envvar:`MINIO_PROMETHEUS_AUTH_TYPE` environment variable to ``"public"`` for all nodes in the MinIO deployment. + You can then restart the deployment to allow public access to MinIO metrics. + + You can validate the change by attempting to ``curl`` the metrics endpoint: + + .. code-block:: shell + :class: copyable + + curl https://HOSTNAME/minio/v2/metrics/cluster + + Replace ``HOSTNAME`` with the URL of the load balancer or reverse proxy through which you access the MinIO deployment. + You can alternatively specify any single node as ``HOSTNAME:PORT``, specifying the MinIO server API port in addition to the node hostname. + + The response body should include a list of collected MinIO metrics. + + #. Log into the InfluxDB UI and Create a Bucket + + Select the :influxdb-docs:`Organization ` under which you want to store MinIO metrics. + + Create a :influxdb-docs:`New Bucket ` in which to store metrics for the MinIO deployment. + + #. Create a new Scraping Source + + Create a :influxdb-docs:`new InfluxDB Scraper `. + + Specify the full URL to the MinIO deployment, including the metrics endpoint: + + .. code-block:: shell + :class: copyable + + https://HOSTNAME/minio/v2/metrics/cluster + + Replace ``HOSTNAME`` with the URL of the load balancer or reverse proxy through which you access the MinIO deployment. + You can alternatively specify any single node as ``HOSTNAME:PORT``, specifying the MinIO server API port in addition to the node hostname. + + #. Validate the Data + + Use the :influxdb-docs:`DataExplorer ` to visualize the collected MinIO data. + + For example, you can set a filter on :metric:`minio_cluster_capacity_usable_total_bytes` and :metric:`minio_cluster_capacity_usable_free_bytes` to compare the total usable against total free space on the MinIO deployment. + + #. Configure a Check + + Create a :influxdb-docs:`new Check ` on a MinIO metric. + + The following example check rules provide a baseline of alerts for a MinIO deployment. + You can modify or otherwise use these examples for guidance in building your own checks. + + - Create a :guilabel:`Threshold Check` named ``MINIO_NODE_DOWN``. + + Set the filter for the :metric:`minio_cluster_nodes_offline_total` key. + + Set the :guilabel:`Thresholds` to :guilabel:`WARN` when the value is greater than :guilabel:`1` + + - Create a :guilabel:`Threshold Check` named ``MINIO_QUORUM_WARNING``. + + Set the filter for the :metric:`minio_cluster_disk_offline_total` key. + + Set the :guilabel:`Thresholds` to :guilabel:`CRITICAL` when the value is one less than your configured :ref:`Erasure Code Parity ` setting. + + For example, a deployment using EC:4 should set this value to ``3``. 
+ + Configure your :influxdb-docs:`Notification endpoints ` and :influxdb-docs:`Notification rules ` such that checks of each type trigger an appropriate response. +
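+
+   As an optional spot check outside the UI following the :guilabel:`Validate the Data` step, you can query the bucket from the ``influx`` CLI.
+   The sketch below is illustrative only: the organization, bucket name, and field mapping are placeholders, and the way the scraper maps Prometheus metrics into measurements and fields can vary, so adjust the filters to match what you see in the Data Explorer.
+
+   .. code-block:: shell
+      :class: copyable
+
+      # Placeholders: substitute your own organization and bucket names.
+      # InfluxDB scrapers typically store the Prometheus metric name in the _field column.
+      influx query --org example-org '
+        from(bucket: "minio-metrics")
+          |> range(start: -1h)
+          |> filter(fn: (r) => r._field == "minio_cluster_capacity_usable_free_bytes")
+          |> last()'
+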