Docs Multiplatform Slice

2025-07-31 18:04:52 +03:00 · 2022-05-06 16:44:42 -04:00
parent df33ddee6a
commit b99c20a16f
134 changed files with 3689 additions and 2200 deletions
--- a/source/operations/data-recovery/recover-after-drive-failure.rst
+++ b/source/operations/data-recovery/recover-after-drive-failure.rst
@ -0,0 +1,122 @@
+.. _minio-restore-hardware-failure-drive:
+
+======================
+Drive Failure Recovery
+======================
+
+.. default-domain:: minio
+
+.. contents:: Table of Contents
+   :local:
+   :depth: 1
+
+MinIO supports hot-swapping failed drives with new healthy drives. MinIO detects
+and heals those drives without requiring any node or deployment-level restart.
+MinIO healing occurs only on the replaced drive(s) and does not typically impact
+deployment performance.
+
+MinIO healing ensures consistency and correctness of all data restored onto the
+drive. **Do not** attempt to manually recover or migrate data from the failed
+drive onto the new healthy drive.
+
+The following steps provide a more detailed walkthrough of drive replacement.
+These steps assume a MinIO deployment where each node manages drives using
+``/etc/fstab`` with per-drive labels as per the
+:ref:`documented prerequisites <minio-installation>`.
+
+1) Unmount the failed drive(s)
+------------------------------
+
+Unmount each failed drive using ``umount``. For example, the following
+command unmounts the drive at ``/dev/sdb``:
+
+.. code-block:: shell
+
+   umount /dev/sdb
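+
+If it is unclear which device backs a failed drive's mount point, a listing
+such as the following can help confirm it before unmounting. This is a minimal
+sketch: the column selection is illustrative, and ``/dev/sdb`` is the example
+device used above.
+
+.. code-block:: shell
+
+   # List block devices with their filesystem type, label, and mount point
+   lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT
+
+   # Show the mount entry, if any, whose source is the example device
+   findmnt --source /dev/sdb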
+
+2) Replace the failed drive(s)
+------------------------------
+
+Remove the failed drive(s) from the node hardware and replace them with known
+healthy drive(s). Replacement drives *must* meet the following requirements:
+
+- :ref:`XFS formatted <deploy-minio-distributed-prereqs-storage>` and empty.
+- Same drive type (e.g. HDD, SSD, NVMe).
+- Equal or greater performance.
+- Equal or greater capacity.
+
+Using a replacement drive with greater capacity does not increase the total
+cluster storage. MinIO uses the *smallest* drive's capacity as the ceiling for
+all drives in the :ref:`Server Pool <minio-intro-server-pool>`.
+
+The following command formats the replacement drive as XFS and assigns it the
+same label as the failed drive:
+
+.. code-block:: shell
+
+   mkfs.xfs -L DISK1 /dev/sdb
+
+MinIO **strongly recommends** using label-based mounting to ensure consistent
+drive order that persists through system restarts.
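+
+To confirm that the label was applied, one option is to read it back from the
+device. This sketch assumes the example device ``/dev/sdb`` and the label
+``DISK1`` used above.
+
+.. code-block:: shell
+
+   # Show the filesystem type, label, and UUID of the new drive
+   lsblk -f /dev/sdb
+
+   # Print only the filesystem label of the new drive
+   blkid -s LABEL -o value /dev/sdb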
+
+3) Review and Update ``fstab``
+------------------------------
+
+Review the ``/etc/fstab`` file and update as needed such that the entry for
+the failed disk points to the newly formatted replacement.
+
+- If using label-based disk assignment, ensure that each label points to the
+  correct newly formatted disk.
+
+- If using UUID-based disk assignment, update the UUID for each mount point to
+  match the newly formatted disk. You can use ``lsblk -f`` to view disk UUIDs.
+
+For example, consider the following ``/etc/fstab`` file:
+
+.. code-block:: shell
+
+   $ cat /etc/fstab
+
+     # <file system>  <mount point>  <type>  <options>         <dump>  <pass>
+     LABEL=DISK1      /mnt/disk1     xfs     defaults,noatime  0       2
+     LABEL=DISK2      /mnt/disk2     xfs     defaults,noatime  0       2
+     LABEL=DISK3      /mnt/disk3     xfs     defaults,noatime  0       2
+     LABEL=DISK4      /mnt/disk4     xfs     defaults,noatime  0       2
+
+Given the ``mkfs.xfs`` command in the previous step, no changes are required
+to ``fstab``, since the replacement disk mounted at ``/mnt/disk1`` uses the
+same label ``DISK1`` as the failed disk.
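+
+For UUID-based assignment, the update might look like the following sketch.
+The device ``/dev/sdb`` and mount point ``/mnt/disk1`` follow the examples
+above. The UUID is left as a placeholder, since it differs for every newly
+formatted drive.
+
+.. code-block:: shell
+
+   # Print the UUID of the newly formatted drive
+   blkid -s UUID -o value /dev/sdb
+
+   # Corresponding /etc/fstab entry, with <new-uuid> replaced by the value
+   # printed above:
+   # UUID=<new-uuid>  /mnt/disk1  xfs  defaults,noatime  0  2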
+
+4) Remount the Replaced Drive(s)
+--------------------------------
+
+Use ``mount -a`` to remount the drives unmounted at the beginning of this
+procedure:
+
+.. code-block:: shell
+   :class: copyable
+
+   mount -a
+
+The command should remount all of the replaced drives.
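+
+One way to confirm the result, assuming the example mount point
+``/mnt/disk1``, is to query the mount table directly:
+
+.. code-block:: shell
+
+   # Show the source device, filesystem type, and options for the mount point
+   findmnt /mnt/disk1
+
+   # Report capacity and usage for the mounted filesystem
+   df -h /mnt/disk1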
+
+5) Monitor MinIO for Drive Detection and Healing Status
+-------------------------------------------------------
+
+Use the :mc-cmd:`mc admin console` command *or*, for ``systemd``-managed
+installations, ``journalctl -u minio`` to monitor the server log output after
+remounting drives. The output should include messages identifying each formatted
+and empty drive.
+
+Use :mc-cmd:`mc admin heal` to monitor the overall healing status on the
+deployment. MinIO aggressively heals replaced drive(s) to ensure rapid recovery
+from the degraded state.
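+
+As a sketch, the monitoring commands might be run as follows. The alias
+``myminio`` is hypothetical; substitute the alias configured for the
+deployment. The first two commands stream output until interrupted, so run
+each in its own terminal.
+
+.. code-block:: shell
+
+   # Follow the MinIO service log on a systemd-managed node
+   journalctl -u minio -f
+
+   # Stream server logs through the MinIO Client (hypothetical alias)
+   mc admin console myminio
+
+   # Check healing status for the deployment (hypothetical alias)
+   mc admin heal myminio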
+
+6) Next Steps
+-------------
+
+Monitor the cluster for any further drive failures. Some drive batches may fail
+in close proximity to each other. Deployments seeing higher than expected drive
+failure rates should schedule dedicated maintenance around replacing the known
+bad batch. Consider using `MinIO SUBNET <https://min.io/pricing?jmp=docs>`__ to
+coordinate with MinIO engineering around guidance for any such operations.
--- a/source/operations/data-recovery/recover-after-node-failure.rst
+++ b/source/operations/data-recovery/recover-after-node-failure.rst
@ -0,0 +1,90 @@
+.. _minio-restore-hardware-failure-node:
+
+=====================
+Node Failure Recovery
+=====================
+
+.. default-domain:: minio
+
+.. contents:: Table of Contents
+   :local:
+   :depth: 1
+
+If a MinIO node suffers complete hardware failure (e.g. loss of all drives,
+data, etc.), the node begins healing operations once it rejoins the deployment.
+MinIO healing occurs only on the replaced hardware and does not typically impact
+deployment performance.
+
+MinIO healing ensures consistency and correctness of all data restored onto the
+node. **Do not** attempt to manually recover or migrate data from the failed
+node onto the new healthy node.
+
+The replacement node hardware should be substantially similar to the failed
+node. There are no negative performance implications to using improved hardware.
+
+The replacement drive hardware should be substantially similar to the failed
+drive. For example, replace a failed SSD with another SSD of the same
+capacity. While you can use drives with larger capacity, MinIO uses the
+*smallest* drive's capacity as the ceiling for all drives in the 
+:ref:`Server Pool <minio-intro-server-pool>`.
+
+The following steps provide a more detailed walkthrough of node replacement.
+These steps assume a MinIO deployment where each node has a DNS hostname 
+as per the :ref:`documented prerequisites <minio-installation>`.
+
+1) Start the Replacement Node
+-----------------------------
+
+Ensure the new node has received all necessary security, firmware, and OS
+updates as per industry, regulatory, or organizational standards and
+requirements.
+
+The new node software configuration *must* match that of the other nodes in the
+deployment, including but not limited to the OS and Kernel versions and
+configurations. Heterogeneous software configurations may result in unexpected
+or undesired behavior in the deployment.
+
+2) Update Hostname for the New Node
+-----------------------------------
+
+*Optional.* This step is only required if the replacement node has a
+different IP address from the failed host.
+
+Ensure the hostname associated with the failed node now resolves to the new node.
+
+For example, if ``https://minio-1.example.net`` previously resolved to the
+failed host, it should now resolve to the new host.
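+
+A quick way to check resolution from another node in the deployment is shown
+below. This sketch uses the example hostname from above and assumes the
+system resolver is the source of truth for the deployment.
+
+.. code-block:: shell
+
+   # Look up the address that the hostname currently resolves to
+   getent hosts minio-1.example.net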
+
+3) Download and Prepare the MinIO Server
+----------------------------------------
+
+Follow the :ref:`deployment procedure <minio-installation>` to download
+and run the MinIO server using the same configuration as all other nodes
+in the deployment.
+
+- The MinIO server version *must* match across all nodes.
+- The MinIO service and environment file configurations *must* match across
+  all nodes.
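+
+One way to compare nodes is to run the same checks on the new node and on an
+existing healthy node, then compare the output. This is a sketch only: the
+environment file path ``/etc/default/minio`` is an assumption, so substitute
+the path used by the deployment.
+
+.. code-block:: shell
+
+   # Print the installed MinIO server version
+   minio --version
+
+   # Print a checksum of the environment file (assumed path) for comparison
+   sha256sum /etc/default/minio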
+
+4) Rejoin the node to the deployment
+------------------------------------
+
+Start the MinIO server process on the node and monitor the process output
+using :mc-cmd:`mc admin console` or by monitoring the MinIO service logs
+using ``journalctl -u minio`` for ``systemd``-managed installations.
+
+The server output should indicate that it has detected the other nodes
+in the deployment and begun healing operations.
+
+Use :mc-cmd:`mc admin heal` to monitor overall healing status on the
+deployment. MinIO aggressively heals the node to ensure rapid recovery
+from the degraded state.
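+
+On a ``systemd``-managed installation, starting and watching the service
+might look like the following sketch. It assumes the unit is named ``minio``,
+matching the ``journalctl`` invocation above.
+
+.. code-block:: shell
+
+   # Start the MinIO service on the replacement node
+   sudo systemctl start minio.service
+
+   # Follow the service log to watch for node detection and healing messages
+   journalctl -u minio -f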
+
+5) Next Steps
+-------------
+
+Continue monitoring the deployment until healing completes. Deployments with
+persistent and repeated node failures should schedule dedicated maintenance to
+identify the root cause. Consider using
+`MinIO SUBNET <https://min.io/pricing?jmp=docs>`__ to coordinate with MinIO
+engineering around guidance for any such operations.
--- a/source/operations/data-recovery/recover-after-site-failure.rst
+++ b/source/operations/data-recovery/recover-after-site-failure.rst
@ -0,0 +1,13 @@
+.. _minio-restore-hardware-failure-site:
+
+==========================
+Recover after Site Failure
+==========================
+
+.. default-domain:: minio
+
+.. contents:: Table of Contents
+   :local:
+   :depth: 1
+
+ToDo- site replication recovery procedure