mirror of
https://github.com/minio/docs.git
synced 2025-07-31 18:04:52 +03:00
Docs Multiplatform Slice
122
source/operations/data-recovery/recover-after-drive-failure.rst
Normal file
@ -0,0 +1,122 @@
.. _minio-restore-hardware-failure-drive:

======================
Drive Failure Recovery
======================

.. default-domain:: minio

.. contents:: Table of Contents
   :local:
   :depth: 1

MinIO supports hot-swapping failed drives with new healthy drives. MinIO detects
and heals the replacement drives without requiring a node-level or
deployment-level restart. Healing occurs only on the replaced drive(s) and does
not typically impact deployment performance.

MinIO healing ensures consistency and correctness of all data restored onto the
drive. **Do not** attempt to manually recover or migrate data from the failed
drive onto the new healthy drive.

The following steps walk through drive replacement in more detail. These steps
assume a MinIO deployment where each node manages drives using ``/etc/fstab``
with per-drive labels, as described in the
:ref:`documented prerequisites <minio-installation>`.

1) Unmount the Failed Drive(s)
------------------------------

Unmount each failed drive using ``umount``. For example, the following
command unmounts the drive at ``/dev/sdb``:

.. code-block:: shell

   umount /dev/sdb

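Before unmounting, it can help to confirm that the device is actually mounted.
The following is a minimal sketch and not part of the MinIO tooling. The
function name ``is_mounted`` is hypothetical, and the sketch assumes a mount
table in the ``/proc/mounts`` format, where the first field of each line is the
device path. It matches the device path exactly, so a partition such as
``/dev/sdb1`` does not count as ``/dev/sdb``.

```shell
# Hypothetical helper: succeed if DEVICE ($1) appears as the first field of a
# mount table in /proc/mounts format ($2, defaulting to /proc/mounts itself).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Usage (requires root; shown as a comment so the sketch has no side effects):
#   is_mounted /dev/sdb && umount /dev/sdb
```
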
2) Replace the Failed Drive(s)
------------------------------

Remove the failed drive(s) from the node hardware and replace them with known
healthy drive(s). Replacement drives *must* meet the following requirements:

- :ref:`XFS formatted <deploy-minio-distributed-prereqs-storage>` and empty.
- Same drive type (e.g. HDD, SSD, NVMe).
- Equal or greater performance.
- Equal or greater capacity.

Using a replacement drive with greater capacity does not increase the total
cluster storage. MinIO uses the *smallest* drive's capacity as the ceiling for
all drives in the :ref:`Server Pool <minio-intro-server-pool>`.

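As a rough illustration of the capacity ceiling, the sketch below multiplies
the smallest drive size by the number of drives. The function name
``pool_raw_capacity`` and the sample sizes (in TiB) are hypothetical, and the
result is raw capacity only: it ignores erasure-coding parity overhead.

```shell
# Hypothetical helper: print (smallest drive size) x (number of drives).
# Arguments are whole-number drive sizes in the same unit, for example TiB.
pool_raw_capacity() {
    min=$1
    for size in "$@"; do
        if [ "$size" -lt "$min" ]; then
            min=$size
        fi
    done
    echo $(( min * $# ))
}

# Three 4 TiB drives plus one 8 TiB replacement still count as 4 x 4 TiB.
pool_raw_capacity 4 4 4 8   # prints 16
```
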
The following command formats a drive as XFS and assigns it a label that
matches the failed drive's label:

.. code-block:: shell

   mkfs.xfs -L DISK1 /dev/sdb

MinIO **strongly recommends** using label-based mounting to ensure consistent
drive order that persists through system restarts.

3) Review and Update ``fstab``
------------------------------

Review the ``/etc/fstab`` file and update it as needed so that the entry for
the failed disk points to the newly formatted replacement.

- If using label-based disk assignment, ensure that each label points to the
  correct newly formatted disk.

- If using UUID-based disk assignment, update the UUID for each mount point
  based on the newly formatted disk. You can use ``lsblk -f`` to view disk
  UUIDs.

For example, consider the following ``/etc/fstab`` file:

.. code-block:: shell

   $ cat /etc/fstab

   # <file system>  <mount point>  <type>  <options>         <dump>  <pass>
   LABEL=DISK1      /mnt/disk1     xfs     defaults,noatime  0       2
   LABEL=DISK2      /mnt/disk2     xfs     defaults,noatime  0       2
   LABEL=DISK3      /mnt/disk3     xfs     defaults,noatime  0       2
   LABEL=DISK4      /mnt/disk4     xfs     defaults,noatime  0       2

Given the previous example command, no changes are required to
``fstab`` since the replacement disk mounted at ``/mnt/disk1`` uses the same
label ``DISK1`` as the failed disk.

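One way to double-check a label-based entry is to look up the mount point that
``fstab`` assigns to a given label. This is a minimal sketch under stated
assumptions: the function name ``fstab_mount_for_label`` is hypothetical, and
it only handles entries written in the ``LABEL=<name>`` form.

```shell
# Hypothetical helper: print the mount point that an fstab-format file ($2,
# defaulting to /etc/fstab) assigns to LABEL=<label> ($1).
# Fails if no uncommented entry for that label exists.
fstab_mount_for_label() {
    awk -v want="LABEL=$1" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        $1 == want { print $2; found = 1 }   # second field is the mount point
        END { exit !found }
    ' "${2:-/etc/fstab}"
}

# Usage: fstab_mount_for_label DISK1
```

With the example file above, the lookup for ``DISK1`` would print
``/mnt/disk1``.
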
4) Remount the Replaced Drive(s)
--------------------------------

Use ``mount -a`` to remount the drives unmounted at the beginning of this
procedure:

.. code-block:: shell
   :class: copyable

   mount -a

The command should remount all of the replaced drives.

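To confirm that every drive came back, the ``fstab`` entries can be compared
against the live mount table. The sketch below is illustrative only: the
function name ``missing_xfs_mounts`` is hypothetical, and it assumes an
``fstab``-format file and a mount table in the ``/proc/mounts`` format.

```shell
# Hypothetical helper: print every XFS mount point listed in an fstab-format
# file ($1, default /etc/fstab) that does not appear in a /proc/mounts-format
# mount table ($2, default /proc/mounts). Prints nothing if all are mounted.
missing_xfs_mounts() {
    awk '
        FILENAME == ARGV[1] { mounted[$2] = 1; next }   # first file: mount table
        /^[[:space:]]*#/ || NF < 3 { next }             # skip comments and blanks
        $3 == "xfs" && !($2 in mounted) { print $2 }
    ' "${2:-/proc/mounts}" "${1:-/etc/fstab}"
}

# Usage: missing_xfs_mounts
```
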
5) Monitor MinIO for Drive Detection and Healing Status
-------------------------------------------------------

Use the :mc-cmd:`mc admin console` command *or*, for ``systemd``-managed
installations, ``journalctl -u minio`` to monitor the server log output after
remounting drives. The output should include messages identifying each formatted
and empty drive.

Use :mc-cmd:`mc admin heal` to monitor the overall healing status on the
deployment. MinIO aggressively heals replaced drive(s) to ensure rapid recovery
from the degraded state.

6) Next Steps
-------------

Monitor the cluster for any further drive failures. Drives from the same batch
may fail within a short time of each other. Deployments seeing higher than
expected drive failure rates should schedule dedicated maintenance to replace
the known bad batch. Consider using
`MinIO SUBNET <https://min.io/pricing?jmp=docs>`__ to coordinate with MinIO
engineering for guidance on any such operations.

@ -0,0 +1,90 @@
.. _minio-restore-hardware-failure-node:

=====================
Node Failure Recovery
=====================

.. default-domain:: minio

.. contents:: Table of Contents
   :local:
   :depth: 1

If a MinIO node suffers complete hardware failure (e.g. loss of all drives,
data, etc.), the node begins healing operations once it rejoins the deployment.
MinIO healing occurs only on the replaced hardware and does not typically impact
deployment performance.

MinIO healing ensures consistency and correctness of all data restored onto the
node. **Do not** attempt to manually recover or migrate data from the failed
node onto the new healthy node.

The replacement node hardware should be substantially similar to the failed
node. There are no negative performance implications to using improved hardware.

The replacement drive hardware should be substantially similar to the failed
drive. For example, replace a failed SSD with another SSD of the same
capacity. While you can use drives with larger capacity, MinIO uses the
*smallest* drive's capacity as the ceiling for all drives in the
:ref:`Server Pool <minio-intro-server-pool>`.

The following steps walk through node replacement in more detail. These steps
assume a MinIO deployment where each node has a DNS hostname, as described in
the :ref:`documented prerequisites <minio-installation>`.

1) Start the Replacement Node
-----------------------------

Ensure the new node has received all necessary security, firmware, and OS
updates as per industry, regulatory, or organizational standards and
requirements.

The new node software configuration *must* match that of the other nodes in the
deployment, including but not limited to the OS and kernel versions and
configurations. Heterogeneous software configurations may result in unexpected
or undesired behavior in the deployment.

2) Update Hostname for the New Node
-----------------------------------

*Optional.* This step is required only if the replacement node has a
different IP address from the failed host.

Ensure the hostname associated with the failed node now resolves to the new
node.

For example, if the hostname in ``https://minio-1.example.net`` previously
resolved to the failed host, it should now resolve to the new host.

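A quick way to check resolution is to compare the lookup result with the
address expected for the new node. The following is a sketch under stated
assumptions: the function name ``lookup_has_address`` is hypothetical, the
address ``192.0.2.10`` is a documentation-range placeholder, and the input is
assumed to follow the ``getent hosts`` output format of one
``ADDRESS NAME...`` entry per line.

```shell
# Hypothetical helper: read `getent hosts`-style lines ("ADDRESS NAME...")
# on standard input and succeed if any line starts with the expected
# address ($1).
lookup_has_address() {
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Usage, with a placeholder address standing in for the new node:
#   getent hosts minio-1.example.net | lookup_has_address 192.0.2.10
```
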
3) Download and Prepare the MinIO Server
----------------------------------------

Follow the :ref:`deployment procedure <minio-installation>` to download
and run the MinIO server using the same configuration as all other nodes
in the deployment.

- The MinIO server version *must* match across all nodes.
- The MinIO service and environment file configurations *must* match across
  all nodes.

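The matching requirement can be spot-checked by collecting one value per node,
such as the first line of ``minio --version`` or a checksum of the environment
file, and confirming that all values are identical. This is a minimal sketch:
the function name ``all_identical`` is hypothetical, and the hostnames and the
use of SSH in the usage comment are assumptions about the environment.

```shell
# Hypothetical helper: read one value per line on standard input and
# succeed only if there is at least one value and all values are identical.
all_identical() {
    [ "$(sort -u | wc -l)" -eq 1 ]
}

# Usage sketch (hostnames are placeholders; assumes SSH access to each node):
#   for host in minio-1.example.net minio-2.example.net; do
#       ssh "$host" minio --version | head -n 1
#   done | all_identical && echo "versions match"
```
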
4) Rejoin the Node to the Deployment
------------------------------------

Start the MinIO server process on the node and monitor the process output
using :mc-cmd:`mc admin console` or, for ``systemd``-managed installations,
by following the MinIO service logs with ``journalctl -u minio``.

The server output should indicate that it has detected the other nodes
in the deployment and begun healing operations.

Use :mc-cmd:`mc admin heal` to monitor the overall healing status on the
deployment. MinIO aggressively heals the node to ensure rapid recovery
from the degraded state.

5) Next Steps
-------------

Continue monitoring the deployment until healing completes. Deployments with
persistent and repeated node failures should schedule dedicated maintenance to
identify the root cause. Consider using
`MinIO SUBNET <https://min.io/pricing?jmp=docs>`__ to coordinate with MinIO
engineering for guidance on any such operations.

@ -0,0 +1,13 @@
.. _minio-restore-hardware-failure-site:

==========================
Recover after Site Failure
==========================

.. default-domain:: minio

.. contents:: Table of Contents
   :local:
   :depth: 1

ToDo: site replication recovery procedure