Improvements to replication conceptual docs (#885)
@@ -134,18 +134,6 @@ MinIO :ref:`site replication <minio-site-replication-overview>` provides support
A MinIO multi-site deployment with three peers.
Write operations on one peer replicate to all other peers in the configuration automatically.

Each peer site consists of an independent set of MinIO hosts, ideally with matching pool configurations.
The architecture of each peer site should closely match that of the others to ensure consistent performance and behavior between sites.
All peer sites must use the same primary identity provider, and during initial configuration only one peer site can have any data.

.. figure:: /images/architecture/architecture-multi-site-setup.svg
   :figwidth: 100%
   :alt: Diagram of a multi-site deployment during initial setup

   The initial setup of a MinIO multi-site deployment.
   The first peer site replicates all required information to other peers in the configuration.
   Adding new peers uses the same sequence for synchronizing data.
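
The following is a minimal sketch of that initial setup using the ``mc`` command line tool, assuming three hypothetical aliases ``site1``, ``site2``, and ``site3`` that each point at one peer deployment and share the same root credentials:

.. code-block:: shell

   # Register an alias for each peer deployment (hostnames are placeholders)
   mc alias set site1 https://minio1.example.net ROOTUSER ROOTPASSWORD
   mc alias set site2 https://minio2.example.net ROOTUSER ROOTPASSWORD
   mc alias set site3 https://minio3.example.net ROOTUSER ROOTPASSWORD

   # Add all three peers to one site replication configuration.
   # Only site1 may hold data at this point; it seeds the other peers.
   mc admin replicate add site1 site2 site3

   # Verify the resulting configuration
   mc admin replicate info site1
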
Replication performance primarily depends on the network latency between peer sites.
With geographically distributed peer sites, high latency between sites can result in significant replication lag.
This effect compounds with workloads that are near or at the deployment's overall performance capacity, as the replication process itself requires sufficient free :abbr:`I/O (Input / Output)` to synchronize objects.
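
One way to check for replication backlog between peers is the ``mc admin replicate status`` command, sketched here against a hypothetical ``site1`` alias:

.. code-block:: shell

   # Summarize site replication status for every peer in the configuration
   mc admin replicate status site1
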
@@ -162,27 +150,10 @@ Deploying a global load balancer or similar network appliance with support for s
.. figure:: /images/architecture/architecture-load-balancer-multi-site.svg
   :figwidth: 100%
   :alt: Diagram of a site replication deployment with two sites

   The load balancer automatically routes client requests using configured logic (geo-local, latency, etc.).
   Data written to one site automatically replicates to the other peer site.
   If one peer site fails completely, the load balancer automatically routes requests to the remaining healthy peer site.

The load balancer should meet the same requirements as single-site deployments regarding connection balancing and header preservation.
MinIO replication handles transient failures by queuing objects for replication.
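
For load balancer health checks, MinIO exposes unauthenticated health endpoints; the following sketch uses ``curl`` against a placeholder hostname:

.. code-block:: shell

   # Node liveness: returns 200 OK while the MinIO process is running
   curl -I https://minio.example.net/minio/health/live

   # Cluster health: returns 200 OK only while the deployment has read and write quorum
   curl -I https://minio.example.net/minio/health/cluster
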
MinIO replication can automatically heal a site that has partial or total data loss due to transient or sustained downtime.
If a peer site completely fails, you can remove that site from the configuration entirely.
The load balancer configuration should also remove that site to avoid routing client requests to the offline site.

You can then restore the peer site, either after repairing the original hardware or replacing it entirely, by adding it back to the site replication configuration.
MinIO automatically begins resynchronizing existing data while continuously replicating new data.

.. figure:: /images/architecture/architecture-load-balancer-multi-site-healing.svg
   :figwidth: 100%
   :alt: Diagram of a multi-site deployment with a healing site

   The peer site has recovered and reestablished connectivity with its healthy peers.
   MinIO automatically works through the replication queue to catch the site back up.

Once all data synchronizes, you can restore normal connectivity to that site.
Depending on the amount of replication lag, the latency between sites, and the overall workload :abbr:`I/O (Input / Output)`, you may need to temporarily stop write operations to allow the sites to completely catch up.

@@ -24,6 +24,9 @@ This page provides an overview of MinIO's availability and resiliency design and
Community users can seek support on the `MinIO Community Slack <https://slack.min.io>`__.
Community Support is best-effort only and has no SLAs around responsiveness.

Distributed MinIO Deployments
-----------------------------

MinIO implements :ref:`erasure coding <minio-erasure-coding>` as the core component in providing availability and resiliency during drive or node-level failure events.
MinIO partitions each object into data and :ref:`parity <minio-ec-parity>` shards and distributes those shards across a single :ref:`erasure set <minio-ec-erasure-set>`.
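
The split between data and parity shards follows the configured storage class.
For example, the following sketch (assuming an alias ``myminio``) sets 4 parity shards per object for the default storage class; the new parity value applies to objects written after the change:

.. code-block:: shell

   # Set the default (STANDARD) storage class to 4 parity shards per object
   mc admin config set myminio storage_class standard=EC:4

   # Restart the deployment so the new setting takes effect
   mc admin service restart myminio
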
@@ -158,4 +161,48 @@ For multi-pool MinIO deployments, each pool requires at least one erasure set ma
Use replicated remotes to restore the lost data to the deployment.
All data stored on the healthy pools remains safe on disk.

Replicated MinIO Deployments
----------------------------

MinIO implements :ref:`site replication <minio-site-replication-overview>` as the primary measure for ensuring Business Continuity and Disaster Recovery (BC/DR) in the case of both small- and large-scale data loss in a MinIO deployment.

.. figure:: /images/availability/availability-multi-site-setup.svg
   :figwidth: 100%
   :alt: Diagram of a multi-site deployment during initial setup

   Each peer site is deployed to an independent datacenter to provide protection from large-scale failure or disaster.
   If one datacenter goes completely offline, clients can fail over to the other site.

MinIO replication can automatically heal a site that has partial or total data loss due to transient or sustained downtime.

.. figure:: /images/availability/availability-multi-site-healing.svg
   :figwidth: 100%
   :alt: Diagram of a multi-site deployment while healing

   Datacenter 2 was down and Site B requires resynchronization.
   The load balancer routes operations to Site A in Datacenter 1.
   Site A continuously replicates data to Site B.

Once all data synchronizes, you can restore normal connectivity to that site.
Depending on the amount of replication lag, the latency between sites, and the overall workload :abbr:`I/O (Input / Output)`, you may need to temporarily stop write operations to allow the sites to completely catch up.

If a peer site completely fails, you can remove that site from the configuration entirely.
The load balancer configuration should also remove that site to avoid routing client requests to the offline site.

You can then restore the peer site, either after repairing the original hardware or replacing it entirely, by :ref:`adding it back to the site replication configuration <minio-expand-site-replication>`.
MinIO automatically begins resynchronizing existing data while continuously replicating new data.
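
As a rough sketch of that remove-and-restore cycle, assuming hypothetical aliases ``sitea`` (the healthy peer) and ``siteb`` (the failed peer, later rebuilt); the exact arguments depend on your existing configuration:

.. code-block:: shell

   # Remove the failed peer from the site replication configuration
   mc admin replicate rm sitea siteb --force

   # After rebuilding siteb, add it back; MinIO then begins
   # resynchronizing existing data to the restored peer
   mc admin replicate add sitea siteb
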
Sites can continue processing operations during resynchronization by proxying ``GET/HEAD`` requests to healthy peer sites.

.. figure:: /images/availability/availability-multi-site-proxy.svg
   :figwidth: 100%
   :alt: Diagram of a multi-site deployment while healing

   Site B does not have the requested object, possibly due to replication lag.
   It proxies the ``GET`` request to Site A.
   Site A returns the object, which Site B then returns to the requesting client.

The client receives the results from the first peer site to return *any* version of the requested object.

``PUT`` and ``DELETE`` operations synchronize using the regular replication process.
``LIST`` operations do not proxy and require clients to issue them exclusively against healthy peers.

@@ -88,6 +88,25 @@ Any MinIO deployment in the site replication configuration can resynchronize dam
If one site loses data for any reason, resynchronize the data from another healthy site with :mc-cmd:`mc admin replicate resync`.
This launches an active process that resynchronizes the data without waiting for the passive MinIO scanner to recognize the missing data.
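
For example, assuming hypothetical aliases ``site1`` (a healthy peer) and ``site2`` (the site that lost data), a resynchronization might look like the following:

.. code-block:: shell

   # Actively push missing data from the healthy site to the damaged site
   mc admin replicate resync start site1 site2

   # Track the progress of the resynchronization
   mc admin replicate resync status site1 site2
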
Proxy to Other Sites
~~~~~~~~~~~~~~~~~~~~

MinIO peer sites can proxy ``GET/HEAD`` requests for an object to other peers to check if it exists.
This allows a site that is healing or lagging behind other peers to still return an object persisted to other sites.

For example:

1. A client issues ``GET("data/invoices/january.xls")`` to ``Site1``.
2. ``Site1`` cannot locate the object.
3. ``Site1`` proxies the request to ``Site2``.
4. ``Site2`` returns the latest version of the requested object.
5. ``Site1`` returns the proxied object to the client.

For ``GET/HEAD`` requests that do *not* include a unique version ID, the proxy request returns the *latest* version of that object on the peer site.
This may result in retrieval of a non-current version of the object, such as when the responding peer site is also experiencing replication lag.
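
Clients that need an exact version rather than whatever the proxying peer considers latest can pin the request to a version ID.
The following sketch uses ``mc`` with a placeholder version ID:

.. code-block:: shell

   # Request metadata for one specific object version rather than
   # the latest version returned by the proxy
   mc stat --version-id "dfbd25b3-abec-4184-9eb5-819a0024cc67" site1/data/invoices/january.xls
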
MinIO does not proxy ``LIST``, ``DELETE``, and ``PUT`` operations.

Prerequisites
-------------