From e8df32a9407cc782a7f3669be5dbce91f5c0a5fc Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Tue, 14 Jan 2025 17:39:43 -0500 Subject: [PATCH] WIP++ --- ...zone-configs-overwritten-during-restore.md | 1 + .../zone-configs/avoid-manual-zone-configs.md | 7 +++ src/current/v24.3/alter-database.md | 11 +++- src/current/v24.3/backup.md | 2 +- .../v24.3/cluster-setup-troubleshooting.md | 24 ++++---- .../v24.3/configure-replication-zones.md | 4 ++ src/current/v24.3/restore.md | 2 +- .../v24.3/troubleshoot-replication-zones.md | 59 +++++++++++-------- 8 files changed, 69 insertions(+), 41 deletions(-) create mode 100644 src/current/_includes/v24.3/backups/zone-configs-overwritten-during-restore.md create mode 100644 src/current/_includes/v24.3/zone-configs/avoid-manual-zone-configs.md diff --git a/src/current/_includes/v24.3/backups/zone-configs-overwritten-during-restore.md b/src/current/_includes/v24.3/backups/zone-configs-overwritten-during-restore.md new file mode 100644 index 00000000000..f17be399b3d --- /dev/null +++ b/src/current/_includes/v24.3/backups/zone-configs-overwritten-during-restore.md @@ -0,0 +1 @@ +[Zone configurations]({% link {{ page.version.version }}/configure-replication-zones.md %}) present on the destination cluster prior to a restore will be **overwritten** during a [cluster restore]({% link {{ page.version.version }}/restore.md %}#full-cluster) with the zone configurations from the [backed up cluster]({% link {{ page.version.version }}/backup.md %}#back-up-a-cluster). If there were no customized zone configurations on the cluster when the backup was taken, then after the restore the destination cluster will use the zone configuration from the [`RANGE DEFAULT` configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}#view-the-default-replication-zone). 
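+
+To confirm which zone configurations are in effect on the destination cluster after a restore, you can run statements like the following (a sketch; run them from a SQL client connected to the restored cluster):
+
+~~~ sql
+-- Show the RANGE default configuration that other objects inherit from.
+SHOW ZONE CONFIGURATION FOR RANGE default;
+
+-- Show every zone configuration currently in effect.
+SHOW ALL ZONE CONFIGURATIONS;
+~~~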
diff --git a/src/current/_includes/v24.3/zone-configs/avoid-manual-zone-configs.md b/src/current/_includes/v24.3/zone-configs/avoid-manual-zone-configs.md new file mode 100644 index 00000000000..e5f57dd1d25 --- /dev/null +++ b/src/current/_includes/v24.3/zone-configs/avoid-manual-zone-configs.md @@ -0,0 +1,7 @@ +Cockroach Labs does not recommend adding zone configurations manually, for the following reasons: + +- It is easy to introduce logic errors and end up in a state where your replication is not behaving as intended. +- It is difficult to apply proper change management and auditing to manually altered zone configurations. +- Manual zone config modifications are managed by the user with no help from the system, and must be fully rewritten on each configuration change in order to take effect; this introduces another avenue for error. + +For these reasons, most users should use [multi-region SQL statements]({% link {{ page.version.version }}/multiregion-overview.md %}) instead; if additional control is needed, [zone config extensions]({% link {{ page.version.version }}/zone-config-extensions.md %}) can be used to augment the multi-region SQL statements. diff --git a/src/current/v24.3/alter-database.md b/src/current/v24.3/alter-database.md index f13cc0e5dc3..16db8b4e12e 100644 --- a/src/current/v24.3/alter-database.md +++ b/src/current/v24.3/alter-database.md @@ -169,6 +169,10 @@ For usage, see [Synopsis](#synopsis). If you directly change a database's zone configuration with `ALTER DATABASE ... CONFIGURE ZONE`, CockroachDB will block all [`ALTER DATABASE ... SET PRIMARY REGION`](#set-primary-region) statements on the database. 
{{site.data.alerts.end}} +{{site.data.alerts.callout_danger}} +{% include {{ page.version.version }}/zone-configs/avoid-manual-zone-configs.md %} +{{site.data.alerts.end}} + You can use *replication zones* to control the number and location of replicas for specific sets of data, both when replicas are first added and when they are rebalanced to maintain cluster equilibrium. For examples, see [Replication Controls](#configure-replication-zones). @@ -689,6 +693,10 @@ HINT: you must first drop super region usa before you can drop the region us-wes ### Configure replication zones +{{site.data.alerts.callout_danger}} +{% include {{ page.version.version }}/zone-configs/avoid-manual-zone-configs.md %} +{{site.data.alerts.end}} + {% include {{ page.version.version }}/sql/movr-statements-geo-partitioned-replicas.md %} #### Create a replication zone for a database @@ -715,7 +723,7 @@ You cannot `DISCARD` any zone configurations on multi-region tables, indexes, or ALTER DATABASE movr CONFIGURE ZONE DISCARD; ~~~ -#### Troubleshoot replication zones +### Troubleshoot replication zones {% include {{ page.version.version }}/see-zone-config-troubleshooting-guide.md %} @@ -1293,3 +1301,4 @@ For more information about the region survival goal, see [Surviving region failu - [`ALTER TABLE`]({% link {{ page.version.version }}/alter-table.md %}) - [Online Schema Changes]({% link {{ page.version.version }}/online-schema-changes.md %}) - [SQL Statements]({% link {{ page.version.version }}/sql-statements.md %}) +- [Troubleshoot Replication Zones]({% link {{ page.version.version }}/troubleshoot-replication-zones.md %}) diff --git a/src/current/v24.3/backup.md b/src/current/v24.3/backup.md index b388d776ba3..06d88f001d4 100644 --- a/src/current/v24.3/backup.md +++ b/src/current/v24.3/backup.md @@ -33,10 +33,10 @@ To view the contents of a backup created with the `BACKUP` statement, use [`SHO ## Considerations - [Full cluster backups](#back-up-a-cluster) include [license 
keys]({% link {{ page.version.version }}/licensing-faqs.md %}#set-a-license). When you [restore]({% link {{ page.version.version }}/restore.md %}) a full cluster backup that includes a license, the license is also restored. -- [Zone configurations]({% link {{ page.version.version }}/configure-replication-zones.md %}) present on the destination cluster prior to a restore will be **overwritten** during a [cluster restore]({% link {{ page.version.version }}/restore.md %}#full-cluster) with the zone configurations from the [backed up cluster](#back-up-a-cluster). If there were no customized zone configurations on the cluster when the backup was taken, then after the restore the destination cluster will use the zone configuration from the [`RANGE DEFAULT` configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}#view-the-default-replication-zone). - You cannot restore a backup of a multi-region database into a single-region database. - Exclude a table's row data from a backup using the [`exclude_data_from_backup`]({% link {{ page.version.version }}/take-full-and-incremental-backups.md %}#exclude-a-tables-data-from-backups) parameter. - `BACKUP` is a blocking statement. To run a backup job asynchronously, use the `DETACHED` option. See the [options](#options) below. +- {% include {{ page.version.version }}/backups/zone-configs-overwritten-during-restore.md %} ### Storage considerations diff --git a/src/current/v24.3/cluster-setup-troubleshooting.md b/src/current/v24.3/cluster-setup-troubleshooting.md index 8bfa5b8f79e..ea3674128f6 100644 --- a/src/current/v24.3/cluster-setup-troubleshooting.md +++ b/src/current/v24.3/cluster-setup-troubleshooting.md @@ -587,6 +587,18 @@ If you still see under-replicated/unavailable ranges on the Cluster Overview pag 1. To view the **Range Report** for a range, click on the range number in the **Under-replicated (or slow)** table or **Unavailable** table. 1. 
On the Range Report page, scroll down to the **Simulated Allocator Output** section. The table contains an error message which explains the reason for the under-replicated range. Follow the guidance in the message to resolve the issue. If you need help understanding the error or the guidance, [file an issue]({% link {{ page.version.version }}/file-an-issue.md %}). Please be sure to include the full Range Report and error message when you submit the issue. +#### Check for under-replicated or unavailable data + +To see if any data is under-replicated or unavailable in your cluster, follow the steps described in [Critical nodes endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#critical-nodes-endpoint). + +#### Check for replication zone constraint violations + +To see if any of your cluster's [data placement constraints]({% link {{ page.version.version }}/configure-replication-zones.md %}#replication-constraints) are being violated, follow the steps described in [Troubleshoot Replication Zones]({% link {{ page.version.version }}/troubleshoot-replication-zones.md %}). + +#### Check for critical localities + +To see which of your [localities]({% link {{ page.version.version }}/cockroach-start.md %}#locality) (if any) are critical, follow the steps described in the [Critical nodes endpoint documentation]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#critical-nodes-endpoint). A locality is "critical" for a range if all of the nodes in that locality becoming [unreachable](#node-liveness-issues) would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas. + ## Node liveness issues "Node liveness" refers to whether a node in your cluster has been determined to be "dead" or "alive" by the rest of the cluster. This is achieved using checks that ensure that each node connected to the cluster is updating its liveness record. 
This information is shared with the rest of the cluster using an internal gossip protocol. @@ -633,18 +645,6 @@ If your cluster is in a partially-available state due to a recent node or networ Even with `server.eventlog.enabled` set to `false`, notable log events are still sent to configured [log sinks]({% link {{ page.version.version }}/configure-logs.md %}#configure-log-sinks) as usual. -## Check for under-replicated or unavailable data - -To see if any data is under-replicated or unavailable in your cluster, follow the steps described in [Critical nodes endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#critical-nodes-endpoint). - -## Check for replication zone constraint violations - -To see if any of your cluster's [data placement constraints]({% link {{ page.version.version }}/configure-replication-zones.md %}#replication-constraints) are being violated, follow the steps described in [Troubleshoot Replication Zone Configurations]({% link {{ page.version.version}}/troubleshoot-replication-zones.md %}). - -## Check for critical localities - -To see which of your [localities]({% link {{ page.version.version }}/cockroach-start.md %}#locality) (if any) are critical, follow the steps described in the [Critical nodes endpoint documentation]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#critical-nodes-endpoint). A locality is "critical" for a range if all of the nodes in that locality becoming [unreachable](#node-liveness-issues) would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas. - ## Something else? 
If we do not have a solution here, you can try using our other [support resources]({% link {{ page.version.version }}/support-resources.md %}), including: diff --git a/src/current/v24.3/configure-replication-zones.md b/src/current/v24.3/configure-replication-zones.md index 59418914387..40c873f8048 100644 --- a/src/current/v24.3/configure-replication-zones.md +++ b/src/current/v24.3/configure-replication-zones.md @@ -28,6 +28,10 @@ This page explains how replication zones work and how to use the `ALTER ... CONF To configure replication zones, a user must be a member of the [`admin` role]({% link {{ page.version.version }}/security-reference/authorization.md %}#admin-role) or have been granted [`CREATE`]({% link {{ page.version.version }}/security-reference/authorization.md %}#supported-privileges) or [`ZONECONFIG`]({% link {{ page.version.version }}/security-reference/authorization.md %}#supported-privileges) privileges. To configure [`system` objects](#for-system-data), the user must be a member of the `admin` role. {{site.data.alerts.end}} +{{site.data.alerts.callout_danger}} +{% include {{ page.version.version }}/zone-configs/avoid-manual-zone-configs.md %} +{{site.data.alerts.end}} + ## Overview Every [range]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range) in the cluster is part of a replication zone. Each range's zone configuration is taken into account as ranges are rebalanced across the cluster to ensure that any constraints are honored. diff --git a/src/current/v24.3/restore.md b/src/current/v24.3/restore.md index e8aac279600..10f26aaf49d 100644 --- a/src/current/v24.3/restore.md +++ b/src/current/v24.3/restore.md @@ -25,9 +25,9 @@ For details on restoring across versions of CockroachDB, see [Restoring Backups - `RESTORE` only supports backups taken on a cluster on a specific major version into a cluster that is on the same version or the next major version. 
Refer to the [Restoring Backups Across Versions]({% link {{ page.version.version }}/restoring-backups-across-versions.md %}) page for more detail. - `RESTORE` is a blocking statement. To run a restore job asynchronously, use the [`DETACHED`](#detached) option. - `RESTORE` no longer requires an {{ site.data.products.enterprise }} license, regardless of the options passed to it or to the backup it is restoring. -- [Zone configurations]({% link {{ page.version.version }}/configure-replication-zones.md %}) present on the destination cluster prior to a restore will be **overwritten** during a [cluster restore]({% link {{ page.version.version }}/restore.md %}#full-cluster) with the zone configurations from the [backed up cluster]({% link {{ page.version.version }}/backup.md %}#back-up-a-cluster). If there were no customized zone configurations on the cluster when the backup was taken, then after the restore the destination cluster will use the zone configuration from the [`RANGE DEFAULT` configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}#view-the-default-replication-zone). - You cannot restore a backup of a multi-region database into a single-region database. - When the [`exclude_data_from_backup`]({% link {{ page.version.version }}/take-full-and-incremental-backups.md %}#exclude-a-tables-data-from-backups) parameter is set on a table, the table will not contain row data when restored. 
+- {% include {{ page.version.version }}/backups/zone-configs-overwritten-during-restore.md %} ## Required privileges diff --git a/src/current/v24.3/troubleshoot-replication-zones.md b/src/current/v24.3/troubleshoot-replication-zones.md index ffad1cb716d..c9711166c9c 100644 --- a/src/current/v24.3/troubleshoot-replication-zones.md +++ b/src/current/v24.3/troubleshoot-replication-zones.md @@ -1,16 +1,20 @@ --- -title: Troubleshoot Replication Zone Configurations +title: Troubleshoot Replication Zones summary: Troubleshooting guide for replication zones, which control the number and location of replicas for specific sets of data. keywords: ttl, time to live, availability zone toc: true docs_area: manage --- -This page has instructions showing how to troubleshoot scenarios where you believe replicas are not behaving as specified by your zone configurations. +This page has instructions showing how to troubleshoot scenarios where you believe replicas are not behaving as specified by your [zone configurations]({% link {{ page.version.version }}/configure-replication-zones.md %}). + +{{site.data.alerts.callout_danger}} +{% include {{ page.version.version }}/zone-configs/avoid-manual-zone-configs.md %} +{{site.data.alerts.end}} ## Prerequisites -This page assumes you have read and understood the following materials: +This page assumes you have read and understood the following: - [Replication controls > Replication zone levels]({% link {{ page.version.version }}/configure-replication-zones.md %}#replication-zone-levels), which describes how the hierarchy of inheritance of replication zones works. This is critical to understand for troubleshooting. - [Monitoring and alerting > Critical nodes endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#critical-nodes-endpoint), which is used to monitor if any of your cluster's ranges are under-replicated, or if your data placement constraints are being violated. 
@@ -18,23 +22,12 @@ This page assumes you have read and understood the following materials: ## Types of problems -There are several classes of common problems users encounter when [manually configuring replication zones]({% link {{ page.version.version }}/configure-replication-zones.md %}#manage-replication-zones). Cockroach Labs does not recommend adding zone configurations manually, since it is easy to introduce logic errors. It's also difficult to do proper change management and auditing of manually tweaked zone configurations. Most users should use [Multi-region SQL statements]({% link {{ page.version.version }}/multiregion-overview.md %}) instead; if more control is needed, [Zone config extensions]({% link {{ page.version.version }}/zone-config-extensions.md %}) can be used to augment the multi-region SQL statements. +There are several classes of common problems users encounter when [manually configuring replication zones]({% link {{ page.version.version }}/configure-replication-zones.md %}#manage-replication-zones). Generally, the problems tend to fall into one of the following categories: -- "the system isn't sending replicas where I told it to" -- "the system isn't managing replicas how I told it to" - -This behavior is almost always caused by a replication zone **misconfiguration**, but it can be difficult to see what the error is or how it was introduced. Zone configurations do not have much observability beyond `SHOW ZONE CONFIGURATIONS`, nor is there much built-in validation to prevent logic errors. It's easy to put the system in a state where you've told it to do two mutually incompatible things. - -The most common class of logic error occurs because of the way inheritance works for replication zone configurations. As discussed in [Replication Controls]({% link {{ page.version.version }}/configure-replication-zones.md %}#level-priorities), CockroachDB always uses the most granular replication zone available in a "bottom-up" fashion. 
- -When you manually set a field at, say, the table level, it overrides the value that was already set at the next level up, in the parent database. If you later change something at the database level and find that it isn't working as expected for all tables, it may be that the more-specific settings you applied at the table level are overriding the database-level settings. In other words, the system is doing what it was told, because it is respecting the table-level change you already applied. However, this may not be what you _intended_. - -As noted previously, the problems tend to fall into one of the following general categories: - -- "the system isn't sending replicas where I told it to" -- "the system isn't doing what I told it to with the replica configuration" +- "the replicas are not _where_ they should be" +- "the replicas are not _how_ they should be" Specifically: @@ -49,9 +42,13 @@ Specifically: - `gc.ttlseconds = 14400,` - `num_voters = 3,` -The most common reason why "the thing isn't going where I told it to go" or "the thing isn't doing what I told it to do" is misconfiguration. +The most common reason for these problems is misconfiguration. -[XXX](): ADD THE THING ABOUT BACKUPS OVERWRITING ZONE CONFIGS ON RESTORE? (via backup.md search for 'overwritten') +The most common class of logic error occurs because of the way inheritance works for replication zone configurations. As discussed in [Replication Controls > Level priorities]({% link {{ page.version.version }}/configure-replication-zones.md %}#level-priorities), CockroachDB always uses the most granular replication zone available in a "bottom-up" fashion. + +When you manually set a field at, say, the table level, it overrides the value that was already set at the next level up, in the parent database. 
If you later change something at the database level and find that it isn't working as expected for all tables, it may be that the more-specific settings you applied at the table level are overriding the database-level settings. In other words, the system is doing what it was told, because it is respecting the table-level change you already applied. However, this may not be what you _intended_. + +It's also possible that [you restored from a backup and your zone configs were overwritten](#zone-configs-are-overwritten-during-a-cluster-restore). ## Troubleshooting steps @@ -60,23 +57,33 @@ Troubleshooting zone configs is difficult because it requires running a mix of o - [`SHOW ZONE CONFIGURATIONS`]({% link {{ page.version.version }}/show-zone-configurations.md %}) for different levels of objects in the inheritance hierarchy and checking where they differ. - [`SHOW ALL ZONE CONFIGURATIONS`]({% link {{ page.version.version }}/show-zone-configurations.md %}#view-all-replication-zones) and parsing the output into a tree-like format that lets you see what has changed. ([XXX](): is this really what we want to say?) -[XXX](): WRITE ME +### confirm invalid behavior (critical nodes) + +### get range ID of ranges out of compliance + +### map range ID to schema object -### Run SQL statements +### look at schema object's zone config ## Considerations -### Replication system priorities +### Zone configs are overwritten during a cluster restore + +{% include {{ page.version.version }}/backups/zone-configs-overwritten-during-restore.md %} + +For more information about how backup and restore work, see [Backup and Restore Overview]({% link {{ page.version.version }}/backup-and-restore-overview.md %}). + +### Replication system priorities: data placement vs data durability -A further wrinkle is that the [Replication Layer]({% link {{ page.version.version }}/architecture/replication-layer.md %})'s top priority is avoiding data loss, _not_ putting replicas exactly where you told it to. 
For more information about this limitation, see [the data domiciling docs]({% link {{ page.version.version }}/data-domiciling.md %}#known-limitations). +As noted in [Data Domiciling with CockroachDB]({% link {{ page.version.version }}/data-domiciling.md %}): -That said, the replication logic takes user-supplied zone configurations into account when allocating replicas. +> [Zone configs]({% link {{ page.version.version }}/configure-replication-zones.md %}) can be used for data placement but these features were historically built for performance, not for domiciling. The [replication system]({% link {{ page.version.version }}/architecture/replication-layer.md %})'s top priority is to prevent the loss of data and it may override the zone configurations if necessary to ensure data durability. ### Replication lag -Sometimes the changes you make to a zone configuration are not reflected in the running system for a few minutes. Depending on the size of the cluster, this is expected behavior. It can take several minutes for changes to replica locality you specify in a changed zone config to propagate across a cluster. In general, the larger the cluster, the longer this process may take, due to the amount of data shuffling that occurs. In general, it's better to avoid making big changes to replica constraints on a large cluster unless you are prepared for it to take some time. +Sometimes the changes you make to a zone configuration are not reflected in the running system for a few minutes. Depending on the size of the cluster, this is expected behavior. It can take several minutes for changes to replica locality settings to propagate across a large cluster. In general, the larger the cluster, the longer this process may take, due to the amount of data movement that occurs. 
There is also CPU cost, since [XXX](): LINK TO CPU BASED REBALANCING -For more information about how to troubleshoot replication issues, especially under-replicated ranges, see [Troubleshoot Self-Hosted Setup > Replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). +In general, it's better to avoid making big changes to replica constraints on a large cluster unless you are prepared for it to take some time. ## See also