DOCS-10575-DDR #29293

Open
wants to merge 56 commits into
base: master

Contributor

@iadjivon iadjivon commented May 12, 2025

What does this PR do? What is the motivation?

New DDR doc.
Editorial Review: https://datadoghq.atlassian.net/browse/DOCS-11215

Merge instructions

Merge readiness:

  • Ready for merge

For Datadog employees:
Merge queue is enabled in this repo. Your branch name MUST follow the <name>/<description> convention and include the forward slash (/). Without this format, your pull request will not pass in CI, the GitLab pipeline will not run, and you won't get a branch preview. Getting a branch preview makes it easier for us to check any issues with your PR, such as broken links.

If your branch doesn't follow this format, rename it or create a new branch and PR.
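As an illustration of the naming rule above, a local check could look like the following minimal sketch. The pattern and the function name `follows_branch_convention` are assumptions made for this example; the actual CI check may be stricter or implemented differently.

```python
import re

# Assumed shape of the convention: a non-empty name, a forward slash,
# then a non-empty description with no whitespace.
BRANCH_PATTERN = re.compile(r"[^/\s]+/\S+")


def follows_branch_convention(branch: str) -> bool:
    """Return True if the branch looks like <name>/<description>."""
    return BRANCH_PATTERN.fullmatch(branch) is not None


if __name__ == "__main__":
    for candidate in ("ida.adjivon/DOCS-10575-DiRec-for-okr11", "DOCS-10575-DDR"):
        print(candidate, follows_branch_convention(candidate))
```

Running a check like this before pushing would flag a branch that lacks the forward slash, which is the case the template warns about.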

To have your PR automatically merged after it receives the required reviews, add the following PR comment:

/merge

Additional notes

@iadjivon iadjivon requested a review from a team as a code owner May 12, 2025 19:24
@github-actions github-actions bot added the Guide Content impacting a guide label May 12, 2025
@iadjivon iadjivon added the okr11 label May 12, 2025
Contributor

Preview links (active after the build_preview check completes)

New or renamed files

@datadog-datadog-prod-us1

datadog-datadog-prod-us1 bot commented May 12, 2025

No data reported at this time.
This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 6a55768

@jhgilbert jhgilbert added the WORK IN PROGRESS No review needed, it's a wip ;) label May 12, 2025
@jhgilbert
Contributor

Hi @iadjivon! I see you noted this is a work in progress -- I added the "work in progress" label to keep this PR out of the oncall review queue. Thank you!

Contributor

@heyronhay heyronhay left a comment

Some comments - looks pretty great overall, though!

Contributor

github-actions bot commented May 28, 2025

📝 Documentation Team Review Required

This pull request requires approval from the @DataDog/documentation team before it can be merged.

Please ensure your changes follow our documentation guidelines and wait for a team member to review and approve your changes.

@michael-richey michael-richey left a comment

looking good

@iadjivon iadjivon requested a review from janine-c June 30, 2025 17:40
Contributor

@janine-c janine-c left a comment

Thanks for your patience on this review, Ida! This is really coming together. Let me know if you have any questions about my comments or if you want to have a chat! Nailing down the structure for a complicated procedure always takes some thought, and this one is a doozy all put together 🙂

Datadog Agent version **7.54 or above** is required for Datadog Disaster Recovery.

### Supported telemetry types and products
The Agent-based failover description provided on this page supports failover of the following telemetry types and products:
Contributor

The phrase "The Agent-based failover description provided on this page" confused me a little bit. I'm not sure what the description is; is it something you can link to? Or maybe you can rephrase to something like "The Agent-based failover supports..."?


If you're also sending telemetry to Datadog using cloud provider integrations, you must add your cloud provider accounts in the DDR org.

Datadog does not use cloud providers to receive telemetry data while the DDR site is passive.
Contributor

Can you explain what "passive" means here?

Contributor Author

When it is not actively in failover. But I am confirming with the team. BRB on this.

## Setup
To enable Datadog Disaster Recovery:

1. [Configure Datadog Disaster Recovery](#configure-datadog-disaster-recovery)
Contributor

I found the numbering system here a little confusing; each of these headings has a number, but only in this list, and then each of the sub-items has a number as well. So if someone messages their colleague to say something like "I'm stuck on step 3," it might be hard for the colleague to figure out where they are. It could be helpful to consider combining letters and numbers, so someone could refer to "3a" instead or something like that.

Contributor

I might put some time into thinking about exactly which steps need to happen in sequence, and thus definitively need numbers associated with them. Particularly in the "Configure your DDR organization" section, it seems as though the accordions can happen in virtually any order. I wonder if it would make sense to just make the headings numbered, and remove numbers from the accordions, so the overall page might feel a little less overwhelming.

Contributor Author

Very great point!!

Comment on lines 303 to 305
[4]: https://docs.datadoghq.com/getting_started/site/#access-the-datadog-site
[5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md
[6]: https://docs.datadoghq.com/logs/log_configuration/attributes_naming_convention/#overview
[7]: https://docs.datadoghq.com/agent/remote_config/?tab=configurationyamlfile
[8]: https://docs.datadoghq.com/api/latest/organizations/#get-organization-information
[9]: https://docs.datadoghq.com/account_management/saml/#overview
Contributor

Suggested change
- [4]: https://docs.datadoghq.com/getting_started/site/#access-the-datadog-site
- [5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md
- [6]: https://docs.datadoghq.com/logs/log_configuration/attributes_naming_convention/#overview
- [7]: https://docs.datadoghq.com/agent/remote_config/?tab=configurationyamlfile
- [8]: https://docs.datadoghq.com/api/latest/organizations/#get-organization-information
- [9]: https://docs.datadoghq.com/account_management/saml/#overview
+ [4]: /getting_started/site/#access-the-datadog-site
+ [5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md
+ [6]: /logs/log_configuration/attributes_naming_convention/#overview
+ [7]: /agent/remote_config/?tab=configurationyamlfile
+ [8]: /api/latest/organizations/?code-lang=curl#get-organization-information
+ [9]: /account_management/saml/#overview

We tend to use relative links for linking within the docs. Absolute links still work, so it's not a huge deal, but I like to keep our approach consistent.

Additionally, for [8], the paragraph that introduces the link says you have to use the cURL command, so this links directly to the cURL tab.
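
The conversion suggested above is mechanical, so it could be scripted. The following is a minimal sketch, assuming the reference-style link definitions look like the ones quoted in this thread. `to_relative` is a hypothetical helper, not part of the docs tooling, and it does not add query parameters such as `?code-lang=curl`.

```python
import re

# Matches a reference-style link definition that points at the docs site,
# for example "[4]: https://docs.datadoghq.com/getting_started/site/".
DOCS_LINK = re.compile(r"^(\[\d+\]:\s*)https://docs\.datadoghq\.com(/\S*)$")


def to_relative(ref_line: str) -> str:
    """Rewrite an absolute docs link definition as a relative one.

    Lines that do not point at docs.datadoghq.com are returned unchanged.
    """
    return DOCS_LINK.sub(r"\1\2", ref_line)


if __name__ == "__main__":
    lines = [
        "[4]: https://docs.datadoghq.com/getting_started/site/#access-the-datadog-site",
        "[5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md",
    ]
    for line in lines:
        print(to_relative(line))
```

External links, such as the GitHub README in [5], pass through untouched, which matches the suggested change.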


{{% /collapse-content %}}

{{% collapse-content title=" 8. Update your Datadog Agent configuration" level="h5" %}}
Contributor

Suggested change
- {{% collapse-content title=" 8. Update your Datadog Agent configuration" level="h5" %}}
+ {{% collapse-content title=" 8. Send telemetry to your DDR org" level="h5" %}}

I think this title gets to the heart of what the user is trying to accomplish a little better. It can be a little confusing to get a step to update the config towards the end of a process, so this can help clarify that.

Also, this seems like another recommended step that isn't labelled as such?

Lastly, I wasn't sure whether this should say DDR org or DDR site. I see they're both used in this topic, and am not sure if there's a difference? Maybe if they're interchangeable, we should just pick one and stick with it. The latter only appears in a few places, so at least that would be easy 🙂


{{% /collapse-content %}}

{{% collapse-content title=" 10. Activate and test DDR failover in cloud integrations" level="h5" %}}
Contributor

There seems to be a mismatch between the title and the content for this accordion. The content doesn't seem to address activating or testing failovers for cloud integrations, so I got confused about what I was supposed to take away. Does it mean you have to contact your CSM or support to be able to do these things? Or can you do them on the landing page? If the latter, it seems odd that there are no instructions?

@iadjivon iadjivon force-pushed the ida.adjivon/DOCS-10575-DiRec-for-okr11 branch from 3ea082d to 0ee88af Compare July 17, 2025 18:27
@github-actions github-actions bot removed the Architecture Everything related to the Doc backend label Jul 22, 2025
Contributor Author

@iadjivon iadjivon left a comment

Added edits! Let me know what you think!
Thanks for this thorough review! 🥰

Comment on lines 77 to 73
### Retrieve the public IDs and link your organization
{{% collapse-content title=" 2. Retrieve the public IDs of your orgs and link the DDR org to the primary org" level="h5" %}}

#### Retrieve the public IDs
Contributor Author

modified this, lmk what you

@iadjivon iadjivon requested a review from janine-c July 22, 2025 18:18
Contributor

@janine-c janine-c left a comment

This is really coming together! The structure makes a lot more sense to me. I have some comments; as always, I'm happy to chat more about them if you like!



## Prerequisites
Datadog Disaster Recovery requires Datadog Agent version **7.54 or above**. The APM product support requires a **v7.68 or above**.
Contributor

Suggested change
- Datadog Disaster Recovery requires Datadog Agent version **7.54 or above**. The APM product support requires a **v7.68 or above**.
+ Datadog Disaster Recovery requires Datadog Agent **v7.54+**. The APM product support requires **v7.68+**.

I noticed that we had a couple of different ways of writing out versions, so I tried to standardize it a bit with the table below. If you think it's harder to read, especially with the + in the middle of a sentence, feel free to modify!
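
To make the two minimums concrete, here is a minimal sketch of a version check. The thresholds come from the prerequisite text above; the helper names and the parsing rules are assumptions for illustration. Versions are compared as integer tuples, because comparing them as strings would rank `7.9` above `7.54`.

```python
import re

# Minimum Agent versions taken from the prerequisite text in this PR.
MIN_DDR = (7, 54)  # Datadog Disaster Recovery
MIN_APM = (7, 68)  # APM product support


def parse_version(text: str) -> tuple:
    """Extract (major, minor) from strings such as 'v7.54' or '7.68.1'."""
    match = re.match(r"v?(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return (int(match.group(1)), int(match.group(2)))


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Return True if the version is at or above the given minimum."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    for agent in ("7.54.0", "v7.60", "7.68.1"):
        print(agent, meets_minimum(agent, MIN_DDR), meets_minimum(agent, MIN_APM))
```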

Comment on lines 53 to 55
- Go to [Get Started with Datadog](https://app.datadoghq.com/signup)
- Choose a different Datadog site than your primary (for example if you're on `US1`, choose `EU` or `US5`)
- Follow the prompts to create an account
Contributor

@janine-c janine-c Jul 23, 2025

Suggested change
- - Go to [Get Started with Datadog](https://app.datadoghq.com/signup)
- - Choose a different Datadog site than your primary (for example if you're on `US1`, choose `EU` or `US5`)
- - Follow the prompts to create an account
+ 1. Go to [Get Started with Datadog](https://app.datadoghq.com/signup). You may need to log out of your current session, or use incognito mode to access this page.
+ 2. Choose a different Datadog site than your primary (for example, if you're on `US1`, choose `EU` or `US5`).
+ 3. Follow the prompts to create an account.

Contributor

I'm wondering if you see the same thing that I do: if I open the Get Started with Datadog link in a browser where I'm already logged in, I land on the app homepage and not the signup page. And I think users with primary accounts would see the same thing, right? Unless maybe it's on my end?


{{% collapse-content title="Retrieve the public IDs and link your DDR and primary orgs " level="h5" %}}

After the Datadog team has set your DDR org, use the cURL commands from the Datadog [public API endpoint][8] to retrieve the public IDs of the primary and DDR org.
Contributor

I still don't think you can use this endpoint if you don't have the public ID, since the ID is a path parameter you need to make the call? In other words, if you don't have the public ID, what are you going to do with the API?

@janine-c is correct here, you have to already have your public_id to use this endpoint.
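
To illustrate the point made in this thread, here is a minimal sketch showing why the single-org endpoint cannot be the starting point: its URL cannot be built without the public ID. The path `/api/v1/org/{public_id}` follows the endpoint linked as [8]. The helper names are hypothetical, and the `orgs` / `name` / `public_id` response shape used by `public_ids_from_listing` is an assumption about what an org-listing response might look like, so verify it against the API reference before relying on it.

```python
def org_info_url(api_host: str, public_id: str) -> str:
    """Build the URL for the get-organization-information call.

    The public ID is a path parameter, so the call cannot be made without it.
    """
    if not public_id:
        raise ValueError("public_id is required: it is part of the request path")
    return f"https://{api_host}/api/v1/org/{public_id}"


def public_ids_from_listing(payload: dict) -> dict:
    """Map org names to public IDs from an org-listing response.

    Assumes a payload shaped like {"orgs": [{"name": ..., "public_id": ...}]}.
    """
    return {org["name"]: org["public_id"] for org in payload.get("orgs", [])}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {"orgs": [{"name": "primary", "public_id": "abc123"}]}
    ids = public_ids_from_listing(sample)
    print(org_info_url("api.datadoghq.com", ids["primary"]))
```

In other words, the doc would need to point readers at a way to obtain the public ID first, and only then at the endpoint that takes it as input.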

Comment on lines 77 to 79
To link your DDR and primary orgs, run these commands replacing the placeholders for their values:

<div class="alert alert-warning"> For security reasons, Datadog is unable to link the orgs on your behalf. </div>
Contributor

Maybe switch these two around, so the alert doesn't get in between the intro and the commands?

</div>

#### Verify availability at the DDR site
Verify that your DDR org is accessible and that your Dashboards and Monitors are copied from your primary org to your DDR org.
Contributor

Suggested change
- Verify that your DDR org is accessible and that your Dashboards and Monitors are copied from your primary org to your DDR org.
+ Verify that your DDR org is accessible and that your dashboards and monitors are copied from your primary org to your DDR org.

Comment on lines 285 to 286

##### Cloud integrations failover
Contributor

Suggested change
- ##### Cloud integrations failover

Doesn't seem necessary to just have this one heading in the collapsible?


{{% /collapse-content %}}

{{% collapse-content title="Activate and test DDR failover in cloud integrations" level="h5" %}}
Contributor

This section doesn't contain information on how to activate and test cloud integrations, which makes it a little confusing. The verbs that indicate what a section contains are very important so users can figure out how the docs align with their objectives. Here, there's a blurb that talks about how failing over cloud integrations is separate in the DDR org, somehow...but I still don't understand what a user is supposed to do with that information if they want to activate and test the failover for their integrations. Do they have to contact their CSM/support for help if they want to test it? If so, we should state that explicitly.

Contributor Author

@michael-richey is there an activation step/button for integrations available in the disaster recovery landing page in the DDR region?

If there is, @janine-c , would adding that clarify the paragraph here?

Contributor

Yeah I think it could! Basically, the problem I'm trying to avoid here is making users think there's an action to carry out ("ooh, activate and test! I'll feel so much more secure in my approach once I've done that!") and then not give them clear instructions on that action. It looks like we're kind of hinting at something they can do on the disaster recovery landing page, but it's vague and I'm not sure what it really does. Some clarity here about what to expect and what to look for to indicate the test was a success would really help! Otherwise, users might think they've successfully performed the test, but we don't want them to find out that they didn't during an incident!

@michael-richey michael-richey Jul 29, 2025

I can walk either of you through this if you want a live demo, but it's confusing and a little dangerous, so maybe that's why we're trying to push them to talk to someone? Integrations need to be configured in both orgs. Then there is literally a button on the disaster recovery landing page that acts as the Frankenstein switch mentioned above. When they hit that button, integration telemetry stops going to R1 (that's the dangerous part they really need to understand) and starts going to R2.


<!-- ------------------------------- -->

### 3. Test Run failover tests in various environments
Contributor

Suggested change
- ### 3. Test Run failover tests in various environments
+ ### 3. Test run failover tests in various environments


### 2. Set up access, integrations, syncing, and agents

{{% collapse-content title="Create your Datadog API and App key for syncing" level="h5" %}}
Contributor

I just realized that the step where they create the keys and the step where they use them are kind of far away from each other. I don't know if they're in this order for a specific reason, but I would suggest putting them next to each other if you can:

  • It can be reassuring to a user to use something right after they make it, so the process makes intuitive sense
  • Putting the steps close to each other makes it more likely that they still have the keys open in the window or copied to their clipboard, which makes it more convenient

Labels
editorial review (Waiting on a more in-depth review), Guide (Content impacting a guide)
7 participants