DOCS-10575-DDR #29293
Preview links (active after the
|
Hi @iadjivon! I see you noted this is a work in progress -- I added the "work in progress" label to keep this PR out of the oncall review queue. Thank you!
…jivon/DOCS-10575-DiRec-for-okr11
Some comments - looks pretty great overall, though!
📝 Documentation Team Review Required
This pull request requires approval from the @DataDog/documentation team before it can be merged. Please ensure your changes follow our documentation guidelines and wait for a team member to review and approve your changes.
looking good
Thanks for your patience on this review, Ida! This is really coming together. Let me know if you have any questions about my comments or if you want to have a chat! Nailing down the structure for a complicated procedure always takes some thought, and this one is a doozy all put together 🙂
Datadog Agent version **7.54 or above** is required for Datadog Disaster Recovery.

### Supported telemetry types and products
The Agent-based failover description provided on this page supports failover of the following telemetry types and products:
`The Agent-based failover description provided on this page` confused me a little bit. I'm not sure what the description is; is it something you can link to? Or maybe you can rephrase to something like `The Agent-based failover supports...`?
If you're also sending telemetry to Datadog using cloud provider integrations, you must add your cloud provider accounts in the DDR org.

Datadog does not use cloud providers to receive telemetry data while the DDR site is passive.
Can you explain what "passive" means here?
When it is not actively in failover. But I am confirming with the team. BRB on this.
## Setup
To enable Datadog Disaster Recovery:

1. [Configure Datadog Disaster Recovery](#configure-datadog-disaster-recovery)
I found the numbering system here a little confusing; each of these headings has a number, but only in this list, and then each of the sub-items has a number as well. So if someone messages their colleague to say something like "I'm stuck on step 3," it might be hard for the colleague to figure out where they are. It could be helpful to consider combining letters and numbers, so someone could refer to "3a" instead or something like that.
I might put some time into thinking about exactly which steps need to happen in sequence, and thus definitively need numbers associated with them. Particularly in the "Configure your DDR organization" section, it seems as though the accordions can happen in virtually any order. I wonder if it would make sense to just make the headings numbered, and remove numbers from the accordions, so the overall page might feel a little less overwhelming.
Very great point!!
[4]: https://docs.datadoghq.com/getting_started/site/#access-the-datadog-site
[5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md
[6]: https://docs.datadoghq.com/logs/log_configuration/attributes_naming_convention/#overview
[7]: https://docs.datadoghq.com/agent/remote_config/?tab=configurationyamlfile
[8]: https://docs.datadoghq.com/api/latest/organizations/#get-organization-information
[9]: https://docs.datadoghq.com/account_management/saml/#overview
[4]: https://docs.datadoghq.com/getting_started/site/#access-the-datadog-site
[5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md
[6]: https://docs.datadoghq.com/logs/log_configuration/attributes_naming_convention/#overview
[7]: https://docs.datadoghq.com/agent/remote_config/?tab=configurationyamlfile
[8]: https://docs.datadoghq.com/api/latest/organizations/#get-organization-information
[9]: https://docs.datadoghq.com/account_management/saml/#overview

[4]: /getting_started/site/#access-the-datadog-site
[5]: https://github.com/DataDog/datadog-sync-cli/blob/main/README.md
[6]: /logs/log_configuration/attributes_naming_convention/#overview
[7]: /agent/remote_config/?tab=configurationyamlfile
[8]: /api/latest/organizations/?code-lang=curl#get-organization-information
[9]: /account_management/saml/#overview
We tend to use relative links for linking within the docs. Absolute links still work, so it's not a huge deal, but I like to keep our approach consistent.
Additionally, for `[8]`, the paragraph that introduces the link says you have to use the cURL command, so this links directly to the cURL tab.
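For anyone who wants to apply the relative-link convention in bulk, here is a minimal Python sketch of the rewrite shown in the suggestion above. It is not part of the docs tooling: the function name `to_relative` and the assumption that every internal link starts with `https://docs.datadoghq.com/` are illustrative only, and it does not add tab parameters such as `?code-lang=curl`.

```python
import re

# Assumption: internal docs links are exactly those on this host.
DOCS_HOST = "https://docs.datadoghq.com"


def to_relative(ref_line: str) -> str:
    """Rewrite a reference-style link definition ("[4]: <url>") to a
    site-relative path when the URL points at the docs host.
    Lines that are not link definitions are returned unchanged."""
    match = re.match(r"^(\[\d+\]:\s*)(\S+)\s*$", ref_line)
    if not match:
        return ref_line
    label, url = match.groups()
    if url.startswith(DOCS_HOST + "/"):
        # Drop the scheme and host, keep the path, query, and fragment.
        url = url[len(DOCS_HOST):]
    return label + url
```

External links, such as the `datadog-sync-cli` README on GitHub, are left as absolute URLs.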
{{% /collapse-content %}}

{{% collapse-content title=" 8. Update your Datadog Agent configuration" level="h5" %}}
{{% collapse-content title=" 8. Update your Datadog Agent configuration" level="h5" %}}
{{% collapse-content title=" 8. Send telemetry to your DDR org" level="h5" %}}
I think this title gets to the heart of what the user is trying to accomplish a little better. It can be a little confusing to get a step to update the config towards the end of a process, so this can help clarify that.
Also, this seems like another recommended step that isn't labelled as such?
Lastly, I wasn't sure whether this should say `DDR org` or `DDR site`. I see they're both used in this topic, and am not sure if there's a difference? Maybe if they're interchangeable, we should just pick one and stick with it. The latter only appears in a few places, so at least that would be easy 🙂
{{% /collapse-content %}}

{{% collapse-content title=" 10. Activate and test DDR failover in cloud integrations" level="h5" %}}
There seems to be a mismatch between the title and the content for this accordion. The content doesn't seem to address activating or testing failovers for cloud integrations, so I got confused about what I was supposed to take away. Does it mean you have to contact your CSM or support to be able to do these things? Or can you do them on the landing page? If the latter, it seems odd that there are no instructions?
merge with master
3ea082d to 0ee88af
merge with master
…reated but no longer in use.
Added edits! Let me know what you think!
Thanks for this thorough review! 🥰
### Retrieve the public IDs and link your organization
{{% collapse-content title=" 2. Retrieve the public IDs of your orgs and link the DDR org to the primary org" level="h5" %}}

#### Retrieve the public IDs
modified this, lmk what you
This is really coming together! The structure makes a lot more sense to me. I have some comments; as always, I'm happy to chat more about them if you like!
## Prerequisites
Datadog Disaster Recovery requires Datadog Agent version **7.54 or above**. The APM product support requires a **v7.68 or above**.
Datadog Disaster Recovery requires Datadog Agent version **7.54 or above**. The APM product support requires a **v7.68 or above**.
Datadog Disaster Recovery requires Datadog Agent **v7.54+**. The APM product support requires **v7.68+**.
I noticed that we had a couple of different ways of writing out versions, so I tried to standardize it a bit with the table below. If you think it's harder to read, especially with the `+` in the middle of a sentence, feel free to modify!
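To make the two minimums concrete, here is a minimal Python sketch of how a reader could check an Agent version against them. The thresholds (7.54 for Disaster Recovery, 7.68 for APM support) come from the prerequisites text; the helper names `parse_version` and `meets_minimum` are hypothetical, and the sketch assumes plain numeric version strings such as `7.54.0`, with an optional leading `v` or trailing `+`.

```python
# Minimums taken from the Prerequisites text under review.
MIN_AGENT = (7, 54)      # Datadog Disaster Recovery
MIN_AGENT_APM = (7, 68)  # APM product support


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "v7.54.0", "7.54", or "v7.54+" into a tuple of integers.
    Assumes numeric components only (no pre-release suffixes)."""
    cleaned = text.strip().lstrip("vV").rstrip("+")
    return tuple(int(part) for part in cleaned.split("."))


def meets_minimum(agent_version: str, apm: bool = False) -> bool:
    """Compare major.minor against the relevant minimum."""
    required = MIN_AGENT_APM if apm else MIN_AGENT
    return parse_version(agent_version)[:2] >= required
```

Comparing tuples rather than strings avoids the usual pitfall where `"7.9"` sorts after `"7.54"`.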
- Go to [Get Started with Datadog](https://app.datadoghq.com/signup)
- Choose a different Datadog site than your primary (for example if you're on `US1`, choose `EU` or `US5`)
- Follow the prompts to create an account
- Go to [Get Started with Datadog](https://app.datadoghq.com/signup)
- Choose a different Datadog site than your primary (for example if you're on `US1`, choose `EU` or `US5`)
- Follow the prompts to create an account

1. Go to [Get Started with Datadog](https://app.datadoghq.com/signup). You may need to log out of your current session, or use incognito mode to access this page.
2. Choose a different Datadog site than your primary (for example, if you're on `US1`, choose `EU` or `US5`).
3. Follow the prompts to create an account.
I'm wondering if you see the same thing that I do - if I go to the Get Started with Datadog in a browser where I'm already logged in, I go to the app homepage and not the signup page. And I think users with primary accounts would see the same thing, right? Unless maybe it's on my end?
{{% collapse-content title="Retrieve the public IDs and link your DDR and primary orgs " level="h5" %}}

After the Datadog team has set your DDR org, use the cURL commands from the Datadog [public API endpoint][8] to retrieve the public IDs of the primary and DDR org.
@janine-c is correct here, you have to already have your `public_id` to use this endpoint.
To link your DDR and primary orgs, run these commands replacing the placeholders for their values:

<div class="alert alert-warning"> For security reasons, Datadog is unable to link the orgs on your behalf. </div>
Maybe switch these two around, so the alert doesn't get in between the intro and the commands?
</div>

#### Verify availability at the DDR site
Verify that your DDR org is accessible and that your Dashboards and Monitors are copied from your primary org to your DDR org.
Verify that your DDR org is accessible and that your Dashboards and Monitors are copied from your primary org to your DDR org.
Verify that your DDR org is accessible and that your dashboards and monitors are copied from your primary org to your DDR org.
##### Cloud integrations failover
##### Cloud integrations failover
Doesn't seem necessary to just have this one heading in the collapsible?
{{% /collapse-content %}}

{{% collapse-content title="Activate and test DDR failover in cloud integrations" level="h5" %}}
This section doesn't contain information on how to activate and test cloud integrations, which makes it a little confusing. The verbs that indicate what a section contains are very important so users can figure out how the docs align with their objectives. Here, there's a blurb that talks about how failing over cloud integrations is separate in the DDR org, somehow...but I still don't understand what a user is supposed to do with that information if they want to activate and test the failover for their integrations. Do they have to contact their CSM/support for help if they want to test it? If so, we should state that explicitly.
@michael-richey is there an activation step/button for integrations available in the disaster recovery landing page in the DDR region?
If there is, @janine-c, would adding that clarify the paragraph here?
Yeah I think it could! Basically, the problem I'm trying to avoid here is making users think there's an action to carry out ("ooh, activate and test! I'll feel so much more secure in my approach once I've done that!") and then not give them clear instructions on that action. It looks like we're kind of hinting at something they can do on the disaster recovery landing page, but it's vague and I'm not sure what it really does. Some clarity here about what to expect and what to look for to indicate the test was a success would really help! Otherwise, users might think they've successfully performed the test, but we don't want them to find out that they didn't during an incident!
I can walk either of you through this if you want a live demo, but it's confusing and a little dangerous, so maybe that's why we're trying to push them to talk to someone? Integrations need to be configured in both orgs. Then there is literally a button on the disaster recovery landing page that acts as the Frankenstein switch mentioned above. When they hit that button, integration telemetry stops going to R1 (that's the dangerous part they really need to understand) and starts going to R2.
<!-- ------------------------------- -->

### 3. Test Run failover tests in various environments
### 3. Test Run failover tests in various environments
### 3. Test run failover tests in various environments
### 2. Set up access, integrations, syncing, and agents

{{% collapse-content title="Create your Datadog API and App key for syncing" level="h5" %}}
I just realized that the step where they create the keys and the step where they use them are kind of far away from each other. I don't know if they're in this order for a specific reason, but I would suggest putting them next to each other if you can:
- It can be reassuring to a user to use something right after they make it, so the process makes intuitive sense
- Putting the steps close to each other makes it more likely that they still have the keys open in the window or copied to their clipboard, which makes it more convenient
What does this PR do? What is the motivation?
New DDR doc.
Editorial Review: https://datadoghq.atlassian.net/browse/DOCS-11215
Merge instructions
Merge readiness:
For Datadog employees:
Merge queue is enabled in this repo. Your branch name MUST follow the `<name>/<description>` convention and include the forward slash (`/`). Without this format, your pull request will not pass in CI, the GitLab pipeline will not run, and you won't get a branch preview. Getting a branch preview makes it easier for us to check any issues with your PR, such as broken links. If your branch doesn't follow this format, rename it or create a new branch and PR.
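As a quick local check before pushing, the naming rule above can be sketched in Python. This is one assumed reading of `<name>/<description>` (non-empty text on both sides of the first forward slash, no whitespace), not the repository's actual CI rule; the pattern and the function name `follows_convention` are hypothetical.

```python
import re

# Assumed interpretation of "<name>/<description>": a non-empty name with no
# slash or whitespace, a forward slash, then a non-empty description.
BRANCH_PATTERN = re.compile(r"^[^/\s]+/\S+$")


def follows_convention(branch: str) -> bool:
    """Return True when the branch name has the <name>/<description> shape."""
    return bool(BRANCH_PATTERN.match(branch))
```

A name such as `jane/fix-links` passes, while `fix-links` (no slash) does not.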
To have your PR automatically merged after it receives the required reviews, add the following PR comment:
Additional notes