@dcmcand
Contributor

@dcmcand dcmcand commented Mar 28, 2025

Reference Issues or PRs

Closes #2952

What does this implement/fix?

Moves nodes to private subnets and removes the auto-assign public IP option.

Currently, our nodes are placed in public subnets with a public IP assigned by default. This is a security vulnerability that gives us no benefit. The new setup places all nodes in private subnets while keeping load balancers in public subnets. Nebari will still be publicly accessible through the load balancers, but the nodes themselves will no longer be reachable over the public internet.

The following illustration is from the AWS documentation (https://docs.aws.amazon.com/eks/latest/best-practices/subnets.html) and shows the new setup. Note that this is the recommended setup for EKS on AWS.

[Image: subnet layout diagram from the AWS documentation]

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

How to test this PR?

Deploy Nebari to AWS. In the AWS console, verify that the nodes are located in private subnets, then go through the testing checklist to confirm that all functionality is unchanged.
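The console check could also be scripted. The following is a minimal sketch, not part of this PR: it assumes you already have a response in the shape returned by the EC2 DescribeInstances API (for example from `boto3.client("ec2").describe_instances()`), and the helper name `instances_with_public_ip` and the sample data are hypothetical.

```python
def instances_with_public_ip(response: dict) -> list[str]:
    """Return the IDs of instances that report a public IPv4 address.

    `response` is expected to follow the EC2 DescribeInstances shape:
    Reservations -> Instances -> {InstanceId, PublicIpAddress (optional), ...}.
    """
    exposed = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Instances without a public address simply omit this key.
            if instance.get("PublicIpAddress"):
                exposed.append(instance["InstanceId"])
    return exposed


# Hand-written sample in the DescribeInstances shape; IDs and addresses are made up.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {"InstanceId": "i-aaaa1111", "PrivateIpAddress": "10.10.64.12"},
                {
                    "InstanceId": "i-bbbb2222",
                    "PrivateIpAddress": "10.10.0.7",
                    "PublicIpAddress": "203.0.113.25",
                },
            ]
        }
    ]
}

print(instances_with_public_ip(sample_response))
```

After deploying with this change, the expectation would be an empty list for the cluster's worker nodes.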

Any other comments?

NOTE: This will likely result in the general node failing to restart if it ends up in a different AZ from its EBS volume. This is a known issue and needs to be addressed by changing our storage setup.
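To illustrate the constraint behind this note: an EBS volume can only be attached to an instance in the same Availability Zone. The sketch below is illustrative only; it assumes volume records in the shape returned by the EC2 DescribeVolumes API, and the helper name `volumes_in_other_az`, the zone names, and the volume IDs are hypothetical.

```python
def volumes_in_other_az(node_az: str, volumes: list[dict]) -> list[str]:
    """Return the IDs of volumes that a node in `node_az` cannot attach."""
    return [v["VolumeId"] for v in volumes if v["AvailabilityZone"] != node_az]


# Made-up volume records using the VolumeId / AvailabilityZone keys.
sample_volumes = [
    {"VolumeId": "vol-aaaa1111", "AvailabilityZone": "us-west-2a"},
    {"VolumeId": "vol-bbbb2222", "AvailabilityZone": "us-west-2b"},
]

# A replacement node scheduled into us-west-2b cannot reach the volume in us-west-2a.
print(volumes_in_other_az("us-west-2b", sample_volumes))
```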

@dcmcand dcmcand added provider: AWS impact: high 🟥 This issue affects most of the nebari users or is a critical issue area: security 🔐 area: networking All items related to networking labels Mar 28, 2025
@dcmcand dcmcand moved this from New 🚦 to In review/QA 👀 in 🪴 Nebari Project Management Mar 28, 2025
@marcelovilla
Member

NOTE: This will likely result in the general node failing to restart if it ends up in a different AZ from its EBS volume. This is a known issue and needs to be addressed by changing our storage setup.

This issue is outlined in #3008. @dcmcand, @viniciusdc, and I had a discussion regarding this limitation and decided that we'll try to first address #3008 before merging this PR.

@dcmcand dcmcand added needs: review 👀 This PR is complete and ready for reviewing DO-NOT-MERGE status: in review 👀 This PR is currently being reviewed by the team labels Apr 1, 2025
@dcmcand dcmcand marked this pull request as ready for review April 1, 2025 11:14
@dcmcand dcmcand requested a review from a team as a code owner April 1, 2025 11:14
@dcmcand dcmcand requested review from marcelovilla and viniciusdc and removed request for a team April 1, 2025 11:14
@dcmcand
Contributor Author

dcmcand commented Apr 1, 2025

Do not merge until #3008 is fixed as this will cause difficulties with upgrades.

description = "VPC cidr number of bits to support 2^N subnets"
type = number
- default = 2
+ default = 2 # allows 4 /18 subnets with 16382 addresses each
Contributor


Suggested change
- default = 2 # allows 4 /18 subnets with 16382 addresses each
+ default = 3 # allows 8 /19 subnets with 8190 addresses each

needed this for my use case with 3 subnets specified
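The effect of the bit count on subnet size can be checked with a short sketch. This is not the module's code: it assumes a /16 VPC CIDR (which is what makes 2 bits yield /18 subnets), and both the `10.10.0.0/16` range and the helper name `describe_subnets` are hypothetical.

```python
import ipaddress


def describe_subnets(vpc_cidr: str, newbits: int) -> tuple[int, int, int]:
    """Return (subnet count, subnet prefix length, addresses per subnet)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # Extending the prefix by `newbits` splits the range into 2**newbits subnets.
    subnets = list(vpc.subnets(prefixlen_diff=newbits))
    first = subnets[0]
    return len(subnets), first.prefixlen, first.num_addresses


# Assumed /16 VPC range, for illustration only.
for bits in (2, 3):
    count, prefix, size = describe_subnets("10.10.0.0/16", bits)
    print(f"newbits={bits}: {count} subnets of /{prefix}, {size} addresses each")
```

Under that assumption, each extra bit doubles the number of subnets and halves their size: 2 bits gives 4 /18 subnets and 3 bits gives 8 /19 subnets. The totals here include the network and broadcast addresses, which is why they are 2 higher than the host counts quoted in the diff comment.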

@viniciusdc
Contributor

We need an upgrade path in the next (follow-up) release; this is the last remaining piece needed to get this going.

yarikoptic pushed a commit to asmacdo/nebari that referenced this pull request Oct 2, 2025
Annotation for our "manual patch queue":

 commit done manually based on @satra's comment at
 nebari-dev#3004 (comment)
@asmacdo
Contributor

asmacdo commented Oct 6, 2025

I don't think I can help much with providing an upgrade path, but I did want to share feedback after using this in production for a while. We eventually dropped this change from our deployment because of the high cost of NAT Gateway usage for moving data around.

When we moved back from private to public subnets, our upgrade path was a bit awkward. It's the reverse of what an upgrade path for this PR would be, so in case it's helpful, here's how I managed to move from private to public subnets:

  • Nebari hangs on the deletion of a public subnet, so I manually deleted the Elastic Load Balancer, after which Nebari proceeded
  • similarly, Nebari tries to delete the private subnets, but these depend on the EKS cluster, which also had to be manually terminated
  • the first run of the deployment eventually fails with a 404 for GET /auth/admin/realms/nebari/default-groups (IIUC this is caused by the removal of the EKS cluster invalidating the Keycloak state, but tofu doesn't know that)
  • in the Keycloak configuration stage, run tofu state rm keycloak_default_groups.default, then redeploy

This upgrade path does lose state, though: I had to restore Keycloak and conda-store state from backups.


Labels

area: networking All items related to networking area: security 🔐 DO-NOT-MERGE impact: high 🟥 This issue affects most of the nebari users or is a critical issue needs: review 👀 This PR is complete and ready for reviewing provider: AWS status: in review 👀 This PR is currently being reviewed by the team

Projects

Status: In review/QA 👀

Development

Successfully merging this pull request may close these issues.

Fix code scanning alert - Instances in a subnet should not receive a public IP address by default.

6 participants