Multiple metagrid node outage #729

Open
durack1 opened this issue Jan 29, 2025 · 6 comments

Comments

@durack1
Contributor

durack1 commented Jan 29, 2025

It seems the LLNL ESGF outage this morning was concurrent with outages at ANL and ORNL, though DKRZ was functioning.

Just thinking aloud: is there a metagrid config that needs changing to provide more cross-node resilience to outages?

@sashakames
Collaborator

This was a tricky one, and the issue of resiliency has become intertwined with Nimbus services, specifically now for the node status. I will keep this open until the status is restored, as a reminder that we have a workaround in place that disables that functionality.

@durack1
Contributor Author

durack1 commented Jan 29, 2025

So there's a nimbus dependency (not even the SOLR index) with https://esgf-node.cels.anl.gov/ and https://esgf-node.ornl.gov, ouch. It would be great to remove that issue ASAP; these DDoS/heavy requests bringing down the LLNL index are becoming repetitive.

@durack1 changed the title from "Multiple DOE node outage" to "Multiple metagrid node outage" on Jan 30, 2025
@durack1
Contributor Author

durack1 commented Jan 30, 2025

I was curious where we are today, so I checked.

https://aims2.llnl.gov/search
Image

https://esgf-node.ornl.gov/search
Image

https://esgf-node.cels.anl.gov/search
Image

https://esgf-metagrid.cloud.dkrz.de/search
Image

And just because, here is how the CEDA COG interface is looking:
Image
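
A manual check like the one above could be scripted. The following is only a minimal sketch, assuming that a plain HTTP GET of each node's `/search` page is a reasonable first signal; the names `http_status` and `check_nodes` are hypothetical and not part of metagrid. A page that loads does not prove that the search backend or the Nimbus-hosted API behind it is healthy, so this only catches the frontend being unreachable.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Search pages listed in this thread.
NODE_SEARCH_URLS = [
    "https://aims2.llnl.gov/search",
    "https://esgf-node.ornl.gov/search",
    "https://esgf-node.cels.anl.gov/search",
    "https://esgf-metagrid.cloud.dkrz.de/search",
]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a plain GET; raises on failure."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_nodes(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, str]:
    """Map each URL to 'up', 'http <code>', or 'down (<error name>)'.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    results: Dict[str, str] = {}
    for url in urls:
        try:
            status = fetch(url)
        except HTTPError as exc:
            # urlopen raises HTTPError for 4xx/5xx responses.
            results[url] = f"http {exc.code}"
        except (URLError, OSError) as exc:
            # DNS failure, refused connection, TLS error, timeout, etc.
            results[url] = f"down ({exc.__class__.__name__})"
        else:
            results[url] = "up" if 200 <= status < 400 else f"http {status}"
    return results
```

Calling `check_nodes(NODE_SEARCH_URLS)` returns a dictionary keyed by URL, which could be printed or fed into whatever alerting is already in place.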

@sashakames
Collaborator

  • Turns out we didn't have a DoS; there is something peculiar with the SSL connection attempts between the backend and the API hosted on nimbus that seems to have brought the server to its knees.
  • Jason may not have the API back up until late Friday or Monday (according to his last communication), but that doesn't seem to prevent searches and downloads.
  • Now the issue is that there has been a bit of a drift in the versions, and from a UX standpoint, it's not great to have such divergence; for instance, results load automatically at LLNL.
  • That said, my understanding is that @bstrdsmkr will upgrade ORNL very soon.
  • @sturoscy-personal Where do we stand with ANL upgrades of Metagrid?
  • At CDNOT we can discuss international upgrades of frontends.

@durack1
Contributor Author

durack1 commented Jan 31, 2025

  • Jason may not have the API back up until late Friday or Monday (according to his last communication), but that doesn't seem to prevent searches and download.

@znichollscr I believe this is the current estimate for nimbus services to be back up and accepting logins - ref esgf-nimbus/nimbus#22

@sashakames
Collaborator

With Katharina out sick this week, no movement on the DKRZ upgrade.
