Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add mystery comment #588

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

fix: add mystery comment #588

wants to merge 1 commit into from

Conversation

achingbrain
Copy link
Member

@achingbrain achingbrain commented Dec 10, 2024

Not quite sure why this is necessary but it means the correct file ends up in the docker container when building from scratch.


Investigation from the Slack thread:

Ok, I’ve been digging into this and what I’ve found is that when there’s a cache miss, we build from scratch.

The v1.x and v2.x Makefiles both run docker build in the correct implementation directory:

I have a branch in the test-plans repo called fix/debug-js-build. In the branch it does the following:

  1. cat the test/fixtures/relay.js file before running docker build -
    cat $(PWD)/test/fixtures/relay.ts
  2. cat it during the build after it’s been added to the container -
    RUN cat test/fixtures/relay.ts

If I use this branch to run the ping interop test, the test/fixtures/relay.js file in the container does not have the same content as the one in the context directory (see connectionEncrypters vs connectionEncryption):

Very strange.

The only way I’ve been able to get the test/fixtures/relay.js file to have the correct content is to change it’s name so the v2.x file doesn’t have the same path as the v1.x file. I only get the renamed file in the docker image, not the one with the original name.

Also, if I remove the v1.x folder so only build the v2.x version, the file has the correct contents, so some cache must be getting reused between the v1.x and v2.x builds, though both build from the node:lts image without using any other base images.

I’m at a bit of a loss about how this is possible, perhaps someone with a bit more docker knowledge has some ideas?

@achingbrain achingbrain marked this pull request as ready for review December 10, 2024 08:30
@achingbrain
Copy link
Member Author

@p-shahi @MarcoPolo @galargh this is such a hack - do you have any idea why the wrong file ends up in the container without this?

The error it fixes is this sort of thing: https://github.com/libp2p/test-plans/actions/runs/12243585872/job/34153471788?pr=586#step:3:46235

@MarcoPolo
Copy link
Contributor

Thanks for debugging this. I'll try to tal today.

@p-shahi
Copy link
Member

p-shahi commented Dec 10, 2024

Trying a few things on a branch here: #589

@p-shahi
Copy link
Member

p-shahi commented Dec 11, 2024

I think the cache busting fixes this issue: https://github.com/libp2p/test-plans/actions/runs/12266838240/job/34225761983?pr=589 without needing to add the mystery comment
I don't know the root cause of this issue though

@MarcoPolo
Copy link
Contributor

Which cache are you busting? The testplans one or the docker one?

@MarcoPolo
Copy link
Contributor

Does this reproduce locally or only in CI?

@p-shahi
Copy link
Member

p-shahi commented Dec 11, 2024

I'm busting the cache only for the v2.x image builds. In case the problem is happening because of cached layers between the v1.x and v2.x images
I didn't try reproducing locally

@achingbrain
Copy link
Member Author

achingbrain commented Dec 11, 2024

Which cache are you busting? The testplans one or the docker one?

Docker - the problem only occurs when there's an S3 cache miss and this branch is hit. I guess #589 works because setting the env var before copying the files means the cache for the subsequent steps is invalidated.

FWIW I also tried the --no-cache option to docker build though it doesn't seem to help.

Pruning the layer cache before building also seems to fix the problem and is probably a better approach than this PR though I'm still not sure why it's necessary. It doesn't seem to affect build time much which is something at least.

Does this reproduce locally or only in CI?

I didn't try reproducing locally

Like all the best issues, I could only reproduce this in CI, locally everything works as expected 😭

@p-shahi
Copy link
Member

p-shahi commented Dec 16, 2024

@MarcoPolo @achingbrain since this is still causing issues in rust-libp2p and probably elsewhere, what should we do here. Merge #590 and create an issue with a reminder to investigate further?

@galargh do you have any thoughts?

@galargh
Copy link
Contributor

galargh commented Dec 17, 2024

Sorry I missed this earlier, I'll try to have a look later today.

@p-shahi
Copy link
Member

p-shahi commented Dec 23, 2024

@galargh any progress?

@galargh
Copy link
Contributor

galargh commented Dec 26, 2024

I'm honestly baffled by this 🤯

I wasn't able to reproduce it myself by loosely following the instructions outlined in the description. So I reverted to using exactly the same commits for js-libp2p and test-plans as in the failing run.

The failing run I was trying to reproduce: https://github.com/libp2p/js-libp2p/actions/runs/12235431303/attempts/8
My reproduction of that run: https://github.com/libp2p/js-libp2p/actions/runs/12497472405
The js-libp2p branch used: https://github.com/libp2p/js-libp2p/tree/libp2p-test-plans-pulls-588
The test-plans branch used: https://github.com/libp2p/test-plans/tree/libp2p-test-plans-pulls-588

Unfortunately, now it just works. Technically, it failed, but for an unrelated reason after making it past the point that caused the problems before.

The only thing that comes to mind is that somehow the context sent to the build of v1 gets mixed up with the context used during the build of v2. But how?! I don't think I've seen something like that happening before.

I've merged the layer pruning fix - #590 - as it seems like a more sensible option.

I'll definitely keep thinking about it, but for now I don't have any good ideas to follow to be honest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants