@Luap99
Member

@Luap99 Luap99 commented Oct 8, 2025

No description provided.

@github-actions github-actions bot added the storage Related to "storage" package label Oct 8, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 8, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6414

@Luap99
Member Author

Luap99 commented Oct 8, 2025

Podman PR containers/podman#27251 and the bot's buildah test PR containers/buildah#6414 both look good, so I think that means we can remove the special case from ApplyDiff() in overlay, ref containers/podman#25862 (comment)

I still need to work on the actual feature here though, i.e. extracting while the store is unlocked.

Contributor

@mtrmac mtrmac left a comment

ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)

@Luap99 Luap99 force-pushed the staged-layer-creation branch from 348a11e to b7780f2 Compare October 13, 2025 13:14
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 13, 2025
Contributor

@mtrmac mtrmac left a comment

I’m mostly looking because I was curious — feel free to disregard.

The tar-split comment might explain some of the “unexpected EOF” test failures.

applyDiffTemporaryDriver, ok := r.driver.(drivers.ApplyDiffStaging)
if ok && diff != nil {
// CRITICAL, this releases the lock so we can extract this unlocked
r.stopWriting()
Contributor

This kind of design rather worries me; it’s not transparent to callers who just see “// Requires startWriting.” in the documentation and assume that if they obtained a startWriting lock, their state will not change by the call to this create. It’s hard to reason about.

Conceptually, I think the overlay driver doesn’t really need to know the precise layer ID for a newly-created layer in order to determine the right getTempDirRoot, if this caller assures the driver that the ID is fresh and not conflicting with anything. (For image layers, the ID is deterministic, and we check that it doesn’t exist before trying to pull; but a concurrent process might create it before we finish, so conflicts can and do occur, and need to be carefully considered.) In such a design, I think most of the code in create before this point does not strictly need to run before the applyDiffUnlocked, but also I didn’t carefully read/check everything.

Member Author

I guess I might have been too focused on doing a proper quick ID and name conflict lookup first, before doing the expensive part, to "fail fast" when possible.
I guess design-wise it makes sense to push this all the way up the stack. I do agree that the unlock/lock pattern is quite dangerous and I have seen it fail too many times in podman already, so if we can avoid it then we should do that.

Contributor

I agree that we probably want the “ID already exists” check to exist when creating image layers — so on the substance of the thing, this might be ~exactly right already.

Shaping the call stack is a maintainability concern that is really only worth worrying about after the code works.

I’m mentioning this early mostly in the hope that it might avoid work on a “perfect” implementation of the current approach, some of which would need to be re-done afterwards; and because the “give me a staging directory for a future layer, I don’t know the ID yet” method would be a new concept not currently existing in the driver API.

Contributor

BTW we can make the current design work — rename create to createTemporarilyUnlockingLock or something like that.

@Luap99 Luap99 force-pushed the staged-layer-creation branch from b7780f2 to bbb2266 Compare October 15, 2025 14:59
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 15, 2025
@Luap99
Member Author

Luap99 commented Oct 15, 2025

@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much things break. Still seeing plenty of test failures.

Issue 1 I see is that I just use the 700 permission from the tmpdir due to the rename instead of the proper diff dir creation permissions that are in the driver.create() code

	diff := path.Join(dir, "diff")
	if err := idtools.MkdirAs(diff, forcedSt.Mode, forcedSt.IDs.UID, forcedSt.IDs.GID); err != nil {
		return err
	}

	if d.options.forceMask != nil {
		st.Mode |= os.ModeDir
		if err := idtools.SetContainersOverrideXattr(diff, st); err != nil {
			return err
		}
	}

Not sure if I should expose that into the tmpdir creation logic; I guess that makes the most sense, since only the driver should know the exact permissions that should be used?


The second problem I see is timeouts (in tests running in parallel), which I guess means I added a deadlock?
https://api.cirrus-ci.com/v1/artifact/task/5906611702595584/html/sys-podman-debian-13-rootless-host-sqlite.log.html

Looking at the code, I guess this unlock/lock-again thing I did is indeed completely broken and unsafe due to an ABBA deadlock: in putLayer we also hold the containerStore lock, so unlocking only the layer store makes it possible for another process to take the layer lock and then block on the still-held container store lock, leaving both processes hanging forever.
I haven't checked all the code paths, but I guess with the locking order requirements what I did is basically impossible to achieve anyway, and I do indeed have to move this out to before we get the lock?
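
A minimal sketch of the interleaving described above, assuming two plain `sync.Mutex` values as hypothetical stand-ins for the layer store and container store locks (the real stores use their own lock types). `TryLock` replaces every acquisition that would block, so the program reports the cycle rather than hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// stores holds hypothetical stand-ins for the two store locks.
type stores struct {
	layerLock     sync.Mutex
	containerLock sync.Mutex
}

// replayUnlockRelock replays the interleaving in a single goroutine and
// reports whether both "processes" end up waiting on each other.
func replayUnlockRelock(s *stores) (deadlocked bool) {
	// Process A: takes the layer lock, then the container lock, then drops
	// only the layer lock in order to extract the diff.
	s.layerLock.Lock()
	s.containerLock.Lock()
	s.layerLock.Unlock()

	// Process B: takes the now-free layer lock, then wants the container lock.
	s.layerLock.Lock()
	bGotContainer := s.containerLock.TryLock() // fails: A still holds it

	// Process A: extraction done, wants the layer lock back.
	aGotLayer := s.layerLock.TryLock() // fails: B holds it

	deadlocked = !bGotContainer && !aGotLayer

	// Release what is still held so the sketch exits cleanly.
	s.containerLock.Unlock() // held by A
	s.layerLock.Unlock()     // held by B
	return deadlocked
}

func main() {
	fmt.Println("deadlock cycle:", replayUnlockRelock(&stores{}))
}
```

With real blocking `Lock` calls in two processes, neither side could make progress, which would be consistent with the test timeouts. Doing the extraction before any lock is taken removes the window entirely.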

@mtrmac
Contributor

mtrmac commented Oct 15, 2025

Issue 1 I see is that I just use the 700 permission from the tmpdir due to the rename instead of the proper diff dir creation permissions that are in the driver.create() code

Not sure if I should expose that into the tmpdir creation logic; I guess that makes the most sense, since only the driver should know the exact permissions that should be used?

I think that could work.

I was thinking StageAddition does not actually need to create (os.Create/os.Mkdir) the tmpAddPath at all. All of that happens inside a lock-protected td.tempDirPath, so there is ~nothing special, that I can see, about populating tmpAddPath: the provided callback can create the staged item without any help. (That could also mean StageDirectoryAddition and StageFileAddition could be consolidated into one. And I’m not immediately sure we need a callback; StageAddition could return a newStagingPathToPopulate. But I also didn’t now carefully re-read the tempdir package.)


I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

Per the locking hierarchy documented at the top of store, I think you’re right here.

@Luap99
Member Author

Luap99 commented Oct 15, 2025

And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

Yeah, my thinking was that the callback provides a "lifetime" during which the path is safe to use; if I return a string/struct with the path, then the caller can cleanup/commit and still use the path afterwards. This is really where I start to hate Go, because in Rust this would be trivial to enforce so that there could only ever be one call to commit, which then renders the object useless afterwards.

But yes, usage-wise this callback is indeed getting quite ugly, to the point where just returning the path is much simpler and, well, how Go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one, so I will go with that.
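
A minimal sketch of the "return the path" shape, assuming a hypothetical `stagedAddition` type (the field and method names are illustrative, not the tempdir package API). Go cannot reject a second `Commit` at compile time, so the sketch falls back to a run-time guard.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errAlreadyCommitted = errors.New("staged addition already committed")

// stagedAddition is a hypothetical, simplified stand-in for the value a
// staging call could return instead of taking a callback.
type stagedAddition struct {
	Path      string // where the caller populates the content
	committed bool
}

// Commit renames the staged content to destination. A second call fails at
// run time because the type system cannot forbid it.
func (s *stagedAddition) Commit(destination string) error {
	if s.committed {
		return errAlreadyCommitted
	}
	if err := os.Rename(s.Path, destination); err != nil {
		return err
	}
	s.committed = true
	return nil
}

// commitTwice stages a directory, commits it, then tries to commit again.
func commitTwice() (first, second error) {
	root, err := os.MkdirTemp("", "staging-sketch-")
	if err != nil {
		return err, nil
	}
	defer os.RemoveAll(root)

	sa := &stagedAddition{Path: filepath.Join(root, "staged")}
	if err := os.Mkdir(sa.Path, 0o700); err != nil {
		return err, nil
	}
	first = sa.Commit(filepath.Join(root, "final"))
	second = sa.Commit(filepath.Join(root, "again"))
	return first, second
}

func main() {
	first, second := commitTwice()
	fmt.Println("first commit:", first)
	fmt.Println("second commit:", second)
}
```

The guard only catches a double commit; it does nothing about a caller that keeps using `Path` after the commit, which is the lifetime gap the comment above is complaining about.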

Add a new function to stage additions. This should be used to extract
the layer content into a temp directory without holding the storage
lock and then under the lock just rename the directory into the final
location to reduce the lock contention.

Signed-off-by: Paul Holzinger <[email protected]>
It is not clear to me when it will hit the code path there, by normal
layer creation we always pass a valid parent so this branch is never
reached AFAICT.

Let's remove it and see if all tests still pass in podman, buildah and
others...

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 force-pushed the staged-layer-creation branch from bbb2266 to 74d0e97 Compare October 22, 2025 18:42
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 22, 2025
@Luap99
Member Author

Luap99 commented Oct 22, 2025

@mtrmac I will push this into podman and run more tests tomorrow, but I think it should be workable like this now. I will of course fix the minor lint issues here on the next push. Let me know if this approach seems right to you; I guess the code could likely use some better comments/function names.

@Luap99 Luap99 force-pushed the staged-layer-creation branch from 74d0e97 to 5cf326c Compare October 23, 2025 10:31
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 23, 2025
Add a function to apply the diff into a temporary directory so we can do
that unlocked and only rename under the lock.

Signed-off-by: Paul Holzinger <[email protected]>
I cannot see any reason why we should buffer the full tar split content
in memory before writing it. That layer is still marked partial at this
point and the store is locked, so there is no concurrent access either;
thus we do not need the atomic rename here.

Signed-off-by: Paul Holzinger <[email protected]>
Split it into multiple functions to make it reusable without having a
layer and so that it can be used unlocked; see the following commits.

Signed-off-by: Paul Holzinger <[email protected]>
The extracting of the tar under the store lock is a bottleneck as many
concurrent processes might hold the locks for a long time on big layers.

To address this, move the layer extraction before we take the locks if
possible. Currently this only works when using the overlay driver as the
implementation requires driver-specific details in order for a rename()
to work.

Signed-off-by: Paul Holzinger <[email protected]>
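
A minimal sketch of the "extract unlocked, rename under the lock" idea from the commit message above. The `store` type, its `sync.Mutex`, and `putLayer` are hypothetical stand-ins; the real code goes through the layer store locks and the overlay driver.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// store is a hypothetical stand-in: mu plays the role of the layer store
// lock and root the role of the driver's home directory.
type store struct {
	mu   sync.Mutex
	root string
}

// putLayer populates a staging directory without the lock, then takes the
// lock only for the conflict check and the rename.
func (s *store) putLayer(id string, files map[string]string) error {
	// Staging lives under root so the rename stays on one filesystem.
	staging, err := os.MkdirTemp(s.root, "staging-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(staging) // nothing left to remove after a successful rename

	for name, content := range files { // the slow part, done unlocked
		if err := os.WriteFile(filepath.Join(staging, name), []byte(content), 0o644); err != nil {
			return err
		}
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	final := filepath.Join(s.root, id)
	if _, err := os.Stat(final); err == nil {
		// Another writer created the layer while this one was staging.
		return fmt.Errorf("layer %q: %w", id, os.ErrExist)
	} else if !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return os.Rename(staging, final)
}

// putSameLayerTwice reports whether the first put succeeded and whether the
// second one was rejected as a duplicate.
func putSameLayerTwice() (created, conflict bool) {
	root, err := os.MkdirTemp("", "store-sketch-")
	if err != nil {
		return false, false
	}
	defer os.RemoveAll(root)

	s := &store{root: root}
	files := map[string]string{"hello.txt": "hello"}
	created = s.putLayer("layer1", files) == nil
	conflict = errors.Is(s.putLayer("layer1", files), os.ErrExist)
	return created, conflict
}

func main() {
	created, conflict := putSameLayerTwice()
	fmt.Println("created:", created, "conflict on second put:", conflict)
}
```

Note that `os.MkdirTemp` creates the directory with mode 0700 and the rename keeps that mode, so a real implementation would still need the driver to decide the final permissions of the staged directory.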
@Luap99 Luap99 force-pushed the staged-layer-creation branch from 5cf326c to e60d339 Compare October 23, 2025 12:15
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 23, 2025
@Luap99
Member Author

Luap99 commented Oct 23, 2025

OK, one last issue I noticed in podman: I fear the idmapping logic cannot be implemented unlocked.

We have this code in putLayer()

	if options.HostUIDMapping {
		options.UIDMap = nil
	}
	if options.HostGIDMapping {
		options.GIDMap = nil
	}
	uidMap := options.UIDMap
	gidMap := options.GIDMap
	if parent != "" {
		var ilayer *Layer
		for _, l := range append([]roLayerStore{rlstore}, rlstores...) {
			lstore := l
			if lstore != rlstore {
				if err := lstore.startReading(); err != nil {
					return nil, -1, err
				}
				defer lstore.stopReading()
			}
			if l, err := lstore.Get(parent); err == nil && l != nil {
				ilayer = l
				parent = ilayer.ID
				break
			}
		}
		if ilayer == nil {
			return nil, -1, ErrLayerUnknown
		}
		parentLayer = ilayer

		if err := s.containerStore.startWriting(); err != nil {
			return nil, -1, err
		}
		defer s.containerStore.stopWriting()
		containers, err := s.containerStore.Containers()
		if err != nil {
			return nil, -1, err
		}
		for _, container := range containers {
			if container.LayerID == parent {
				return nil, -1, ErrParentIsContainer
			}
		}
		if !options.HostUIDMapping && len(options.UIDMap) == 0 {
			uidMap = ilayer.UIDMap
		}
		if !options.HostGIDMapping && len(options.GIDMap) == 0 {
			gidMap = ilayer.GIDMap
		}
	} else {
		// FIXME? It’s unclear why we are holding containerStore locked here at all
		// (and because we are not modifying it, why it is a write lock, not a read lock).
		if err := s.containerStore.startWriting(); err != nil {
			return nil, -1, err
		}
		defer s.containerStore.stopWriting()

		if !options.HostUIDMapping && len(options.UIDMap) == 0 {
			uidMap = s.uidMap
		}
		if !options.HostGIDMapping && len(options.GIDMap) == 0 {
			gidMap = s.gidMap
		}
	}
	if s.canUseShifting(uidMap, gidMap) {
		options.IDMappingOptions = types.IDMappingOptions{HostUIDMapping: true, HostGIDMapping: true, UIDMap: nil, GIDMap: nil}
	} else {
		options.IDMappingOptions = types.IDMappingOptions{
			HostUIDMapping: options.HostUIDMapping,
			HostGIDMapping: options.HostGIDMapping,
			UIDMap:         copySlicePreferringNil(uidMap),
			GIDMap:         copySlicePreferringNil(gidMap),
		}
	}

However, we extract the layer with the caller-specified layerOptions.UIDMap and layerOptions.GIDMap in stageWithUnlockedStore() before this code gets to run.

I guess the s.uidMap case could be done without a lock but not the parent lookups? So I guess the simple solution would be to only use the unlocked extract path when options.HostGIDMapping is true.

@mtrmac
Contributor

mtrmac commented Oct 23, 2025

(I didn’t look at the current code in this PR yet.) AFAICS layer.[UG]IDMap never changes once the layer is created, so it should be safe to look up the parent, read these values, and then unlock.

IIRC the plan was to start layer creation with an ID lookup (so that we don’t start expensively staging it if it already exists), so the parent lookup could be done within the same lock scope.
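
A minimal sketch of that single lock scope, assuming simplified, hypothetical `layer` and `layerStore` types with an in-memory map (not the real storage types): the ID conflict check and the parent mapping lookup happen under one read lock, and copies of the maps are returned for use after unlocking.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// idMap, layer and layerStore are simplified, hypothetical stand-ins.
type idMap struct{ ContainerID, HostID, Size int }

type layer struct {
	ID             string
	UIDMap, GIDMap []idMap
}

type layerStore struct {
	mu     sync.RWMutex
	layers map[string]*layer
}

var (
	errDuplicateID  = errors.New("layer ID already in use")
	errLayerUnknown = errors.New("parent layer not known")
)

// lookupBeforeStaging checks for an ID conflict and reads the parent's
// mappings under a single read lock, before any expensive staging starts.
func (s *layerStore) lookupBeforeStaging(id, parent string) (uidMap, gidMap []idMap, err error) {
	s.mu.RLock()
	defer s.mu.RUnlock()

	if _, exists := s.layers[id]; exists {
		return nil, nil, errDuplicateID // fail fast, nothing staged yet
	}
	if parent == "" {
		return nil, nil, nil
	}
	p, ok := s.layers[parent]
	if !ok {
		return nil, nil, errLayerUnknown
	}
	// Copy so the caller never aliases store-owned slices after unlocking.
	return append([]idMap(nil), p.UIDMap...), append([]idMap(nil), p.GIDMap...), nil
}

func main() {
	s := &layerStore{layers: map[string]*layer{
		"parent": {ID: "parent", UIDMap: []idMap{{ContainerID: 0, HostID: 100000, Size: 65536}}},
	}}
	uidMap, gidMap, err := s.lookupBeforeStaging("child", "parent")
	fmt.Println(uidMap, gidMap, err)
}
```

The early ID check only avoids wasted work. Since another process could create the same ID while staging runs, the check would presumably have to be repeated under the write lock before the final rename.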

Mappings: idtools.NewIDMappingsFromMaps(layerOptions.UIDMap, layerOptions.GIDMap),
// FIXME: What to do here? We have no lock and assigned label yet.
// Overlayfs should not need it anyway so this seems fine for now.
MountLabel: "",
Member

I think this is correct

return os.OpenFile(r.tspath(layerID), os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
}

// newMaybeStagedLayerExtraction initlaizes a new maybeStagedLayerExtraction. The caller
Member

typo in initlaizes


sa, err := t.StageAddition()
if err != nil {
return nil, nil, -1, err
Member

missing t.Cleanup?

return err
}
defer tarSplitFile.Close()
tarSplitWriter := pools.BufioWriter32KPool.Get(tarSplitFile)
Member

same comment as above

defragmented = io.TeeReader(defragmented, compressedCounter)

tsdata := bytes.Buffer{}
tarSplitWriter := pools.BufioWriter32KPool.Get(tarSplitFile)
Member

when will it be released with Put?
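
For illustration, a minimal sketch of pairing every Get with a deferred Put. The `writerPool` type here is hypothetical, built on `sync.Pool`; it only mimics the Get/Put shape seen in the snippet and is not the pools package used by the PR.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"sync"
)

// writerPool is a hypothetical pool of *bufio.Writer built on sync.Pool.
type writerPool struct{ pool sync.Pool }

func newWriterPool(size int) *writerPool {
	return &writerPool{pool: sync.Pool{
		New: func() any { return bufio.NewWriterSize(nil, size) },
	}}
}

// Get returns a pooled writer redirected at w.
func (p *writerPool) Get(w io.Writer) *bufio.Writer {
	bw := p.pool.Get().(*bufio.Writer)
	bw.Reset(w)
	return bw
}

// Put drops the reference to the underlying writer and recycles the buffer.
func (p *writerPool) Put(bw *bufio.Writer) {
	bw.Reset(nil)
	p.pool.Put(bw)
}

// writeThroughPool flushes and returns the writer on every exit path.
func writeThroughPool(p *writerPool, dst io.Writer, data []byte) (err error) {
	bw := p.Get(dst)
	defer func() {
		if flushErr := bw.Flush(); err == nil {
			err = flushErr
		}
		p.Put(bw)
	}()
	_, err = bw.Write(data)
	return err
}

// pooledWriteResult writes a short payload through the pool into a buffer.
func pooledWriteResult() string {
	var out bytes.Buffer
	if err := writeThroughPool(newWriterPool(32*1024), &out, []byte("tar-split data")); err != nil {
		return "error: " + err.Error()
	}
	return out.String()
}

func main() {
	fmt.Println(pooledWriteResult())
}
```

Putting the Flush and the Put in one deferred function keeps them together on error paths, so the buffer is returned even when a write fails partway.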

Contributor

@mtrmac mtrmac left a comment

Reading this commit by commit, this looks really great — the comments about documenting locking semantics etc. are basically the final polish.

(Note to self: Eventually it might be worth re-reading the final state as is, to check whether there is any opportunity to simplify.)

Around #378 (comment) and more recently with the parent’s mapping there was some tentative discussion about checking whether the layer exists before deciding to stage it — that’s still to be decided, I think. (In c/image, commitLayer does do a layer presence check before deciding to create it — but in case it is reusing an existing local layer by extracting it into a temporary tarball to be applied, there is still quite a window in which the layer could be concurrently created. Of course, c/image can add one more check to its caller — but if we happened to take locks to read the parent’s state, a lookup for an ID already existing would be ~free.)

}

// StageAddition creates a new temporary path that is returned as field in the StageAddition
// struct. The returned type has a type a the Commit() function to move the content from
Contributor

typo

Comment on lines +221 to +222
// The caller MUST ensure .Cleanup() is called after Commit() otherwise the staged content
// will be deleted and the move will fail.
Contributor

This doesn’t look correct: After a Commit, the staged content is moved away and nothing happens to it here.

Member Author

Maybe the wording is confusing; what I tried to say is that Commit() must be called before Cleanup() on the TempDir, otherwise the staged content is removed and the commit fails. Oh, how I would love to have this written with Rust lifetimes.


// StageAddition is a temporary object which holds the information of where to
// put the data into and then use Commit() to move the data into the final location.
type StageAddition struct {
Contributor

(Maybe StagedAddition)


// CommitFunc is a function type that can be returned by operations
// which need to perform the commit operation later.
type CommitFunc func(destination string) error
Contributor

(Is anything using this?)

Member Author

no, it was leftover from a previous iteration

}

// ApplyDiff applies the new layer into a root
func (d *Driver) ApplyDiff(id, parent string, options graphdriver.ApplyDiffOpts) (size int64, err error) {
Contributor

(Absolutely non-blocking: BTW it seems that nothing is using the parent field any more… we could drop it, OTOH the API stability promise of drivers.Driver is … undefined … or not strictly followed. Maybe worth at least adding a comment in the Driver interface.)

}
}()
// FIXME: type case should be safe for now but really there should be a better way to do this
err = m.stageWithUnlockedStore(rlstore.(*layerStore), lOptions)
Contributor

A dumb and inelegant way to do this would be to switch the parameters, rlstore.stage…(m, …).

(I also wouldn’t mind too much getting rid of the interfaces entirely… OTOH I was searching for some way to make the layer store kind and lock holding state visible to the type system — I couldn’t find a practical one with Go; still, getting rid of the interfaces would be a move in the opposite direction.)

var slo *stagedLayerOptions

if diff != nil {
m := newMaybeStagedLayerExtraction(diff, s.graphDriver)
Contributor

Formally, this is racy WRT updates of s.graphDriver … while using rlstore.driver is not.

// CommitFunc is only set when there is no error returned and the int64 value returns the size of the layer.
//
// This API is experimental and can be changed without bumping the major version number.
func (d *Driver) StartStagingDiffToApply(options graphdriver.ApplyDiffOpts) (tempdir.CleanupTempDirFunc, *tempdir.StageAddition, int64, error) {
Contributor

A warning that this can run concurrently with any other operations on the driver would be nice (both here and in the interface definition).

}

// ApplyDiff applies the new layer into a root
func (d *Driver) applyDiff(target string, options graphdriver.ApplyDiffOpts) (size int64, err error) {
Contributor

A warning that this can run concurrently with any other operations on the driver would be nice.

… and that might motivate auditing and documenting which fields of overlay.Driver are immutable after construction.

}
}()
// FIXME: type case should be safe for now but really there should be a better way to do this
err = m.stageWithUnlockedStore(rlstore.(*layerStore), lOptions)
Contributor

#378 (comment) talks about ID mapping values from the parent layer — if I’m reading the code right, that does not yet happen.
