WIP: Staged layer creation #378
base: main
✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6414
Podman PR containers/podman#27251 and the buildah test PR containers/buildah#6414 from the bot both look good, so I think that means we can remove the special case from ApplyDiff() in overlay; ref containers/podman#25862 (comment). I still need to work on the actual feature here, though: extracting while the store is unlocked.
ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)
348a11e to b7780f2
I’m mostly looking because I was curious — feel free to disregard.
The tar-split comment might explain some of the “unexpected EOF” test failures.
storage/layers.go
Outdated
| applyDiffTemporaryDriver, ok := r.driver.(drivers.ApplyDiffStaging) | ||
| if ok && diff != nil { | ||
| // CRITICAL, this releases the lock so we can extract this unlocked | ||
| r.stopWriting() |
This kind of design rather worries me; it’s not transparent to callers who just see “// Requires startWriting.” in the documentation and assume that if they obtained a startWriting lock, their state will not change by the call to this create. It’s hard to reason about.
Conceptually, I think the overlay driver doesn’t really need to know the precise layer ID for a newly-created layer in order to determine the right getTempDirRoot, if this caller assures the driver that the ID is fresh and not conflicting with anything. (For image layers, the ID is deterministic, and we check that it doesn’t exist before trying to pull; but a concurrent process might create it before we finish, so conflicts can and do occur, and need to be carefully considered.) In such a design, I think most of the code in create before this point does not strictly need to run before the applyDiffUnlocked, but also I didn’t carefully read/check everything.
I guess I might have been too focused on doing a proper quick ID and name conflict lookup first, before doing the expensive lookup, to "fail fast" when possible.
Design-wise, I guess it makes sense to push this all the way up the stack. I do agree that the unlock/lock pattern is quite dangerous, and I have seen it fail too many times in podman already, so if we can avoid it then we should.
I agree that we probably want the “ID already exists” check to exist when creating image layers — so on the substance of the thing, this might be ~exactly right already.
Shaping the call stack is a maintainability concern that is really only worth worrying about after the code works.
I’m mentioning this early mostly in hope that it might avoid work on “perfect” implementation of the current approach, some of which would need to be re-done afterwards; and because the “give me a staging directory for a future layer, I don’t know the ID yet” method would be a new concept not currently existing in the driver API.
BTW we can make the current design work — rename create to createTemporarilyUnlockingLock or something like that.
b7780f2 to bbb2266
@mtrmac FYI I have not really addressed most of your comments yet; I am just pushing things to see how much breaks. I am still seeing plenty of test failures.
The first issue I see is that, because of the rename, I end up with the 700 permission from the tmpdir instead of the proper diff dir creation permissions that are in the driver.create() code. I am not sure if I should expose that in the tmpdir creation logic; I guess that makes the most sense, since only the driver should know the exact permission that should be used?
The second problem I see is timeouts (in tests running in parallel), which I guess means I added a deadlock situation? Looking at the code, this unlock/lock-again thing I did is indeed completely broken and unsafe due to an ABBA deadlock: in putLayer we also hold the containerStore lock, so unlocking only the layer store makes it possible for another process to get the layer lock and then block on the still-held container store lock, leaving both processes hanging forever.
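The interleaving described above can be replayed in a small Go sketch. It is only an illustration: two in-process `sync.Mutex` values stand in for the layer store and container store locks (the real ones are shared between processes), the function name `abbaState` is made up, and `TryLock` is used so the program reports the stuck state instead of actually hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// abbaState replays the problematic interleaving step by step. Both "processes"
// are simulated sequentially; TryLock shows which acquisitions would block.
func abbaState() (aBlocked, bBlocked bool) {
	var layerLock, containerLock sync.Mutex

	// Process A: takes the layer lock, then the container lock.
	layerLock.Lock()
	containerLock.Lock()

	// Process A: drops only the layer lock to extract the layer unlocked.
	layerLock.Unlock()

	// Process B: obtains the layer lock, then wants the container lock.
	layerLock.Lock()
	bBlocked = !containerLock.TryLock() // A still holds the container lock

	// Process A: extraction finished, wants the layer lock back.
	aBlocked = !layerLock.TryLock() // B now holds the layer lock

	return aBlocked, bBlocked
}

func main() {
	a, b := abbaState()
	fmt.Printf("A blocked: %v, B blocked: %v\n", a, b)
}
```

With blocking `Lock()` calls in place of `TryLock()`, neither side could ever make progress, which would match the timeouts seen in the parallel tests.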
I think that could work. I was thinking
Per the locking hierarchy documented at the top of |
Yeah, my thinking was that the callback provides a "lifetime" during which the path is safe to use; if I return a string/struct with the path, then the caller can clean up or commit and still use the path afterwards. This is really where I start to hate Go, because in Rust it would be trivial to enforce that there can only ever be one call to commit, which then renders the object useless afterwards. But yes, usage-wise this callback is indeed getting quite ugly, to the point where just returning the path is much simpler and, well, how Go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one, so I will go with that.
Add a new function to stage additions. This should be used to extract the layer content into a temp directory without holding the storage lock and then under the lock just rename the directory into the final location to reduce the lock contention. Signed-off-by: Paul Holzinger <[email protected]>
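As a rough illustration of the idea in this commit message, the following Go sketch fills a staging directory without holding a lock and takes the lock only for the final rename. It is not the code from this PR: `stageAndCommit`, `runStagingDemo`, the `staging-` prefix and the `0o755` mode are all assumptions made for the example, and a `sync.Mutex` stands in for the store lock.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// stageAndCommit writes content into a staging directory while mu is not held
// and takes mu only for the final rename. All names here are hypothetical.
func stageAndCommit(mu *sync.Mutex, root, name string, content []byte) (string, error) {
	// Stage under root so that source and destination share a filesystem,
	// which rename() requires.
	staging, err := os.MkdirTemp(root, "staging-")
	if err != nil {
		return "", err
	}
	// Harmless after a successful rename: the staging path no longer exists.
	defer os.RemoveAll(staging)

	// MkdirTemp creates the directory with mode 0700; set the mode the final
	// directory should have before moving it (0o755 is an assumed value).
	if err := os.Chmod(staging, 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(staging, "data"), content, 0o644); err != nil {
		return "", err
	}

	final := filepath.Join(root, name)
	mu.Lock()
	defer mu.Unlock()
	if err := os.Rename(staging, final); err != nil {
		return "", err
	}
	return final, nil
}

// runStagingDemo exercises stageAndCommit in a scratch directory.
func runStagingDemo() error {
	root, err := os.MkdirTemp("", "store-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)

	var mu sync.Mutex
	final, err := stageAndCommit(&mu, root, "layer-1", []byte("payload"))
	if err != nil {
		return err
	}
	got, err := os.ReadFile(filepath.Join(final, "data"))
	if err != nil {
		return err
	}
	if string(got) != "payload" {
		return fmt.Errorf("unexpected content %q", got)
	}
	return nil
}

func main() {
	if err := runStagingDemo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("staged content committed")
}
```

The explicit `Chmod` reflects the permission problem mentioned earlier in the thread: a directory created by a temp-dir helper keeps its restrictive mode after the rename unless the mode is set deliberately.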
It is not clear to me when this code path would be hit; in normal layer creation we always pass a valid parent, so this branch is never reached AFAICT. Let's remove it and see if all tests still pass in podman, buildah and others... Signed-off-by: Paul Holzinger <[email protected]>
bbb2266 to 74d0e97
@mtrmac I will push this into podman and run more tests tomorrow, but I think it should be workable like this now. I will of course fix the minor lint issues here on the next push. Let me know if this approach seems right to you; I guess the code could likely use some better comments and function names.
74d0e97 to 5cf326c
Add a function to apply the diff into a temporary directory so we can do that unlocked and only rename under the lock. Signed-off-by: Paul Holzinger <[email protected]>
I cannot see any reason why we should buffer the full tar-split content in memory before writing it. The layer is still marked partial at this point and the store is locked, so there is no concurrent access either; thus we do not need the atomic rename here. Signed-off-by: Paul Holzinger <[email protected]>
Split it into multiple functions to make it reusable without having a layer and so that it can be used unlocked; see the following commits. Signed-off-by: Paul Holzinger <[email protected]>
Extracting the tar under the store lock is a bottleneck, as many concurrent processes might hold the locks for a long time on big layers. To address this, move the layer extraction before we take the locks if possible. Currently this only works when using the overlay driver, as the implementation requires driver-specific details in order for a rename() to work. Signed-off-by: Paul Holzinger <[email protected]>
5cf326c to e60d339
OK, one last issue I noticed in podman: I fear the idmapping logic cannot be implemented unlocked. We have this code in putLayer():

    if options.HostUIDMapping {
        options.UIDMap = nil
    }
    if options.HostGIDMapping {
        options.GIDMap = nil
    }
    uidMap := options.UIDMap
    gidMap := options.GIDMap
    if parent != "" {
        var ilayer *Layer
        for _, l := range append([]roLayerStore{rlstore}, rlstores...) {
            lstore := l
            if lstore != rlstore {
                if err := lstore.startReading(); err != nil {
                    return nil, -1, err
                }
                defer lstore.stopReading()
            }
            if l, err := lstore.Get(parent); err == nil && l != nil {
                ilayer = l
                parent = ilayer.ID
                break
            }
        }
        if ilayer == nil {
            return nil, -1, ErrLayerUnknown
        }
        parentLayer = ilayer
        if err := s.containerStore.startWriting(); err != nil {
            return nil, -1, err
        }
        defer s.containerStore.stopWriting()
        containers, err := s.containerStore.Containers()
        if err != nil {
            return nil, -1, err
        }
        for _, container := range containers {
            if container.LayerID == parent {
                return nil, -1, ErrParentIsContainer
            }
        }
        if !options.HostUIDMapping && len(options.UIDMap) == 0 {
            uidMap = ilayer.UIDMap
        }
        if !options.HostGIDMapping && len(options.GIDMap) == 0 {
            gidMap = ilayer.GIDMap
        }
    } else {
        // FIXME? It’s unclear why we are holding containerStore locked here at all
        // (and because we are not modifying it, why it is a write lock, not a read lock).
        if err := s.containerStore.startWriting(); err != nil {
            return nil, -1, err
        }
        defer s.containerStore.stopWriting()
        if !options.HostUIDMapping && len(options.UIDMap) == 0 {
            uidMap = s.uidMap
        }
        if !options.HostGIDMapping && len(options.GIDMap) == 0 {
            gidMap = s.gidMap
        }
    }
    if s.canUseShifting(uidMap, gidMap) {
        options.IDMappingOptions = types.IDMappingOptions{HostUIDMapping: true, HostGIDMapping: true, UIDMap: nil, GIDMap: nil}
    } else {
        options.IDMappingOptions = types.IDMappingOptions{
            HostUIDMapping: options.HostUIDMapping,
            HostGIDMapping: options.HostGIDMapping,
            UIDMap:         copySlicePreferringNil(uidMap),
            GIDMap:         copySlicePreferringNil(gidMap),
        }
    }

However we extract the layer with the caller specified I guess the s.uidMap case could be done without a lock but not the parent lookups? So I guess the simple solution would be to only use the unlocked extract path when options.HostGIDMapping is true.
(I didn’t look at the current code in this PR yet.) AFAICS IIRC the plan was to start layer creation with an ID lookup (so that we don’t start expensively staging it if it already exists), so the parent lookup could be done within the same lock scope.
| Mappings: idtools.NewIDMappingsFromMaps(layerOptions.UIDMap, layerOptions.GIDMap), | ||
| // FIXME: What to do here? We have no lock and assigned label yet. | ||
| // Overlayfs should not need it anyway so this seems fine for now. | ||
| MountLabel: "", |
I think this is correct
| return os.OpenFile(r.tspath(layerID), os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600) | ||
| } | ||
|
|
||
| // newMaybeStagedLayerExtraction initlaizes a new maybeStagedLayerExtraction. The caller |
typo in initlaizes
|
|
||
| sa, err := t.StageAddition() | ||
| if err != nil { | ||
| return nil, nil, -1, err |
missing t.Cleanup?
| return err | ||
| } | ||
| defer tarSplitFile.Close() | ||
| tarSplitWriter := pools.BufioWriter32KPool.Get(tarSplitFile) |
same comment as above
| defragmented = io.TeeReader(defragmented, compressedCounter) | ||
|
|
||
| tsdata := bytes.Buffer{} | ||
| tarSplitWriter := pools.BufioWriter32KPool.Get(tarSplitFile) |
when will it be released with Put?
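One common way to make the release explicit is to pair `Get` with a deferred `Put` in the same function. The Go sketch below shows that shape while streaming data straight to the destination instead of buffering it all in memory first. It uses a plain `sync.Pool` as a stand-in and does not claim to match the `pools.BufioWriter32KPool` API; `writerPool`, `streamTo` and `demoPooledWrite` are hypothetical names.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// writerPool is a stand-in pool of 32 KiB buffered writers.
var writerPool = sync.Pool{
	New: func() any { return bufio.NewWriterSize(io.Discard, 32*1024) },
}

// streamTo copies src to dst through a pooled buffered writer. The writer is
// returned to the pool on every path, including error paths.
func streamTo(dst io.Writer, src io.Reader) (n int64, err error) {
	w := writerPool.Get().(*bufio.Writer)
	w.Reset(dst)
	defer func() {
		// Drop the reference to dst before handing the writer back.
		w.Reset(io.Discard)
		writerPool.Put(w)
	}()

	n, err = io.Copy(w, src)
	if err != nil {
		return n, err
	}
	// Flush before the deferred Put, otherwise buffered bytes would be lost.
	return n, w.Flush()
}

// demoPooledWrite streams a short string into a buffer and returns the result.
func demoPooledWrite() string {
	var out bytes.Buffer
	if _, err := streamTo(&out, strings.NewReader("tar-split data")); err != nil {
		return "error: " + err.Error()
	}
	return out.String()
}

func main() {
	fmt.Println(demoPooledWrite()) // prints "tar-split data"
}
```

The important ordering is flush first, then reset and put back; a writer returned to the pool while still holding unflushed data could silently truncate the output.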
Reading this commit by commit, this looks really great — the comments about documenting locking semantics etc. are basically the final polish.
(Note to self: Eventually it might be worth re-reading the final state as is, to check whether there is any opportunity to simplify.)
Around #378 (comment) and more recently with the parent’s mapping there was some tentative discussion about checking whether the layer exists before deciding to stage it — that’s still to be decided, I think. (In c/image, commitLayer does do a layer presence check before deciding to create it — but in case it is reusing an existing local layer by extracting it into a temporary tarball to be applied, there is still quite a window in which the layer could be concurrently created. Of course, c/image can add one more check to its caller — but if we happened to take locks to read the parent’s state, a lookup for an ID already existing would be ~free.)
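A minimal Go sketch of the "check before staging" idea discussed here, using an in-memory map instead of the real layer store: the ID and parent are checked in one lock scope, staging runs unlocked, and the ID is checked again at commit time because another writer may have created the layer in between. All types and names (`layerStore`, `precheck`, `commit`, `putLayer`) are hypothetical and simplified; for instance, concurrent removal of the parent is not handled.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errDuplicateID   = errors.New("layer ID already exists")
	errUnknownParent = errors.New("parent layer not known")
)

// layerStore is a toy store: layer ID -> parent ID, guarded by one mutex.
type layerStore struct {
	mu     sync.Mutex
	parent map[string]string
}

// precheck does the cheap ID and parent lookups within a single lock scope.
func (s *layerStore) precheck(id, parent string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.parent[id]; ok {
		return errDuplicateID
	}
	if parent != "" {
		if _, ok := s.parent[parent]; !ok {
			return errUnknownParent
		}
	}
	return nil
}

// commit re-checks the ID under the lock, since the store was unlocked
// while staging ran.
func (s *layerStore) commit(id, parent string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.parent[id]; ok {
		return errDuplicateID
	}
	s.parent[id] = parent
	return nil
}

// putLayer fails fast, stages without the lock, then commits under the lock.
func (s *layerStore) putLayer(id, parent string, stage func() error) error {
	if err := s.precheck(id, parent); err != nil {
		return err
	}
	if err := stage(); err != nil {
		return err
	}
	return s.commit(id, parent)
}

func runPrecheckDemo() error {
	s := &layerStore{parent: map[string]string{"base": ""}}
	noop := func() error { return nil }

	if err := s.putLayer("l1", "base", noop); err != nil {
		return err
	}
	if err := s.putLayer("l1", "base", noop); !errors.Is(err, errDuplicateID) {
		return fmt.Errorf("duplicate: got %v", err)
	}
	if err := s.putLayer("l2", "missing", noop); !errors.Is(err, errUnknownParent) {
		return fmt.Errorf("unknown parent: got %v", err)
	}
	// Simulate another writer creating the same layer while we are staging.
	racing := func() error { return s.commit("l3", "base") }
	if err := s.putLayer("l3", "base", racing); !errors.Is(err, errDuplicateID) {
		return fmt.Errorf("race: got %v", err)
	}
	return nil
}

func main() {
	fmt.Println("demo error:", runPrecheckDemo())
}
```

The early check only avoids wasted staging work; the check at commit time is what actually protects against the concurrent-creation window.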
| } | ||
|
|
||
| // StageAddition creates a new temporary path that is returned as field in the StageAddition | ||
| // struct. The returned type has a type a the Commit() function to move the content from |
typo
| // The caller MUST ensure .Cleanup() is called after Commit() otherwise the staged content | ||
| // will be deleted and the move will fail. |
This doesn’t look correct: After a Commit, the staged content is moved away and nothing happens to it here.
Maybe the wording is confusing; what I tried to say is that Commit() must be called before Cleanup() on the TempDir, otherwise the staged content is removed and the commit fails. Oh, how I would love to have this written with Rust lifetimes.
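Go cannot enforce this ordering at compile time, but a runtime guard can at least turn a wrong call order into a clear error. The sketch below is a hypothetical stand-in, not the `tempdir` package from this PR: `stagedDir`, `errFinished` and `runCommitOrderDemo` are invented names, and the type is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errFinished = errors.New("staged directory was already committed or cleaned up")

// stagedDir tracks whether its path has been consumed by Commit or Cleanup.
type stagedDir struct {
	path     string
	finished bool
}

// Commit moves the staged directory to dest. It may succeed at most once and
// fails if Cleanup already ran.
func (s *stagedDir) Commit(dest string) error {
	if s.finished {
		return errFinished
	}
	if err := os.Rename(s.path, dest); err != nil {
		return err
	}
	s.finished = true
	return nil
}

// Cleanup removes the staged directory unless it was already committed.
func (s *stagedDir) Cleanup() error {
	if s.finished {
		return nil
	}
	s.finished = true
	return os.RemoveAll(s.path)
}

func runCommitOrderDemo() error {
	root, err := os.MkdirTemp("", "stage-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)

	// Correct order: Commit first, then Cleanup (which becomes a no-op).
	good := &stagedDir{path: filepath.Join(root, "a")}
	if err := os.Mkdir(good.path, 0o700); err != nil {
		return err
	}
	if err := good.Commit(filepath.Join(root, "final-a")); err != nil {
		return err
	}
	if err := good.Cleanup(); err != nil {
		return err
	}
	if _, err := os.Stat(filepath.Join(root, "final-a")); err != nil {
		return err
	}

	// Wrong order: Cleanup first, so the later Commit must be rejected.
	bad := &stagedDir{path: filepath.Join(root, "b")}
	if err := os.Mkdir(bad.path, 0o700); err != nil {
		return err
	}
	if err := bad.Cleanup(); err != nil {
		return err
	}
	if err := bad.Commit(filepath.Join(root, "final-b")); !errors.Is(err, errFinished) {
		return fmt.Errorf("expected errFinished, got %v", err)
	}
	return nil
}

func main() {
	fmt.Println("demo error:", runCommitOrderDemo())
}
```

This makes the documented contract checkable at runtime, which is weaker than what ownership rules would give but still catches the misuse early.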
|
|
||
| // StageAddition is a temporary object which holds the information of where to | ||
| // put the data into and then use Commit() to move the data into the final location. | ||
| type StageAddition struct { |
(Maybe StagedAddition)
|
|
||
| // CommitFunc is a function type that can be returned by operations | ||
| // which need to perform the commit operation later. | ||
| type CommitFunc func(destination string) error |
(Is anything using this?)
no, it was leftover from a previous iteration
| } | ||
|
|
||
| // ApplyDiff applies the new layer into a root | ||
| func (d *Driver) ApplyDiff(id, parent string, options graphdriver.ApplyDiffOpts) (size int64, err error) { |
(Absolutely non-blocking: BTW it seems that nothing is using the parent field any more… we could drop it, OTOH the API stability promise of drivers.Driver is … undefined … or not strictly followed. Maybe worth at least adding a comment in the Driver interface.)
| } | ||
| }() | ||
| // FIXME: type case should be safe for now but really there should be a better way to do this | ||
| err = m.stageWithUnlockedStore(rlstore.(*layerStore), lOptions) |
A dumb and inelegant way to do this would be to switch the parameters, rlstore.stage…(m, …).
(I also wouldn’t mind too much getting rid of the interfaces entirely… OTOH I was searching for some way to make the layer store kind and lock holding state visible to the type system — I couldn’t find a practical one with Go; still, getting rid of the interfaces would be a move in the opposite direction.)
| var slo *stagedLayerOptions | ||
|
|
||
| if diff != nil { | ||
| m := newMaybeStagedLayerExtraction(diff, s.graphDriver) |
Formally, this is racy WRT updates of s.graphDriver … while using rlstore.driver is not.
| // CommitFunc is only set when there is no error returned and the int64 value returns the size of the layer. | ||
| // | ||
| // This API is experimental and can be changed without bumping the major version number. | ||
| func (d *Driver) StartStagingDiffToApply(options graphdriver.ApplyDiffOpts) (tempdir.CleanupTempDirFunc, *tempdir.StageAddition, int64, error) { |
A warning that this can run concurrently with any other operations on the driver would be nice (both here and in the interface definition).
| } | ||
|
|
||
| // ApplyDiff applies the new layer into a root | ||
| func (d *Driver) applyDiff(target string, options graphdriver.ApplyDiffOpts) (size int64, err error) { |
A warning that this can run concurrently with any other operations on the driver would be nice.
… and that might motivate auditing and documenting which fields of overlay.Driver are immutable after construction.
| } | ||
| }() | ||
| // FIXME: type case should be safe for now but really there should be a better way to do this | ||
| err = m.stageWithUnlockedStore(rlstore.(*layerStore), lOptions) |
#378 (comment) talks about ID mapping values from the parent layer — if I’m reading the code right, that does not yet happen.
No description provided.