-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should create multiple documents with same id throw error? #79
Comments
I may be just pre-coffee, but I'm missing what the illustration here. There If you made a second doc and stored it with the same |
The case I'm wondering about is in MeetingNotes:
Then I have two documents open with same ID, but "I think" only second one is now held by repository. Seems like a weird state to accidentally get into, but right now unless you know details of how everything works, maybe easy to get into when building document based app. I'm wondering if repo code should protect against this state. In particular in the public api calls that result in |
Ahhh, now I've got it. The gist is to disallow the idea that repo.create(id,...) as replacing the existing id - that's probably good. But I think there may be a very legit case here - I think in the MeetingNotes example, I want that same ID with that file, even replicated, because it represents a fork that could be merged. We've lost it from the repo, so if you added editing into your sequence, then (to me) it becomes clear that what should happen on a follow on 'create' is that if it's the same ID, it should be "merged" into place, not just replaced, and if that throws an error (different document bases entirely), then that should propogate up. What do you think? Or would it make more sense to just disallow this pattern altogether (error on creating where there's a doc with the same ID already in the repo?) |
Right... that's a tricky case I hadn't thought through.
To me simplest solution is to throw error in this case. User will still be able to edit both documents. Error will tell user that that second document is not registered in repo, and not syncing, otherwise works fine. To me this seems like correct behavior for a But what if you did want to solve for this problem? Even if you use the For example you would open your old forked document, and then its state would get replaced with the already open document state. Alternatively if on opening second document there was an automatic merge... then still a problem if you didn't want to do that merge, you wanted to keep the fork state. I guess maybe the automatic merge state loss (from users perspective) could be resolved by making documents also store the set of heads they are working from? I haven't really used that API, but I think it might allow both documents to share a common Automerge.Document instance, while still working on one forks? |
There's been some focus and conversation in the Automerge community at large about tracking "forks" of documents, so I'm inquiring about this as it might play out for the JS version of automerge-repo - it seems aligned with our thinking here. A potentially interesting note - if you "clone" a document from a repo, you get the identical document (a fork) with a new ID from the repo, but there's no API concept of a way to merge those tegether in the repo - that's all in Document, if you want to try. The path we're running down is roughly equivalent to forking, except we're doing it well outside of the reach of any consolidation in a repository - so we're replicating docs and IDs. It may be a just a poor match for the example where you have documents stashing the information, where those documents can be replicated infinitely. When you bring them back to the repository, I think it makes sense to want to:
That runs a bit in the face of 1:doc to 1:id and only 1 reference to an ID that the current API provides. It's conceptually enforcing that there's only 1 representation (ideally combined I would suspect) that's valid (within a repository) for a single ID. In the meantime, throwing an error on create with an ID when there's already an ID would make a good baseline that "Hey, something is awry here". |
Actually I don't think this idea would work. From what I can see heads can be used to read a document at point in history, but you can't use them to edit at a specific point. I think this means multiple fork within a single document is not possible. Probably best as relying on heads for proper edits in any NSDocument backed model seems messy anyway. To me this is an error case. You can't open two documents, same ID, different state and expect them to sync without requiring some sort of user action, such as accepting a merge. |
possible PR to resolve (the strict "throw error on ID already in repo" concept): #82 |
I think this does beg for a repo-based API that allows merge - it would be just a light wrapper around |
I'm working on a Cocoa NSDocument based counter app to try to figure this stuff out. My current thinking is that Repo/syncing are all extra features and don't need to be tied tightly to NSDocument lifecycle. So if all the repo stuff errors out, should still be able to work with document with minimal fuss. Just somewhere have UI indication that document isn't syncing/offline. Maybe that means it's fine to register on window display? |
FWIW in the JS implementation there is no way to specify the document ID when creating a document, the ID is assigned for you. This is deliberate as we don't want people to be able to create two documents with the same ID but with no history in common. Also, the JS Re. tracking branches, right now that's handled by the application in an ad-hoc manner - typically by storing pointers to the branches in the branched document somewhere. |
For swift implementation it's important since document ids and data might be stored externally in document files, but maybe the API should use different terminology? Right now Repo has:
What if those methods were replaced with:
I don't know if connect is best term... but the idea is you are trying to connect an external document with the given id in the repo. The method should implement some find or create logic. The implementation might look like untested code:
|
I think I get the concept you're proposing - not opposed conceptually, and connect, while an overloaded term, is pretty good for what you're after. A couple of quick questions:
I'm at a loss as to the meaning of "Unavailable" from a remote repo, and how that should reasonably play. The two circumstances that I'm aware of:
|
I can't remember if I posted this somewhere already, but I have been thinking that maybe (to handle these various scenarios) the NSDocument model should maintain two IDs: originalId and forkId. The originalId is immutable and used to check for manual merges. The forkId starts out equal to originalId, and in many cases never changes from it, but the forkId can be regenerated to solve for some of the problematic cases above. In particular:
Maybe? |
If a peer receives a
This suprised me because
That you are talking about a scenario where the The reason I'm asking is because In the case of the |
Short answer to the question above:
Yes - outside of the app even running, the content was duplicated, and then that content opened from disk. In iOS/macOS using NSDocument and equivalents, the framework wants to own loading the data and owning when and how it writes the information. So in both MeetingNotes and what @jessegrosjean is doing, we're using that to provide the binary content for the Automerge document (from The use case in particular is the app loads two of those files from disk, they are forks - have identical IDs, so we know they can merge, and may represent different editions or such - at the current stage we only know they sourced from the same document, and we have two in memory, with a starting position that we have that external to how a repository manages anything. |
I think maybe trying to create a fork automatically because of document copy in the finder is the wrong idea. Instead any document with same id should always sync/merge with any other document found with that same id. From users perspective:
Maybe? |
This seem sensible to me. I've been thinking more about not exposing document IDs to users. Could this be achieved by defining some kind of "document bundle" format which the I've been thinking of introducing a similar bundle format for these usecases in the JS world as well, for similar reasons. |
Are you meaning anywhere in the API, or just removing from repo.create methods? I think it makes sense to remove them from create methods. Generally having read access to IDs in the API seems like it would still be useful sometimes. For example when generating URLs.
I think that a standardized way to export documents from a repo and then load them into another repo would be very useful. If you do decide to create a new bundle format would it make sense for it to contain multiple Document/ID pairs? I think the behavior needed by NSDocument could also be useful for the Javascript version of Automerge. For example say you are working with an Automerge repo in a web browser. It could be nice to have an export command that exports the local repo storage to this bundle format. You could then take this bundle to some other computer that doesn't have network and load the bundle to restore those documents. I think NSDocument would want to use the exact same export/load implementation. Eventually maybe you could also give it options to handle forking documents in a standardized way. Given this I don't think my above
|
Yup apologies for the fuzziness, this is what I meant. Basically it's a bad idea to let users decide which Document IDs are associated with which data, it's very easy to create extremely surprising merge scenarios using this feature. Exposing the document ID for read is useful as you say - although I think we should prefer URLs (in the form Re. multiple document IDs. I can see what you're getting at. What we actually want here though I think is native branch support - which is coming some time this year. This will be an automerge feature which allows a document to have multiple divergent branches which you can select from. I think for now it makes more sense to have a bundle format with a single document ID in it and have forking and merging be an explicit user action at the application level as you suggest. That aside, I think your description of what should happen when loading a bundle is sensible. |
Cool on native branch support! :), but actually I think I'm taking about something else. I just mean a NSDocument might represent a workspace that stores multiple Automerge.Documents. For example maybe a single NSDocument represents a book you are working on. And each chapter is represented by a separate Automerge.Document. If the bundle supported a map of ids to documents then a single bundle could support this case. At the same time even if bundles are one id to one document I can also make it work with a loop over multiple bundles. It doesn't really matter, whichever you think is better I'm for. It's the load logic added to the repo that provides the real value for me. |
Oh I see what you mean and yeah I think you're right, we probably want a bundle format that supports exporting multiple documents from the repo, tagged by ID. I wonder if such a format would also need a hint as to what each document is for. I can imagine that if you load a bundle of documents in you book-authoring app you need to know which document ID is the root document. (This is getting pretty close to what git bundle does. |
I hadn't thought about that, but seems useful. I think for NSDocument case we could also just store that metadata in NSDocument storage, outside the bundle data. But to support the more general case of exporting browser repo state using the bundle format and importing it elsewhere, then it probably becomes more required. Other option being to loop over all imported documents and inspect state. I don't think the repo would know what metadata to store, so the API would need to look something like this, when API client provides the metadata?
|
Well, one way to avoid this problem of providing metadata is to say that the metadata must be an automerge document which links to every other automerge document in the bundle (directly or transitively). A document "links" to another document by containing somewhere within it a string of which looks like an automerge URL. This way, you just provide the "metadata" document (which might be something like a table of contents) to the The API then would look something like:
One of the nice properties of this approach is that bundles can be composed of other bundles (by linking to their root documents) without having to do some application level plumbing to carry the metadata around. Implementing this requires more work on the part of repo though, primarily in that Repo has to scan through all the (transitively) linked documents from the root document looking for automerge URLs. This will get easier as I start implementing collection sync because I will be adding a primitive type for "links" and utility functions for loading them, but right now this might be a bit of schlep. |
This sounds like it could work pretty well for my case. Finding/following links seems a little imprecise, but workable. I think maybe you would need a client provided filter to allow skipping links. For example in the book composed of Automerge.Document paragraphs example... If a paragraph has an automerge link to some other object, you might (or might not) want to embed that content into the bundle. Also you might want some options to indicate what should happen in case of an automerge url that doesn't exist in local repo storage. Should the repo do a find to search peers, or skip if it doesn't exist locally. Anyway, I think this could generally work for NSDocument case. I don't really know if filtering options are needed, just a thought. |
I'm very loathe to remove the ability to import/create a document with an ID. The ID is the only indicator that "this is the same document/documents can be 'safely' merged" that we have. Once they're just blobs on the filesystem with arbitrary human provided names, the questions of "can this merge safely" cant easily be determined, which really bugs me. If we had an API means to verify that two documents shared a base history, that would alleviate a huge amount of this issue - then we could explicitly check, and not have the follow-along data to infer that information from. While it's far less relevant in the JS repo, I'm thinking that Jesse's idea of having a separate find() like mechanism that only works against a filesystem location (loading from a bundle, for example) would be a huge benefit for the native app use case. |
I completely agree that the ID is extremely useful for ensuring that merge behavior makes sense. My concern is that allowing people to specify the document ID actually increases the scope for this occurring and it's why I'm suggesting that the only way to load a document from outside of the repo storage be this bundle format, where the bundle includes the document ID in it so that there is no way of the user accidentally associating the wrong ID with the document. Does the idea of a bundle format address your concerns around provenance? |
Mostly. It certainly does for the bundle "load from filesystem" scenario, and is well supported in the Apple scenarios. The place where it isn't as well supported is in mechanisms like "drag and drop", and the new variations that are appearing on the Apple platform where you have a type identifier linked up with a blob of serialized data - that scenario doesn't give you a means to say "these multiple separate things (id, doc) always go together" like the bundle does. In that scenario, I've been leaning into encoded the doc + id into a single thing, to the point where I formalized that in the API as the type DocHandle I'd be happy to make repo create API take only DocHandle or the data from a doc, as a means of transporting these two values together, as that keeps this identifier alongside, but downplays an API that might encourage you to create a conflicting ID scenario. The URL concept is a bit more fraught in Apple's platforms, as the kind of URL support that Automerge-repo is using is typically bound to a single app, so not at all well supported as a general platform. Apple's platform has basically taken that concept and made it relevant to "deep link" into an application, as opposed to having multiple apps that can all use the same scheme (which is also causing a headache in other areas - RSS in particular - and the whole concept of "default apps", and the lack of control for such a concept in iOS, but, that's a whole separate Apple annoyance issue). The whole need here to keep an id is, in my mind, really only relevant to knowing if a document can be safely merged. To be honest, I'd very much prefer to have a way to remove the need for an ID at all - at least as that primary indicator, and switch entirely to something we can introspect on the data of an Automerge document to know if there's a shared base history. |
Interesting, would you be able to point me at the best thing to read to understand what data you do get in a "drag and drop" scenario? Or better yet, explain how you imagine this currently working in the concrete case of something like MeetingNotes? I don't have a clear picture in my head of what the platform is managing for us and what we are providing. Re. URLs. I don't imagine automerge URLs as being user facing in general. As you say, in practice user facing URLs are typically app specific on most modern platforms. I still think it's worth using automerge URLs internally though. URLs have nice standardised extensibility mechanisms which we will almost certainly need - and occasionally they will be useful as user facing things. To the last point - I agree that we should definitely have tools for determining if two documents have a shared history and we probably will do something like add a native document ID concept. However, there are still lots of cases where you actually want documents with shared history to be handled separately by the persistence and transport mechanisms - e.g. I might have a private version of my meeting notes, which includes a whole bunch of information which I don't share in the public version, but they start from a common base. I think this means we will always need an ID and it will probably always need to be part of the application layer API - unless we can find a way to subsume all privacy and visibility APIs into a layer which is independent of the application, which seems like a project that might take some time. |
Happy to help provide references: For the drag and drag, one-binary-thing I was mentioning, the bits that Apple started pushing is called "Transferrable". I wish there was more to read on it, but there's a WWDC video that provides an overview (14min long), as well as some minimal API docs and then API that enables system-provided capabilities to "share" content (https://developer.apple.com/documentation/SwiftUI/ShareLink). I don't have a solid idea of HOW to integrate Automerge with that, but was leaning torwards the idea of supporting it by sharing an Automerge doc as the type that gets serialized. Is there a clearly defined detail about parsing out the Automerge URL format? You mentioned even potentially linking to content (inside a schema within a Doc?) using it, and I don't think I've come across anything yet other than a loose I think I have a slightly negative reaction to using them because they can be so tricky to parse correctly, and what I think of as the misalignment with the platform usage, but you're absolutely right that they're structured formats that incorporate the idea of a location (id) and can be built on from there. |
I'm not sure that I understand the problem caused by a common bundle format here. I expect the bundle format could have an API that allows you to read both rootId and all included Ids. Just not write. So you could still test in code if it makes sense to merge. In my understanding the purpose of the bundle format is just:
Another nice feature of the bundle format is CLI case. Currently it would be impossible to write a generic CLI tool that could inspect macOS automerge backed documents, since the ID must be stored in the format, and each app will decide how to do that differently. With a standardized bundle format then that document/id association will be standardized in the file format. So a CLI could interact with any automerge backed app that externalizes using the package format. All down the road, but I think useful. |
No qualms with the bundle format, I like the idea and where that's going. Was thinking some on what a more useful API would be that leveraged that - like a ".load()" and some light thinking through of where the relevant base fileURL is configured and how to work with it. My point was more that I saw value in allowing create() to specify an ID, so that you could load documents that had a serialized ID within them, even if that was a potential footgun for duplicate ID scenarios. Adding an API to do that more safely with a bundle format makes a ton of sense. |
I think the problem with create taking an ID is that it's the wrong verb. If you already have an ID then the document has already been created somewhere else. You can't be creating it a second time. Instead you musts be wanting to load, which has different expectations:
If for some reason you really did want to associate arbitrary document with an ID then the bundle runtime object could include an escape hatch constructor that allowed you to create that association. I'm still not sure when you would need that, but it would be safer there than in the main repo API.
I don't understand how this would be different from encoded bundle. In simple case an encoded bundle would contain the exact same info, id:document. Edit And now that I continue to think about this I do see a place where I would need to escape hatch myself. In the NSDocument case, when you create a new document you will create the DocumentId, but you may not add to repo right away. So escape hatch is needed to associate your DocumentId and Document in repo. Still I think it would be cleaner to do that by constructing a bundle and loading it, rather then passing id into create. |
Aside from the bundle concept (which I think is a good idea, just larger than I want to try in a singular, small update), I think the changes we're leaning towards: Keeping these:
And referencing the question in #95, if a handle is in the deleted or unavailable state when loading, add in the document and convert it's state to This jive with your thinking (@jessegrosjean & @alexjg ?) bikesheddingI was half-thinking of reserving the verb I'm thinking |
Yeah, to me that all makes sense, and would be useful right away for my demo app.
Of those I think I like import best with load a close second. Using |
(there is a storage provider setup, and a test one to work out the interactions, but nothing yet that hits a filesystem on iOS/macOS - added an issue to make one: #97 |
resolved with implementation of |
For example this seems to work without error:
I think without error it means that doc1 is lost from the repo and there's not really any way for original caller of repo create to detect that document instance has been replaced.
The text was updated successfully, but these errors were encountered: