Extend CLI document get to fetch multiple documents. #33071

Merged 2 commits on Jan 6, 2025

wix-mikej
Contributor

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

We had a use case where we needed to pull many individual documents from one Vespa instance for relay to another. The existing document get mechanism incurred a lot of process overhead when retrieving multiple documents, so I adapted the client to allow multiple IDs to be passed and processed serially. It loops until all specified document retrievals have completed. It does not halt early on server errors or missing documents, just as the original implementation did not.

@kkraune kkraune requested a review from mpolden January 3, 2025 08:06

for _, docId := range parsedIds {
	result := client.Get(docId, fieldSet)
	printResult(cli, operationResult(true, document.Document{Id: docId}, service, result), true)
Member

printResult can return an error, so this should continue to return it to the caller.

Member

I took the liberty of fixing this myself.

Contributor Author

@mpolden Because (in our use case) we wanted it to process all documents, even if some fail (due to not existing), I was intentionally not exiting early. We would let printResult log the failure and continue. There is an argument to be made that the document get was successful in this case; there was just no data.

Member

Right, how about this then? We print all documents but return an error if at least one read failed so that the command exits with non-zero.
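A minimal sketch of that behavior in Go, for illustration only. The names `getResult`, `getAll`, `fetch`, and `show` are hypothetical stand-ins, not the Vespa CLI's actual types or functions: every ID is fetched and printed, and a single error is returned at the end if at least one read failed.

```go
package main

import (
	"fmt"
	"os"
)

// getResult is a hypothetical stand-in for the outcome of one document get.
type getResult struct {
	ID      string
	Success bool
	Message string
}

// getAll fetches every ID in order and shows each result. It never stops
// early; it returns one error at the end if at least one read failed.
func getAll(ids []string, fetch func(string) getResult, show func(getResult)) error {
	failed := 0
	for _, id := range ids {
		r := fetch(id)
		show(r)
		if !r.Success {
			failed++
		}
	}
	if failed > 0 {
		return fmt.Errorf("%d of %d document reads failed", failed, len(ids))
	}
	return nil
}

func main() {
	// Fake fetcher (assumption): one ID is treated as missing.
	fetch := func(id string) getResult {
		if id == "id:mynamespace:music::foobar" {
			return getResult{ID: id, Success: false, Message: "not found"}
		}
		return getResult{ID: id, Success: true, Message: "ok"}
	}
	show := func(r getResult) { fmt.Println(r.ID, r.Message) }

	ids := []string{
		"id:mynamespace:music::a-head-full-of-dreams",
		"id:mynamespace:music::foobar",
	}
	if err := getAll(ids, fetch, show); err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1) // non-zero exit, as suggested in the comment above
	}
}
```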

Contributor Author

If the client returns a non-zero status, that would indicate to our runner process that it failed (and it would be retried).

I don't know if there are other non-critical errors that could get bubbled up, but perhaps it only makes sense to continue on 404s? In that case, what about making it opt-in with a flag, something like --continue-on-missing or similar?

Member

@mpolden mpolden Jan 8, 2025

It sounds like vespa visit is a better fit for your use case. It will also be faster. For example:

$ vespa visit --selection 'id = "id:mynamespace:music::a-head-full-of-dreams" OR id = "id:mynamespace:music::hardwired-to-self-destruct" OR id = "id:mynamespace:music::foobar"'
{"id":"id:mynamespace:music::a-head-full-of-dreams","fields":{"artist":"Coldplay","year":2015,"category_scores":{"type":"tensor<float>(cat{})","cells":{"pop":1.0,"rock":0.20000000298023224,"jazz":0.0}},"album":"A Head Full of Dreams"}}
{"id":"id:mynamespace:music::hardwired-to-self-destruct","fields":{"artist":"Metallica","year":2016,"category_scores":{"type":"tensor<float>(cat{})","cells":{"pop":0.0,"rock":1.0,"jazz":0.0}},"album":"Hardwired...To Self-Destruct"}}

You can even pipe this directly to vespa feed to relay the documents to another instance: vespa visit -a tenant.app.instance1 ... | vespa feed -a tenant.app.instance2 -.

See https://docs.vespa.ai/en/vespa-cli.html#cheat-sheet and https://docs.vespa.ai/en/reference/document-select-language.html.
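For a longer list of IDs, the selection string in the example above could be generated rather than typed by hand. The following Go sketch is an illustration only: `selectionForIDs` is a hypothetical helper, and it assumes that Go-style double-quoted strings (via `strconv.Quote`) are acceptable to the selection language for ordinary document IDs; IDs with unusual characters may need different escaping.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// selectionForIDs builds an expression of the form
//   id = "<id1>" OR id = "<id2>" ...
// matching the shape of the selection shown above. It rejects an empty
// list so that callers do not end up with an empty selection by accident.
func selectionForIDs(ids []string) (string, error) {
	if len(ids) == 0 {
		return "", errors.New("no document IDs given")
	}
	clauses := make([]string, 0, len(ids))
	for _, id := range ids {
		// Assumption: Go string quoting is sufficient for typical IDs.
		clauses = append(clauses, "id = "+strconv.Quote(id))
	}
	return strings.Join(clauses, " OR "), nil
}

func main() {
	sel, err := selectionForIDs([]string{
		"id:mynamespace:music::a-head-full-of-dreams",
		"id:mynamespace:music::hardwired-to-self-destruct",
	})
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	// The printed string can be passed as the --selection argument.
	fmt.Println(sel)
}
```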

Contributor Author

Using visit definitely would have made the job easier, had it worked in our use case. I don't know/understand all the details, but you can chat with @kkraune if you want to get the full rundown. I will take a stab at a PR to allow what I was proposing, and you can determine if it's worth merging at that point.

@mpolden mpolden merged commit f876534 into vespa-engine:master Jan 6, 2025
1 check passed
@wix-mikej wix-mikej deleted the multi-get branch January 8, 2025 19:05
@kkraune
Member

kkraune commented Jan 9, 2025

There are good cases for a multi-get, so I think it makes sense to add this.

3 participants