Add ability to remove archive documents from ElasticSearch using original URL #364
**philbudne**:
I'm surprised to see a query!

If it's just to find the ID of the document to delete, then you can use `mcmetadata.urls.unique_url_hash()` on the URL. Any reason to use the `original_url` field (below)? It ended up in ES because it's one of the outputs of mcmetadata, and in normal operation it should be identical to `url`, and I don't think it's actually used by anything (if it was possible to remove fields in the move to the new cluster, it would be one of two I'd suggest tossing; the other is `full_language`)!
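A sketch of what I mean (assuming, as in normal indexing, that the document `_id` is the value returned by `unique_url_hash()`; the URL is just an example):

```python
from mcmetadata.urls import unique_url_hash

# The _id of an indexed story is the hash of its URL, so it can be
# computed locally, with no search round trip to find the document.
doc_id = unique_url_hash("https://example.com/some/story")
```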
**m453h**:
Thanks for the added context, @philbudne! I have made the change to use the ID constructed from `mcmetadata.urls.unique_url_hash()`.

Regarding why we have a query: it is to retrieve both the index and the ID of the document, as Elasticsearch requires both for deletion. Since the stories could be in either `mc_search-00002` or `mc_search-00003`, I thought retrieving the story by ID to identify its index before constructing the list of stories for the bulk delete would be the most appropriate approach.

An alternative to prefetching the stories to determine their index would be to use the `delete_by_query` API. However, that can potentially use the scroll API when handling large datasets, adding more load to Elasticsearch. That's why I opted for a query using `search_after` and a PIT (point in time).

I'm definitely open to exploring any other approach you think could improve this!
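Roughly what the prefetch looks like (a simplified sketch using the Python client; the endpoint, `urls_to_remove`, and batch size are illustrative, not the exact code in this PR):

```python
from elasticsearch import Elasticsearch, helpers
from mcmetadata.urls import unique_url_hash

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
doc_ids = [unique_url_hash(u) for u in urls_to_remove]  # urls_to_remove: input URLs

# A point in time (PIT) keeps pagination consistent while paging with search_after.
pit = es.open_point_in_time(index="mc_search-*", keep_alive="1m")
actions, search_after = [], None
while True:
    kwargs = dict(
        size=1000,
        query={"ids": {"values": doc_ids}},
        pit={"id": pit["id"], "keep_alive": "1m"},
        sort=[{"_shard_doc": "asc"}],  # tie-breaker sort required by search_after
    )
    if search_after is not None:
        kwargs["search_after"] = search_after
    hits = es.search(**kwargs)["hits"]["hits"]
    if not hits:
        break
    # The hit metadata tells us which concrete index holds each story.
    actions += [
        {"_op_type": "delete", "_index": h["_index"], "_id": h["_id"]} for h in hits
    ]
    search_after = hits[-1]["sort"]
es.close_point_in_time(id=pit["id"])

helpers.bulk(es, actions)  # bulk delete with the (index, id) pairs we found
```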
**philbudne**:
@m453h haven't studied `delete_by_query`, but only one document should EVER match the "id" within a single index, and we try not to index a URL more than once by doing an id query against all indices before trying to import a story. So if the `delete_by_query` is limited to index ...02 and ...03, the worst case is two documents.

My concern is that the separate lookup and delete will have longer latency (two round trips through the API) and more impact on the ES servers (locating the document twice). On the other hand, I could be worrying about nothing!
**m453h**:
@philbudne I've updated the implementation to use `delete_by_query`. Could you take a look and let me know if this approach looks better? 🤞

Also, I noticed from the docs that we can throttle requests, which might help prevent the deletion process from overwhelming the cluster. I was thinking this could also give us a more accurate way to estimate the total deletion time. I've currently not set any throttling; perhaps determining a realistic rate for document removal and setting it as the default value could be useful.
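For reference, a minimal sketch of the throttled call (the index list and rate here are illustrative, not tuned defaults):

```python
from elasticsearch import Elasticsearch
from mcmetadata.urls import unique_url_hash

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
doc_ids = [unique_url_hash(u) for u in urls_to_remove]  # urls_to_remove: input URLs

# delete_by_query limited to the two indices that can hold the stories;
# requests_per_second throttles the deletes so the cluster isn't overwhelmed.
resp = es.delete_by_query(
    index=["mc_search-00002", "mc_search-00003"],
    query={"ids": {"values": doc_ids}},
    requests_per_second=100,  # illustrative rate, not a recommended default
    conflicts="proceed",      # continue past version conflicts instead of aborting
)
print(resp["deleted"], "documents deleted")
```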