Add ability to remove archive documents from ElasticSearch using original URL #364
Conversation
Mostly LGTM 👍
I think it would be safer to have the script prepare files with lists of WARC files (to be used with @file on the command line) rather than running the delete: that way the lists can be prepared and inspected before the deletion is done!
(branch updated from 751e7da to 012f837)
I have run the script to get the WARC files from B2 as follows:
Attached is the list of files we expect to process and delete from Elasticsearch:
@m453h the list should include more files (ones with just the mc- prefix and ones with the mcrss- prefix), but I think that discussion can take place in issue #353. I think we should have many eyes examine the generated lists before doing the erasure. @pgulley should we add some assignees to issue #353 (above) as we continue the process?
I've put notes on how to find the WARC files that need to be "erased" at #353 (comment). I think all discussion of the "erase lists" belongs in issue #353 rather than here.
🚀
🚀 LGTM
Some suggestions on simplifications
As an old Unix programmer I think programs should do one thing, and as
an old Python programmer, I think there should be one way to do things.
My concerns are:
I'd like to be able to review what is going to be removed: to see the
actual files containing the names of the files that will be processed,
not a listing of what would be processed if almost exactly the same
command is later re-entered.
It's incredibly easy to mistype a command. I recently accidentally
shut down a server on a Friday night by typing the wrong thing in the
wrong window.
Having there be two ways to do it (using the reviewed file, or
re-entering the command) reduces my confidence that the reviewed lists
are going to be the ones actually used.
I'll leave the final call to @pgulley as the client.
To summarize: When serious/important/delicate operations are done,
I think it's best to have them done in repeatable, self documenting
ways.
For example, we're about to set up eight servers. I think it's
preferable to automate the process (even if it takes longer) so that
it's done consistently, so if (or when) it needs to be redone, the
same result is reached.
Hey @philbudne I agree with the concerns raised and believe it's important to improve on the current approach. Before moving forward with these improvements, I want to ensure I'm not overlooking anything. I'm proposing the following changes to address the concerns:
Please let me know if this approach makes sense and effectively addresses the key concerns with the current implementation.
@m453h Thanks for laying out that approach. I think using SQLite in the middle sounds fine, and having a singular spot in between the two processes helps make sure that things are done in order. @philbudne, does this address your concerns? I'd love to move on this task ASAP so we can clear our plates for the new servers.
I had mentioned querying the SQLite3-based Tracker files (used in the
Queuer framework to allow multiple active processes to atomically
claim ownership of one of many input files: WARC, RSS, or CSV, whether
local or on S3 or B2) as an alternative to adding code that writes an
output file containing a list of files that had already been processed.
I wasn't suggesting using SQLite to track the individual URLs to be
removed. SQLite doesn't do row locking, only full table locking, so
I'm concerned that it's most suitable for tracking files that contain
batches of work, not individual stories...
I think we're now talking about a five-step process:
Step 1 can be Michael's script, preferably outputting a file for each
of the three ways we filled in the "dip":
a. From CSVs generated by reading RSS files from S3
(WARC files starting with mccsv-)
b. From RSS files, generating WARCs starting with mc-
(these are the ones that need the closest examination!)
c. From RSS files, generating WARCs starting with mcrss-
Step 2 "archive-eraser" renamed archive-lister(?); As many copies as
desired can be run, either with @file-a @file-b @file-c (and they
will compete for input files using the SQLite3 based tracker) or by
running three groups of one or more with each of the input files.
The "process_story" method would write the URL to a different file
specified with -o for each invocation. Having one output file
for each WARC would give additional information, and help split
up the work for the next step.
Step 3 the eraser: Could be based on the csv-queuer, where the
"process_story" method does the erasing (using the Tracker mechanism
to compete over "url list" files) so that the number of workers at any
time can be varied.
Step 4: Getting a list of the "mchist2022-" WARCs generated in December
into a file or files using Michael's script.
Step 5: Copying the files from step 4 into a subdirectory in the
arch-indexer stack "/app/data" (i.e. named mchist2022) and starting
a stack with '-T archive' and '-I data/mchist2022'
(fuzzy on this step, I haven't run an archive stack in
almost EXACTLY a year!!)
Thanks @philbudne for the description of the steps. I have made the changes to the implementation; we now have 3 separate scripts (I did rename
I hope this significantly improves the deletion process. I have also added a few safety features, like retries when deletion from Elasticsearch fails.
👍🏽
Since we've made some major changes, can you now include what the various outputs look like (maybe by updating the PR description)?
indexer/scripts/arch-eraser.py (Outdated)
    assert self.es_client
    success, _ = bulk(
The structure of these deletes is very similar:
- Can't we use the retry_on_status, max_retries, max_backoff, etc. options to trigger retries automatically rather than having to create our own retry logic?
- Is there any impact if we were to use bulk to delete a single document?
Since this code isn't something we'll be running a lot, I'd like it to be as simple as possible so it's easier to come back to a year from now and still easily understand all code paths.
- This could possibly work, and we could rely completely on the retry behavior we get from the helper function. However, one issue I encountered was that when a request times out, the client doesn't retry the request and instead throws an exception. I did a bit more reading on this, and one way would be to set retry_on_timeout=True by overriding the method we are using to create an instance of the Elasticsearch client. I'll run some tests to be sure we'll get consistent behaviour.
- I'm not exactly sure whether deleting one document using the bulk helper has the same impact as deleting one document using the delete method. I had kept it as an alternative in case the bulk operations fail more often when we do the actual deletion (one idea we had initially discussed was to first try a bulk delete operation and, if it fails, to fall back to calling a delete operation for each individual document in the delete batch), but we could completely remove this path for the sake of simplicity.
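For reference, here is a minimal sketch (not code from this PR) of the client-level and bulk-helper retry options being discussed, assuming elasticsearch-py 8.x; the host, index name, document id, and numeric values are illustrative:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    # Transport-level retries: retry requests that time out or return one of the
    # listed HTTP statuses, up to max_retries times.
    es = Elasticsearch(
        "http://localhost:9200",  # illustrative host
        request_timeout=30,
        max_retries=3,
        retry_on_timeout=True,
        retry_on_status=(429, 502, 503, 504),
    )

    # Bulk-helper retries: documents rejected with 429 (too many requests) are
    # retried with exponential backoff between initial_backoff and max_backoff.
    actions = [
        {"_op_type": "delete", "_index": "mc_search-00002", "_id": "example-id"}
    ]
    success, errors = bulk(
        es,
        actions,
        max_retries=3,
        initial_backoff=2,
        max_backoff=60,
        raise_on_error=False,
    )

Note that the bulk helper's retries only cover documents rejected with 429; other transport failures are handled (or not) by the client-level settings above.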
... and to be 100% clear, I'm not saying the bulk update is the better option (other than it appears to have retry logic built in); I'm saying it's best we evaluate and pick the one method that will work well for our case(s) and optimise for that, rather than having 2 methods that may behave completely differently at runtime.
Sure, from the initial tests I did locally, I would advise we go for the bulk delete option (it significantly reduces the time required to delete documents).
@kilemensi one limitation I have faced while using the combination of retry_on_status, max_retries, and max_backoff is that the deletion process will halt with an exception when there is a complete network failure and Elasticsearch fails to return a response. I believe we still need to retry in this scenario, after waiting for a few seconds, before exiting.
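A rough sketch of the kind of fallback being described, assuming elasticsearch-py 8.x (where connection failures surface as elastic_transport exceptions); the function name, attempt count, and wait time are hypothetical:

    import time

    from elastic_transport import ConnectionError, ConnectionTimeout
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    def bulk_delete_with_retry(es: Elasticsearch, actions: list, attempts: int = 5, wait: float = 30.0):
        """Retry the whole bulk call when Elasticsearch is unreachable, rather than
        letting a transient network failure abort the deletion run.
        `actions` must be a list (not a generator) so it can be resent in full."""
        for attempt in range(1, attempts + 1):
            try:
                return bulk(es, actions, raise_on_error=False)
            except (ConnectionError, ConnectionTimeout):
                if attempt == attempts:
                    raise  # give up after the final attempt
                time.sleep(wait)  # wait a bit before retrying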
Co-authored-by: Clemence Kyara <[email protected]>
indexer/scripts/arch-eraser.py (Outdated)
    def _fetch_documents_to_delete(self, urls: List[str]):
        try:
            query = {
                "size": self.fetch_batch_size,
I'm surprised to see a query!
If it's just to find the ID of the document to delete, then you can use mcmetadata.urls.unique_url_hash() on the URL.
Any reason to use the original_url field (below)? It ended up in ES because it's one of the outputs of mcmetadata, and in normal operation it should be identical to url, and I don't think it's actually used by anything (if it was possible to remove fields in the move to the new cluster, it would be one of two I'd suggest tossing; the other is full_language)!
Thanks for the added context, @philbudne! I have made the change to use the ID constructed from mcmetadata.urls.unique_url_hash().
Regarding why we have a query: it is to retrieve both the index and ID of the document, as Elasticsearch requires this information for deletion. Since the stories could be in either mc_search-00002 or mc_search-00003, I thought retrieving the story by original_url/ID to identify its index before constructing the list of stories for bulk_delete would be the most appropriate approach.
An alternative to prefetching the stories to determine their index would be to use the delete_by_query API. However, this can potentially use the scroll API when handling large datasets, adding more load to Elasticsearch. That's why I opted for a query using search_after and PIT.
I’m definitely open to exploring any other approach you think could improve this!
@m453h haven't studied delete_by_query, but only one document should EVER match the "id" within a single index, and we try not to index a URL more than once by doing an id query against all indices before trying to import a story, so if the delete_by_query is limited to index ...02 and ...03 the worst case is two documents.
My concern is that the separate lookup and delete will have longer latency (two round trips thru the API) and more impact on the ES servers (locating the document twice). On the other hand I could be worrying about nothing!
@philbudne I’ve updated the implementation to use delete_by_query. Could you take a look and let me know if this approach looks better? 🤞
Also, I noticed from the docs that we can throttle requests, which might help prevent the deletion process from overwhelming the cluster. I was thinking that this could also give us a more accurate way to estimate the total deletion time. I've currently not set any throttling; perhaps determining a realistic rate for document removal and setting it as the default value could be useful.
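For illustration, a rough sketch (not the PR's actual code) of a delete_by_query call keyed on mcmetadata.urls.unique_url_hash() and throttled with requests_per_second, assuming elasticsearch-py 8.x; the index names follow the ones mentioned above, and all numeric values are illustrative:

    from typing import List

    from elasticsearch import Elasticsearch
    from mcmetadata.urls import unique_url_hash

    def erase_urls(es: Elasticsearch, urls: List[str]) -> int:
        """Delete documents whose _id is the hash of their URL, limited to the two
        indices the stories could live in, without a separate lookup round trip."""
        resp = es.delete_by_query(
            index=["mc_search-00002", "mc_search-00003"],
            query={"ids": {"values": [unique_url_hash(u) for u in urls]}},
            conflicts="proceed",      # don't abort the whole batch on version conflicts
            requests_per_second=100,  # throttle so the cluster isn't overwhelmed
        )
        return resp["deleted"]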
Looks good!
yes, looks great to me!
This PR modifies the Arch Eraser sketch to add the ability to delete documents from Elasticsearch.
The deletion is a 3-step process, achieved by using 3 different scripts:
1. Creating a list of WARC files to be processed:
This is done by using the run-arch-warc-lister.sh script. It takes a start date, end date, search pattern, and output file path as input, and generates a file listing the WARC files that will be processed in the next step. The script can be invoked as follows:
The script arguments are described below:
<start_date>
<end_date>
<pattern>
<output>
Example:
This will output the list of WARC files in the path:
2. Creating a list of URLs to be deleted:
This is done by using the run-arch-url-lister.sh script. It receives an indirect file (a file that contains a list of WARC files to process, generated in step 1) and outputs a list of URLs to be deleted (by default it produces one txt file per WARC file it has processed). The script can be invoked as follows:
The script arguments are described below:
<input_file_path>
[output_file_path]
Example:
This will output a txt file for each WARC file listed in file-1.txt under the path:
3. Deletion of documents from Elasticsearch
This is done by using the run-arch-eraser.sh script. It takes a path containing the files with the lists of URLs to delete (generated in step 2), along with other Elasticsearch-specific arguments, and carries out the deletion process (a hedged sketch of how the batch and delay options might be applied internally follows the argument list below). The script can be invoked as follows:
<path_to_url_list_file>
--elastic-search-hosts
--indices
--min-delay
--max-delay
--batch
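As a rough illustration only: a sketch of how the --batch, --min-delay, and --max-delay options might be applied during deletion. This is an assumption about the script's behaviour, not its actual code; the function name and defaults are hypothetical, and it reuses the hedged delete_by_query pattern shown earlier:

    import random
    import time
    from typing import List

    from elasticsearch import Elasticsearch
    from mcmetadata.urls import unique_url_hash

    def erase_in_batches(
        es: Elasticsearch,
        urls: List[str],
        indices: List[str],
        batch_size: int = 1000,   # assumed meaning of --batch
        min_delay: float = 0.5,   # assumed meaning of --min-delay (seconds)
        max_delay: float = 3.0,   # assumed meaning of --max-delay (seconds)
    ) -> int:
        """Delete URLs in fixed-size batches, sleeping a random interval between
        batches so the cluster is not hit with a continuous stream of deletes."""
        deleted = 0
        for start in range(0, len(urls), batch_size):
            batch = urls[start : start + batch_size]
            resp = es.delete_by_query(
                index=indices,
                query={"ids": {"values": [unique_url_hash(u) for u in batch]}},
                conflicts="proceed",
            )
            deleted += resp["deleted"]
            if start + batch_size < len(urls):
                time.sleep(random.uniform(min_delay, max_delay))
        return deleted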
Example:
Addresses #353