
Add ability to remove archive documents from ElasticSearch using original URL #364

Merged: 24 commits into mediacloud:main from the arch-eraser-sketch branch on Feb 26, 2025

Conversation

Contributor

@m453h m453h commented Jan 27, 2025

This PR modifies the Arch Eraser sketch to add the ability to delete documents from Elasticsearch.

The deletion is a three-step process carried out by three different scripts:

1. Creating a list of WARC files to be processed:

This is done using run-arch-warc-lister.sh. The script takes a start date, end date, search pattern, and output file path as input, and generates a file listing the WARC files that will be processed in the next step.

The script can be invoked as follows:

./bin/run-arch-warc-lister.sh <start_date> <end_date> <pattern> <output>

The script arguments are described below:

Argument       Description
<start_date>   Start date for filtering files (format: YYYY/MM/DD)
<end_date>     End date for filtering files (format: YYYY/MM/DD)
<pattern>      String pattern used to construct file paths (e.g., 'b2://archives/{pattern}/mchist2022')
<output>       Path to the output file where the archive list will be written

Example:

./bin/run-arch-warc-lister.sh 2024/05/27 2024/05/27 "s3://mediacloud-indexer-archive/{pattern}/mc" file-1.txt 

This will output the list of WARC files in the path:

<PROJECT_ROOT_DIR>/data/arch-lister/warc_list/file-1.txt

2. Creating a list of URLs to be deleted:

This is done using run-arch-url-lister.sh. The script receives an indirect file (a file containing the list of WARC files to process, generated in step 1) and outputs the URLs to be deleted; by default it produces one txt file per WARC file processed.

The script can be invoked as follows:

./bin/run-arch-url-lister.sh <input_file_path> [output_file_path]

The script arguments are described below:

Argument / Option    Description
<input_file_path>    Path to an indirect file (a file that contains a list of WARC files to process)
[output_file_path]   Optional. Path to the output file for the URL list. Default: WARC_FILE_NAME.txt

Example:

./bin/run-arch-url-lister.sh data/arch-lister/warc_list/file-1.txt

This will output one txt file for each WARC file listed in file-1.txt under the path:

<PROJECT_ROOT_DIR>/data/arch-lister/url_list/

3. Deletion of documents from Elasticsearch

This is done using run-arch-eraser.sh. The script takes a directory containing the URL-list files (generated in step 2), along with Elasticsearch-specific arguments, and carries out the deletion.

The script can be invoked as follows:

 ./bin/run-arch-eraser.sh <path_to_url_list_file> [OPTIONS]

The script arguments and options are described below:

Argument / Option          Description
<path_to_url_list_file>    Path to the folder containing the files with the lists of URLs to delete
--elasticsearch-hosts      Elasticsearch host URL(s)
--indices                  Names of the Elasticsearch indices to delete documents from
--min-delay                Minimum time to wait between delete operations (default: 0.5 seconds)
--max-delay                Maximum time to wait between delete operations (default: 3.0 seconds)
--batch                    Number of documents to send in a single delete request to Elasticsearch (default: 1000)

Example:

  ./bin/run-arch-eraser.sh  data/arch-lister/url_list --elasticsearch-hosts=http://localhost:9200 --indices=index1,index2 --min-delay=1 --max-delay=3 --batch=1000
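
For illustration only, a rough Python sketch (not the script's actual code; delete_batch here is a hypothetical stand-in for the Elasticsearch delete request) of how --batch, --min-delay and --max-delay interact:

import random
import time

BATCH, MIN_DELAY, MAX_DELAY = 1000, 0.5, 3.0  # the documented defaults

def erase(urls, delete_batch):
    # Group the URLs into batches, issue one delete request per batch,
    # and sleep a random interval between requests so the cluster is not overwhelmed.
    for start in range(0, len(urls), BATCH):
        delete_batch(urls[start:start + BATCH])
        time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))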

Addresses #353

@m453h m453h marked this pull request as ready for review January 27, 2025 07:08
@m453h m453h requested a review from thepsalmist January 27, 2025 07:35
Contributor

@thepsalmist thepsalmist left a comment

Mostly LGTM 👍

Contributor

@philbudne philbudne left a comment

I think it would be safer to have the script prepare files with lists of WARC files (to be used with @file on the command line) rather than running the delete: that way the lists can be prepared and inspected before the deletion is done!

@m453h m453h force-pushed the arch-eraser-sketch branch from 751e7da to 012f837 Compare January 29, 2025 17:53
@m453h
Contributor Author

m453h commented Jan 29, 2025

I have run the script to get the WARC files from B2 as follows:

./bin/run-arch-eraser.sh --generate-erase-list 2024/12/14 2024/12/31 "b2://mediacloud-indexer-archive/{pattern}/mchist2022" --output-file=arch-eraser/erase-list.txt 

Attached is the list of files we expect to process and delete from Elasticsearch:

erase-list.txt

@philbudne
Contributor

@m453h the list should include more files (ones with just mc- prefix and ones with mcrss- prefix), but I think that discussion can take place in issue #353

I think we should have many eyes examine the generated lists before doing the erasure: @pgulley, should we add some assignees to issue 353 (above) as we continue the process?

@m453h m453h requested a review from thepsalmist January 30, 2025 12:50
@philbudne
Contributor

I've put notes on how to find the WARC files that need to be "erased" at #353 (comment)

I think all discussion of the "erase lists" belongs in issue #353 rather than here

Contributor

@kilemensi kilemensi left a comment

🚀

Contributor

@thepsalmist thepsalmist left a comment

🚀 LGTM

@m453h m453h requested a review from pgulley February 3, 2025 13:35
@philbudne philbudne self-requested a review February 3, 2025 16:18
Contributor

@philbudne philbudne left a comment

Some suggestions on simplifications

@philbudne
Contributor

philbudne commented Feb 4, 2025 via email

@m453h
Contributor Author

m453h commented Feb 4, 2025

As an old Unix programmer I think programs should do one thing, and as an old Python programmer, I think there should be one way to do things.

My concerns are: I'd like to be able to review what is going to be removed: to see the files that contain the names of the files that will be processed, not the list of files that will be processed if almost exactly the same command is later entered. It's incredibly easy to mistype a command. I recently accidentally shut down a server on a Friday night by typing the wrong thing in the wrong window. Having there be two ways to do it (using the reviewed file, or re-entering the command) reduces my confidence that the reviewed lists are going to be the ones actually used. I'll leave the final call to @pgulley as the client.

To summarize: when serious/important/delicate operations are done, I think it's best to have them done in repeatable, self-documenting ways. For example, we're about to set up eight servers. I think it's preferable to automate the process (even if it takes longer) so that it's done consistently, so if (or when) it needs to be redone, the same result is reached.

Hey @philbudne, I agree with the concerns raised and believe it's important to improve on the current approach. Before moving forward with these improvements, I want to ensure I'm not overlooking anything. I'm proposing the following changes to address the concerns:

  1. Use the SQLite3FileTracker to store the names of the files to be deleted. This would mean I need to either modify the SQLite schema to also store the full path of the file to be deleted, or store the full path instead of just the name. Any audit of the records that will be deleted would then be done on the SQLite database instead of on external txt files (see the schema sketch after this comment).

  2. Implement separate processes that will be invoked by two different bash scripts.

    • The first process will populate the database with the list of the files to be deleted (files will have a NOT_STARTED status)
    • The second process will be responsible for the actual deletion: it will read from the database and call the process_file method with the delete option. This will ensure that only WARC files in the list of files to be deleted stored in SQLite will be used for deletion from Elasticsearch. Additionally, for extra safety I could perhaps include another status indicating that we have verified the files in SQLite, ensuring that the second process only starts after confirmation.

Please let me know if this approach makes sense and effectively addresses the key concerns with the current implementation.
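
A minimal sketch of the kind of tracking table point 1 describes (the table and column names here are hypothetical, not the actual SQLite3FileTracker schema):

import sqlite3

conn = sqlite3.connect("arch-eraser.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS files (
        full_path TEXT PRIMARY KEY,                    -- full path of the WARC file
        status    TEXT NOT NULL DEFAULT 'NOT_STARTED'  -- NOT_STARTED / VERIFIED / DELETED
    )
    """
)
conn.commit()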

@pgulley
Member

pgulley commented Feb 4, 2025

@m453h Thanks for laying out that approach- I think using sqlite in the middle sounds fine, and having a singular spot in between the two processes helps make sure that things are done in order. @philbudne , does this address your concerns? I'd love to move on this task asap so we can clear our plates for the new servers.

@philbudne
Contributor

philbudne commented Feb 5, 2025 via email

@m453h
Contributor Author

m453h commented Feb 9, 2025

Thanks @philbudne for the description of the steps. I have made the changes to the implementation; we now have 3 separate scripts (I renamed arch-eraser.py to arch-lister.py, which makes more sense since the class only lists files).
Here is a brief description of the scripts:
Here is a brief description of the scripts:

  • run-arch-warc-lister.sh: this receives a start date, end date and search pattern as input and outputs a file that contains a list of WARC files to be processed; it will be used for Step 1 and Step 4
  • run-arch-url-lister.sh: this receives an indirect file that contains a list of WARC files to process (generated by run-arch-warc-lister.sh) and outputs a list of URLs to be deleted (by default it produces one txt file per WARC file processed)
  • run-arch-eraser.sh: this receives a path containing files with the lists of URLs to delete (generated by run-arch-url-lister.sh) and does the actual deletion from Elasticsearch.

I hope this significantly improves the deletion process. I have also added a few safety features, such as retries when deletion from Elasticsearch fails.

@m453h m453h requested a review from philbudne February 14, 2025 11:32
Contributor

@kilemensi kilemensi left a comment

👍🏽


Since we've made some major changes, can you now include what the various outputs look like (maybe by updating the PR description)?

Comment on lines 199 to 200
assert self.es_client
success, _ = bulk(
Contributor

@kilemensi kilemensi Feb 14, 2025

The structure of these deletes is very similar:

  1. Can't we use the retry_on_status, max_retries, max_backoff, etc. to trigger retries automatically rather than having to create our own retry logic?
  2. Is there any impact if we were to use bulk to delete a single document? Since this code isn't something we'll be running a lot, I'd like it to be as simple as possible so it's easier to come back to it a year from now and still easily understand all code paths. (A minimal bulk-delete sketch follows this list.)
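
For reference, a minimal sketch of the "delete" action shape used with the elasticsearch.helpers.bulk helper; the host, index name and ids below are placeholders:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
ids_to_delete = ["<doc-id-1>", "<doc-id-2>"]  # e.g. values of unique_url_hash(url)

actions = (
    {"_op_type": "delete", "_index": "mc_search-00002", "_id": doc_id}
    for doc_id in ids_to_delete
)
success, errors = bulk(es, actions, raise_on_error=False)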

Contributor Author

  1. This could possibly work, and we could rely entirely on the retry behavior we get from the helper function.
    However, one issue I encountered was that when a request times out, the client doesn't retry the request and instead throws an exception. I did a bit more reading on this, and one way would be to set retry_on_timeout=True by overriding the method we use to create an instance of the Elasticsearch client (see the client-configuration sketch after this list). I'll run some tests to be sure we get consistent behaviour.

  2. I'm not exactly sure whether deleting one document using the bulk helper has the same impact as deleting it using the delete method. I had kept it as an alternative in case bulk operations fail more often when we do the actual deletion (one idea we had initially discussed was to first attempt a bulk delete and, if it fails, fall back to a delete operation for each individual document in the batch), but we could remove this path entirely for the sake of simplicity.
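
A sketch (not the PR's code) of leaning on the client's built-in retry options instead of custom retry loops; the host, timeout and status values here are assumptions, using the 8.x Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch(
    "http://localhost:9200",
    request_timeout=60,                    # per-request timeout in seconds
    max_retries=5,                         # retry a failed request up to 5 times
    retry_on_timeout=True,                 # also retry when a request times out
    retry_on_status=(429, 502, 503, 504),  # HTTP statuses that trigger a retry
)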

Contributor

... and to be 100% clear, I'm not saying the bulk update is the better option (other than it appears to have retry logic built in); I'm saying it's best we evaluate and pick one method that will work well for our case(s) and optimise for that, rather than having 2 methods that may behave completely differently at runtime.

Contributor Author


Sure, from the initial tests I did locally, I would advise we go for the bulk delete option (it significantly reduces the time required to delete documents).

Contributor Author


@kilemensi one limitation I have faced while using the combination of retry_on_status, max_retries, max_backoff is that the deletion process will halt with an exception when there is complete network failure and Elasticsearch fails to return a response. I believe we still need to retry in this scenario after waiting for a few seconds before exiting.

def _fetch_documents_to_delete(self, urls: List[str]):
    try:
        query = {
            "size": self.fetch_batch_size,
Contributor


I'm surprised to see a query!

If it's just to find the ID of the document to delete, then you can use mcmetadata.urls.unique_url_hash() on the URL.

Any reason to use the original_url field (below)?

It ended up in ES because it's one of the outputs of mcmetadata, and in normal operation it should be identical to url, and I don't think it's actually used by anything (if it was possible to remove fields in the move to the new cluster, it would be one of two I'd suggest tossing, the other is full_language)!
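
For reference, a one-line sketch of deriving the document _id directly from the URL, as suggested above (assumes the mcmetadata package referenced in this thread is importable):

from mcmetadata.urls import unique_url_hash

doc_id = unique_url_hash("https://example.com/some/story")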

Contributor Author


Thanks for the added context, @philbudne! I have made the change to use the ID constructed from mcmetadata.urls.unique_url_hash().

Regarding why we have a query: it is to retrieve both the index and the ID of the document, as Elasticsearch requires this information for deletion. Since the stories could be in either mc_search-00002 or mc_search-00003, I thought retrieving the story by its original_url to identify its index before constructing the list of stories for bulk_delete would be the most appropriate approach.

An alternative to prefetching the stories to determine their index would be to use the delete_by_query API. However, this can potentially use the scroll API when handling large datasets, adding more load to Elasticsearch. That's why I opted for a query using search_after and PIT.

I’m definitely open to exploring any other approach you think could improve this!
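
For context, a rough sketch (assuming the 8.x Python client; index pattern and field name taken from the discussion above) of the PIT + search_after lookup described here:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fetch_docs_to_delete(urls, batch_size=1000):
    # Resolve (index, _id) pairs for a batch of URLs, paging with search_after
    # inside a point-in-time so results stay consistent across pages.
    pit_id = es.open_point_in_time(index="mc_search-*", keep_alive="1m")["id"]
    search_after = None
    try:
        while True:
            extra = {"search_after": search_after} if search_after else {}
            resp = es.search(
                size=batch_size,
                query={"terms": {"original_url": urls}},
                pit={"id": pit_id, "keep_alive": "1m"},
                sort=[{"_shard_doc": "asc"}],  # stable sort required for search_after
                source=False,
                **extra,
            )
            hits = resp["hits"]["hits"]
            if not hits:
                break
            for hit in hits:
                yield hit["_index"], hit["_id"]
            search_after = hits[-1]["sort"]
    finally:
        es.close_point_in_time(id=pit_id)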

Contributor


@m453h haven't studied delete_by_query, but only one document should EVER match the "id" within a single index, and we try not to index a URL more than once by doing an id query against all indices before trying to import a story, so if the delete_by_query is limited to index ...02 and ...03 the worst case is two documents.

My concern is that the separate lookup and delete will have longer latency (two round trips thru the API) and more impact on the ES servers (locating the document twice). On the other hand I could be worrying about nothing!

Contributor Author


@philbudne I’ve updated the implementation to use delete_by_query. Could you take a look and let me know if this approach looks better? 🤞

Also, I noticed from the docs that we can throttle requests, which might help prevent the deletion process from overwhelming the cluster. I was thinking this could also give us a more accurate way to estimate the total deletion time. I've currently not set any throttling; perhaps determining a realistic rate for document removal and setting it as the default value could be useful.
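
An illustrative sketch (not the PR's exact code) of a throttled delete_by_query call for one batch; the host, index names and requests_per_second value are assumptions:

from elasticsearch import Elasticsearch
from mcmetadata.urls import unique_url_hash

es = Elasticsearch("http://localhost:9200")

def delete_batch(urls):
    ids = [unique_url_hash(u) for u in urls]
    resp = es.delete_by_query(
        index="mc_search-00002,mc_search-00003",  # restrict to the affected indices
        query={"ids": {"values": ids}},           # match documents by _id
        conflicts="proceed",                      # skip version conflicts instead of aborting
        requests_per_second=500,                  # hypothetical throttle rate
        wait_for_completion=True,
    )
    return resp["deleted"]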

@m453h m453h requested a review from philbudne February 18, 2025 14:28
Contributor

@philbudne philbudne left a comment

Looks good!

@m453h
Contributor Author

m453h commented Feb 24, 2025

Hey @pgulley, could you have a final look to confirm whether this is ready for merging? I believe after merging the next steps would be to confirm that the list of WARC files I generated in #353 is correct and to decide when it would be appropriate to run the script for the actual deletion.

Member

@pgulley pgulley left a comment

yes, looks great to me!

@m453h m453h merged commit caf6306 into mediacloud:main Feb 26, 2025
@m453h m453h deleted the arch-eraser-sketch branch February 26, 2025 15:32