
Add ability to remove archive documents from ElasticSearch using original URL #364

Merged: 24 commits into mediacloud:main from the arch-eraser-sketch branch on Feb 26, 2025

Conversation

Contributor

@m453h m453h commented Jan 27, 2025

This PR modifies the Arch Eraser sketch to add the ability to delete documents from Elasticsearch.

The deletion is a three-step process carried out by three different scripts:

1. Creating a list of WARC files to be processed:

This is done using run-arch-warc-lister.sh. The script takes a start date, end date, search pattern, and output file path as input, and generates a file listing the WARC files that will be processed in the next step.

The script can be invoked as follows:

./bin/run-arch-warc-lister.sh <start_date> <end_date> <pattern> <output>

The script arguments are described below:

Argument       Description
<start_date>   Start date for filtering files (format: YYYY/MM/DD)
<end_date>     End date for filtering files (format: YYYY/MM/DD)
<pattern>      String pattern used to construct file paths (e.g., 'b2://archives/{pattern}/mchist2022')
<output>       Path to the output file where the archive list will be written

Example:

./bin/run-arch-warc-lister.sh 2024/05/27 2024/05/27 "s3://mediacloud-indexer-archive/{pattern}/mc" file-1.txt 

This will output the list of WARC files in the path:

<PROJECT_ROOT_DIR>/data/arch-lister/warc_list/file-1.txt

2. Creating a list of URLs to be deleted:

This is done using run-arch-url-lister.sh. The script receives an indirect file (a file containing the list of WARC files to process, generated in step 1) and outputs the URLs to be deleted; by default it produces one txt file per WARC file processed.

The script can be invoked as follows:

./bin/run-arch-url-lister.sh <input_file_path> [output_file_path]

The script arguments are described below:

Argument / Option    Description
<input_file_path>    Path to an indirect file (a file that contains a list of WARC files to process)
[output_file_path]   Optional. Path to the output file for the URL list. Default: WARC_FILE_NAME.txt

Example:

./bin/run-arch-url-lister.sh data/arch-lister/warc_list/file-1.txt

This will output one txt file for each WARC file listed in file-1.txt under the path:

<PROJECT_ROOT_DIR>/data/arch-lister/url_list/

3. Deletion of documents from Elasticsearch

This is done using run-arch-eraser.sh. The script takes a directory containing the URL-list files (generated in step 2), along with Elasticsearch-specific arguments, and carries out the deletion.

The script can be invoked as follows:

 ./bin/run-arch-eraser.sh <path_to_url_list_file> [OPTIONS]

The script arguments and options are described below:

Argument / Option          Description
<path_to_url_list_file>    Path to the folder containing the files with the lists of URLs to delete
--elasticsearch-hosts      Elasticsearch host URL(s)
--indices                  Names of the Elasticsearch indices to delete documents from
--min-delay                Minimum time to wait between delete operations (default: 0.5 seconds)
--max-delay                Maximum time to wait between delete operations (default: 3.0 seconds)
--batch                    Number of documents to send in a single delete request to Elasticsearch (default: 1000)

Example:

  ./bin/run-arch-eraser.sh  data/arch-lister/url_list --elasticsearch-hosts=http://localhost:9200 --indices=index1,index2 --min-delay=1 --max-delay=3 --batch=1000
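
For illustration only, a rough Python sketch (not the script's actual code; delete_batch here is a hypothetical stand-in for the Elasticsearch delete request) of how --batch, --min-delay and --max-delay interact:

import random
import time

BATCH, MIN_DELAY, MAX_DELAY = 1000, 0.5, 3.0  # the documented defaults

def erase(urls, delete_batch):
    # Group the URLs into batches, issue one delete request per batch,
    # and sleep a random interval between requests so the cluster is not overwhelmed.
    for start in range(0, len(urls), BATCH):
        delete_batch(urls[start:start + BATCH])
        time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))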

Addresses #353

@m453h m453h marked this pull request as ready for review January 27, 2025 07:08
@m453h m453h requested a review from thepsalmist January 27, 2025 07:35
Contributor

@thepsalmist thepsalmist left a comment

Mostly LGTM 👍

Contributor

@philbudne philbudne left a comment

I think it would be safer to have the script prepare files with lists of WARC files (to be used with @file on the command line) rather than running the delete: that way the lists can be prepared and inspected before the deletion is done!

@m453h m453h force-pushed the arch-eraser-sketch branch from 751e7da to 012f837 Compare January 29, 2025 17:53
@m453h
Contributor Author

m453h commented Jan 29, 2025

I have run the script to get the WARC files from B2 as follows:

./bin/run-arch-eraser.sh --generate-erase-list 2024/12/14 2024/12/31 "b2://mediacloud-indexer-archive/{pattern}/mchist2022" --output-file=arch-eraser/erase-list.txt 

Attached is the list of files we expect to process and delete from Elasticsearch:

erase-list.txt

@philbudne
Contributor

@m453h the list should include more files (ones with just mc- prefix and ones with mcrss- prefix), but I think that discussion can take place in issue #353

I think we should have many eyes examine the generated lists before doing the erasure: @pgulley, should we add some assignees to issue 353 (above) as we continue the process?

@m453h m453h requested a review from thepsalmist January 30, 2025 12:50
@philbudne
Contributor

I've put notes on how to find the WARC files that need to be "erased" at #353 (comment)

I think all discussion of the "erase lists" belongs in issue #353 rather than here

Contributor

@kilemensi kilemensi left a comment

🚀

Contributor

@thepsalmist thepsalmist left a comment

🚀 LGTM

@m453h m453h requested a review from pgulley February 3, 2025 13:35
@philbudne philbudne self-requested a review February 3, 2025 16:18
Contributor

@philbudne philbudne left a comment

Some suggestions on simplifications

@philbudne
Contributor

philbudne commented Feb 4, 2025 via email

@m453h
Contributor Author

m453h commented Feb 4, 2025

As an old Unix programmer I think programs should do one thing, and as an old Python programmer, I think there should be one way to do things.

My concerns are: I'd like to be able to review what is going to be removed: to see the files that contain the names of the files that will be processed, not the list of files that will be processed if almost exactly the same command is later entered. It's incredibly easy to mistype a command. I recently accidentally shut down a server on a Friday night by typing the wrong thing in the wrong window. Having there be two ways to do it (using the reviewed file, or re-entering the command) reduces my confidence that the reviewed lists are going to be the ones actually used. I'll leave the final call to @pgulley as the client.

To summarize: when serious/important/delicate operations are done, I think it's best to have them done in repeatable, self-documenting ways. For example, we're about to set up eight servers. I think it's preferable to automate the process (even if it takes longer) so that it's done consistently, so if (or when) it needs to be redone, the same result is reached.

Hey @philbudne, I agree with the concerns raised and believe it's important to improve on the current approach. Before moving forward with these improvements, I want to ensure I'm not overlooking anything. I'm proposing the following changes to address the concerns:

  1. Use the SQLite3FileTracker to store the names of the files to be deleted. This would mean I need to either modify the SQLite schema to also store the full path of the file to be deleted, or store the full path instead of just the name. Any audit of the records that will be deleted would then be done on the SQLite database instead of on external txt files (see the schema sketch after this comment).

  2. Implement separate processes that will be invoked by two different bash scripts.

    • The first process will populate the database with the list of the files to be deleted (files will have a NOT_STARTED status)
    • The second process will be responsible for the actual deletion: it will read from the database and call the process_file method with the delete option. This will ensure that only WARC files in the list of files to be deleted stored in SQLite will be used for deletion from Elasticsearch. Additionally, for extra safety I could perhaps include another status indicating that we have verified the files in SQLite, ensuring that the second process only starts after confirmation.

Please let me know if this approach makes sense and effectively addresses the key concerns with the current implementation.
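
A minimal sketch of the kind of tracking table point 1 describes (the table and column names here are hypothetical, not the actual SQLite3FileTracker schema):

import sqlite3

conn = sqlite3.connect("arch-eraser.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS files (
        full_path TEXT PRIMARY KEY,                    -- full path of the WARC file
        status    TEXT NOT NULL DEFAULT 'NOT_STARTED'  -- NOT_STARTED / VERIFIED / DELETED
    )
    """
)
conn.commit()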

@pgulley
Member

pgulley commented Feb 4, 2025

@m453h Thanks for laying out that approach- I think using sqlite in the middle sounds fine, and having a singular spot in between the two processes helps make sure that things are done in order. @philbudne , does this address your concerns? I'd love to move on this task asap so we can clear our plates for the new servers.

@philbudne
Contributor

philbudne commented Feb 5, 2025 via email

@m453h
Contributor Author

m453h commented Feb 9, 2025

Thanks @philbudne for the description of the steps. I have made the changes to the implementation; we now have 3 separate scripts (I renamed arch-eraser.py to arch-lister.py, which makes more sense since the class only lists files).
Here is a brief description of the scripts:
Here is a brief description of the scripts:

  • run-arch-warc-lister.sh: this receives a start date, end date and search pattern as input and outputs a file that contains a list of WARC files to be processed; it will be used for Step 1 and Step 4
  • run-arch-url-lister.sh: this receives an indirect file that contains a list of WARC files to process (generated by run-arch-warc-lister.sh) and outputs a list of URLs to be deleted (by default it produces one txt file per WARC file processed)
  • run-arch-eraser.sh: this receives a path containing files with the lists of URLs to delete (generated by run-arch-url-lister.sh) and does the actual deletion from Elasticsearch.

I hope this significantly improves the deletion process. I have also added a few safety features, such as retries when deletion from Elasticsearch fails.

@m453h m453h requested a review from philbudne February 14, 2025 11:32
Contributor

@kilemensi kilemensi left a comment

👍🏽


Since we've made some major changes, can you now include what the various outputs look like (maybe by updating the PR description)?

Comment on lines 199 to 200
assert self.es_client
success, _ = bulk(
Contributor

@kilemensi kilemensi Feb 14, 2025

The structure of these deletes is very similar:

  1. Can't we use the retry_on_status, max_retries, max_backoff, etc. to trigger retries automatically rather than having to create our own retry logic?
  2. Is there any impact if we were to use bulk to delete a single document? Since this code isn't something we'll be running a lot, I'd like it to be as simple as possible so it's easier to come back to it a year from now and still easily understand all code paths. (A minimal bulk-delete sketch follows this list.)
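
For reference, a minimal sketch of the "delete" action shape used with the elasticsearch.helpers.bulk helper; the host, index name and ids below are placeholders:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
ids_to_delete = ["<doc-id-1>", "<doc-id-2>"]  # e.g. values of unique_url_hash(url)

actions = (
    {"_op_type": "delete", "_index": "mc_search-00002", "_id": doc_id}
    for doc_id in ids_to_delete
)
success, errors = bulk(es, actions, raise_on_error=False)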

Contributor Author

  1. This could possibly work, and we could rely entirely on the retry behavior we get from the helper function.
    However, one issue I encountered was that when a request times out, the client doesn't retry the request and instead throws an exception. I did a bit more reading on this, and one way would be to set retry_on_timeout=True by overriding the method we use to create an instance of the Elasticsearch client (see the client-configuration sketch after this list). I'll run some tests to be sure we get consistent behaviour.

  2. I'm not exactly sure whether deleting one document using the bulk helper has the same impact as deleting it using the delete method. I had kept it as an alternative in case bulk operations fail more often when we do the actual deletion (one idea we had initially discussed was to first attempt a bulk delete and, if it fails, fall back to a delete operation for each individual document in the batch), but we could remove this path entirely for the sake of simplicity.
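
A sketch (not the PR's code) of leaning on the client's built-in retry options instead of custom retry loops; the host, timeout and status values here are assumptions, using the 8.x Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch(
    "http://localhost:9200",
    request_timeout=60,                    # per-request timeout in seconds
    max_retries=5,                         # retry a failed request up to 5 times
    retry_on_timeout=True,                 # also retry when a request times out
    retry_on_status=(429, 502, 503, 504),  # HTTP statuses that trigger a retry
)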

Contributor

... and to be 100% clear, I'm not saying the bulk update is the better option (other than it appears to have retry logic built in); I'm saying it's best we evaluate and pick one method that will work well for our case(s) and optimise for that, rather than having 2 methods that may behave completely differently at runtime.

Contributor Author


Sure, from the initial tests I did locally, I would advise we go for the bulk delete option (it significantly reduces the time required to delete documents).

Contributor Author


@kilemensi one limitation I have faced while using the combination of retry_on_status, max_retries, max_backoff is that the deletion process will halt with an exception when there is complete network failure and Elasticsearch fails to return a response. I believe we still need to retry in this scenario after waiting for a few seconds before exiting.

def _fetch_documents_to_delete(self, urls: List[str]):
    try:
        query = {
            "size": self.fetch_batch_size,
Contributor


I'm surprised to see a query!

If it's just to find the ID of the document to delete, then you can use mcmetadata.urls.unique_url_hash() on the URL.

Any reason to use the original_url field (below)?

It ended up in ES because it's one of the outputs of mcmetadata, and in normal operation it should be identical to url, and I don't think it's actually used by anything (if it was possible to remove fields in the move to the new cluster, it would be one of two I'd suggest tossing, the other is full_language)!
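
For reference, a one-line sketch of deriving the document _id directly from the URL, as suggested above (assumes the mcmetadata package referenced in this thread is importable):

from mcmetadata.urls import unique_url_hash

doc_id = unique_url_hash("https://example.com/some/story")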

Contributor Author


Thanks for the added context, @philbudne! I have made the change to use the ID constructed from mcmetadata.urls.unique_url_hash().

Regarding why we have a query: it is to retrieve both the index and the ID of the document, as Elasticsearch requires this information for deletion. Since the stories could be in either mc_search-00002 or mc_search-00003, I thought retrieving the story by its original_url to identify its index before constructing the list of stories for bulk_delete would be the most appropriate approach.

An alternative to prefetching the stories to determine their index would be to use the delete_by_query API. However, this can potentially use the scroll API when handling large datasets, adding more load to Elasticsearch. That's why I opted for a query using search_after and PIT.

I’m definitely open to exploring any other approach you think could improve this!
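
For context, a rough sketch (assuming the 8.x Python client; index pattern and field name taken from the discussion above) of the PIT + search_after lookup described here:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fetch_docs_to_delete(urls, batch_size=1000):
    # Resolve (index, _id) pairs for a batch of URLs, paging with search_after
    # inside a point-in-time so results stay consistent across pages.
    pit_id = es.open_point_in_time(index="mc_search-*", keep_alive="1m")["id"]
    search_after = None
    try:
        while True:
            extra = {"search_after": search_after} if search_after else {}
            resp = es.search(
                size=batch_size,
                query={"terms": {"original_url": urls}},
                pit={"id": pit_id, "keep_alive": "1m"},
                sort=[{"_shard_doc": "asc"}],  # stable sort required for search_after
                source=False,
                **extra,
            )
            hits = resp["hits"]["hits"]
            if not hits:
                break
            for hit in hits:
                yield hit["_index"], hit["_id"]
            search_after = hits[-1]["sort"]
    finally:
        es.close_point_in_time(id=pit_id)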

Contributor


@m453h haven't studied delete_by_query, but only one document should EVER match the "id" within a single index, and we try not to index a URL more than once by doing an id query against all indices before trying to import a story, so if the delete_by_query is limited to index ...02 and ...03 the worst case is two documents.

My concern is that the separate lookup and delete will have longer latency (two round trips thru the API) and more impact on the ES servers (locating the document twice). On the other hand I could be worrying about nothing!

Contributor Author


@philbudne I’ve updated the implementation to use delete_by_query. Could you take a look and let me know if this approach looks better? 🤞

Also, I noticed from the docs that we can throttle requests, which might help prevent the deletion process from overwhelming the cluster. I was thinking this could also give us a more accurate way to estimate the total deletion time. I've currently not set any throttling; perhaps determining a realistic rate for document removal and setting it as the default value could be useful.
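
An illustrative sketch (not the PR's exact code) of a throttled delete_by_query call for one batch; the host, index names and requests_per_second value are assumptions:

from elasticsearch import Elasticsearch
from mcmetadata.urls import unique_url_hash

es = Elasticsearch("http://localhost:9200")

def delete_batch(urls):
    ids = [unique_url_hash(u) for u in urls]
    resp = es.delete_by_query(
        index="mc_search-00002,mc_search-00003",  # restrict to the affected indices
        query={"ids": {"values": ids}},           # match documents by _id
        conflicts="proceed",                      # skip version conflicts instead of aborting
        requests_per_second=500,                  # hypothetical throttle rate
        wait_for_completion=True,
    )
    return resp["deleted"]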

@m453h m453h requested a review from philbudne February 18, 2025 14:28
Contributor

@philbudne philbudne left a comment

Looks good!

@m453h
Contributor Author

m453h commented Feb 24, 2025

Hey @pgulley, could you have a final look to confirm whether this is ready for merging? I believe after merging the next steps would be to confirm that the list of WARC files I generated in #353 is correct and to decide when it would be appropriate to run the script for the actual deletion.

Member

@pgulley pgulley left a comment

yes, looks great to me!

@m453h m453h merged commit caf6306 into mediacloud:main Feb 26, 2025
@m453h m453h deleted the arch-eraser-sketch branch February 26, 2025 15:32