
Index Repair Processes (Backfill-ectomy, fix canonical_url) #368

Open
4 of 6 tasks
pgulley opened this issue Feb 26, 2025 · 7 comments
@pgulley
Member

pgulley commented Feb 26, 2025

This is a meta issue for two kinds of invasive surgery we've prepared scripts to run on our production ES index. Reorganizing into a single issue for the actual execution.

  1. The Backfillectomy (#353, "Re-filling the Feb-May 2022 'dip' using canonical URL extraction"): We have a range of dates where our original attempt at coverage left us with a noticeable dip. We determined that we could use the canonical URL extraction method to refill that dip, but we need to delete all of the data from the first attempt in order not to introduce duplicates. (A rough sketch of the canonical URL extraction idea follows the task list below.)

  2. Fix Canonical Domain (#345): We indexed a small number of stories with the wrong canonical_url field. We have a script to address this (from #348); it just needs to be run.

  • Run the Canonical domain fix @m453h
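
For reference, here is a rough sketch of the canonical URL extraction idea behind #353: pull <link rel="canonical"> out of the story HTML, falling back to og:url. This is not the production extraction code, just an illustration, and it assumes BeautifulSoup is available:

# Illustrative only: canonical-URL extraction of the kind referenced in #353.
# The real pipeline uses its own extraction code; library and fallback choices
# here are assumptions.
from bs4 import BeautifulSoup

def extract_canonical_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    # prefer an explicit <link rel="canonical" href="...">
    link = soup.find("link", rel="canonical", href=True)
    if link:
        return link["href"]
    # fall back to the Open Graph URL if present
    og = soup.find("meta", property="og:url", content=True)
    if og:
        return og["content"]
    return None

print(extract_canonical_url(
    '<html><head><link rel="canonical" href="https://example.com/story"></head></html>'
))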
@philbudne
Contributor

Here are my notes from looking at the listings in @m453h's delete_list.tar.gz
(the .base files had the URL scheme & bucket removed for comparison; a sketch of that stripping follows these notes):

### from CSVs:

# S3 only, use S3
#s3/container-cf94b52abe5a-2024-05-27_2024-05-28.txt.base
#s3/container-cefd3fdce464-2024-05-29_2024-06-04.txt.base

# S3 has 772? more (out of 2138) use S3 [at least for those?]
#diff s3/mccsv-s3-mc_search-00002-2024-05-22_2024-06-22.txt.base b2/mccsv-b2-mc_search-00002-2024-05-31_2024-06-22.txt.base

# S3 only, use S3
#s3/container-cefd3fdce464-2024-06-05.txt.base

# identical, use B2
#diff -u s3/container-0c501ed61cf4-2024-06-05.txt.base b2/container-0c501ed61cf4-2024-06-05.txt.base 
#diff -u s3/container-446d55936e82-2024-06-05.txt.base b2/container-446d55936e82-2024-06-05.txt.base
#diff -u s3/container-0c501ed61cf4-2024-06-06_2024-06-09.txt.base b2/container-0c501ed61cf4-2024-06-06_2024-06-09.txt.base
#diff -u s3/container-7e1b47c305f1-2024-06-09.txt.base b2/container-7e1b47c305f1-2024-06-09.txt.base

# B2 has two more archives, use B2
#diff -u s3/container-6c55aaf9daaa-2024-06-11_2024-06-20.txt.base b2/container-6c55aaf9daaa-2024-06-11-2024-06-20.txt.base

# identical, use B2
#diff -u s3/mccsv-s3-mc_search-00003-2024-06-23_2024-06-27.txt.base b2/mccsv-b2-mc_search-00003-2024-06-23_2024-06-27.txt.base

### from RSS:

# B2 has one more archive, use B2
#diff -u s3/mcrss-s3-mc_search-00002-2024-06-20_2024-06-22.txt.base b2/mcrss-b2-mc_search-00002-2024-06-20_2024-06-22.txt.base

# B2 has 18 more, use B2
#diff -u s3/mcrss-s3-mc_search-00003-2024-06-23_2024-08-16.txt.base b2/mcrss-b2-mc_search-00003-2024-06-23_2024-08-16.txt.base
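
For context, the .base files above could be produced by stripping the scheme and bucket from each listed URL so the S3 and B2 listings can be diffed key-by-key. A sketch of that step, with placeholder file names:

# Sketch: turn a listing of s3://bucket/key or b2://bucket/key URLs into a
# ".base" file containing only the object keys, for diffing across providers.
# File names below are placeholders.
from urllib.parse import urlparse

def to_base(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            url = line.strip()
            if not url:
                continue
            # drop "s3://<bucket>" / "b2://<bucket>", keep just the object key
            fout.write(urlparse(url).path.lstrip("/") + "\n")

to_base("s3/container-0c501ed61cf4-2024-06-05.txt",
        "s3/container-0c501ed61cf4-2024-06-05.txt.base")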

@philbudne
Contributor

philbudne commented Feb 27, 2025

On the "how long will it take" front, looking at the archives we need to load after the deletes:

pbudne@ifill:~$ b2 ls --recursive mediacloud-indexer-archive 2024/12 | grep mchist2022 | wc -l
12327

It took 7.6 seconds to fetch one:

pbudne@ifill:~$ time b2 download-file b2://mediacloud-indexer-archive/2024/12/31/mchist2022-20241231210622-4180-9d4f17f02953.warc.gz  zzz
File name:           2024/12/31/mchist2022-20241231210622-4180-9d4f17f02953.warc.gz
...
Output file path:    /nfs/ang/users/pbudne/zzz
File size:           200174489
Content type:        binary/octet-stream
...
200M/200M [00:06<00:00, 29.3MB/s]

real     0m7.603s
user    0m1.914s
sys	   0m0.643s

With just one arch-queuer running:

12327 archives x 8 sec = 98616 seconds = 1643 minutes = 27 hours
so do-able without pre-downloading the archives....
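
Restating that back-of-envelope estimate with the queuer count as a parameter (assuming ~8 s per archive per queuer and roughly linear scaling, network permitting):

# Back-of-envelope load-time estimate, restating the arithmetic above.
def estimated_hours(archives: int, secs_per_archive: float, queuers: int) -> float:
    return archives * secs_per_archive / queuers / 3600

print(estimated_hours(12327, 8, 1))  # ~27.4 hours with a single arch-queuer
print(estimated_hours(12327, 8, 3))  # ~9.1 hours if three queuers scale cleanly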

@pgulley
Member Author

pgulley commented Feb 27, 2025

That looks good to me, let's hit go.

@m453h
Contributor

m453h commented Feb 28, 2025

Thanks a lot @philbudne for the thorough review.

I had two questions before starting the actual process:

  1. Would processing the WARC files to get the URLs for deletion before Saturday have any negative impact on our end? (I was thinking we could use Saturday to just remove the stories from ES.)

  2. Should the script be executed on ifill, and how many queuers should we start? (It looks doable with 1 queuer, but starting with >= 3 queuers might save us a couple of hours.)

@philbudne
Contributor

philbudne commented Feb 28, 2025 via email

@pgulley
Member Author

pgulley commented Feb 28, 2025

Looks like the right process to me! Titrating the number of processes as we get a sense of the network impact is smart.

@m453h
Contributor

m453h commented Mar 2, 2025

I tested deleting URLs from a processed WARC file (mccsv-20240612203625-26-2009dc0222f3.warc.gz.txt) containing 5,000 URLs. The process took 16 seconds, using batch deletions of 1,000 URLs at a time with random delays of 0.5 to 3 seconds between batches.

Considering all extracted WARC files, we have a total of 35,172,667 URLs. Based on this, the process will take approximately 112,553 seconds, or about 31.26 hours.
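
For reference, a minimal sketch of that batching pattern (not the actual script from #348; the index pattern, URL field name, endpoint, and use of the elasticsearch-py 8.x client are all assumptions here):

# Sketch of batched URL deletion with random inter-batch delays, as described
# above. Index pattern, field name, and endpoint are placeholders.
import random
import time

from elasticsearch import Elasticsearch

BATCH_SIZE = 1000

def delete_urls(es: Elasticsearch, index: str, urls: list[str]) -> None:
    for start in range(0, len(urls), BATCH_SIZE):
        batch = urls[start:start + BATCH_SIZE]
        es.delete_by_query(
            index=index,
            query={"terms": {"url": batch}},
            conflicts="proceed",  # don't abort on version conflicts
        )
        time.sleep(random.uniform(0.5, 3.0))  # spread the load on the cluster

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
with open("mccsv-20240612203625-26-2009dc0222f3.warc.gz.txt") as f:
    urls = [line.strip() for line in f if line.strip()]
delete_urls(es, "mc_search-*", urls)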
