
Index Repair Processes (Backfill-ectomy, fix canonical_url) #368

Open
4 of 6 tasks
pgulley opened this issue Feb 26, 2025 · 7 comments
@pgulley
Member

pgulley commented Feb 26, 2025

This is a meta issue for two kinds of invasive surgery we've prepared scripts to run on our production ES index. Reorganizing into a single issue for the actual execution.

  1. The Backfillectomy (#353, "Re-filling the Feb-May 2022 'dip' using canonical URL extraction"): We have a range of dates where our original attempt at coverage left us with a noticeable dip. We determined that we could use the canonical URL extraction method to refill that dip, but we need to delete all of the data from the first attempt in order not to introduce duplicates. (A rough sketch of the canonical URL extraction idea follows the task list below.)

  2. Fix Canonical Domain (#345): We indexed a small number of stories with the wrong canonical_url field. We have a script to address this (from #348); it just needs to be run.

  • Run the Canonical domain fix @m453h
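
For reference, here is a rough sketch of the canonical URL extraction idea behind #353: pull <link rel="canonical"> out of the story HTML, falling back to og:url. This is not the production extraction code, just an illustration, and it assumes BeautifulSoup is available:

# Illustrative only: canonical-URL extraction of the kind referenced in #353.
# The real pipeline uses its own extraction code; library and fallback choices
# here are assumptions.
from bs4 import BeautifulSoup

def extract_canonical_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    # prefer an explicit <link rel="canonical" href="...">
    link = soup.find("link", rel="canonical", href=True)
    if link:
        return link["href"]
    # fall back to the Open Graph URL if present
    og = soup.find("meta", property="og:url", content=True)
    if og:
        return og["content"]
    return None

print(extract_canonical_url(
    '<html><head><link rel="canonical" href="https://example.com/story"></head></html>'
))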
@philbudne
Contributor

Here are my notes from looking at the listings in @m453h's delete_list.tar.gz
(the .base files had the URL scheme & bucket removed for comparison; a sketch of that stripping follows these notes):

### from CSVs:

# S3 only, use S3
#s3/container-cf94b52abe5a-2024-05-27_2024-05-28.txt.base
#s3/container-cefd3fdce464-2024-05-29_2024-06-04.txt.base

# S3 has 772? more (out of 2138) use S3 [at least for those?]
#diff s3/mccsv-s3-mc_search-00002-2024-05-22_2024-06-22.txt.base b2/mccsv-b2-mc_search-00002-2024-05-31_2024-06-22.txt.base

# S3 only, use S3
#s3/container-cefd3fdce464-2024-06-05.txt.base

# identical, use B2
#diff -u s3/container-0c501ed61cf4-2024-06-05.txt.base b2/container-0c501ed61cf4-2024-06-05.txt.base 
#diff -u s3/container-446d55936e82-2024-06-05.txt.base b2/container-446d55936e82-2024-06-05.txt.base
#diff -u s3/container-0c501ed61cf4-2024-06-06_2024-06-09.txt.base b2/container-0c501ed61cf4-2024-06-06_2024-06-09.txt.base
#diff -u s3/container-7e1b47c305f1-2024-06-09.txt.base b2/container-7e1b47c305f1-2024-06-09.txt.base

# B2 has two more archives, use B2
#diff -u s3/container-6c55aaf9daaa-2024-06-11_2024-06-20.txt.base b2/container-6c55aaf9daaa-2024-06-11-2024-06-20.txt.base

# identical, use B2
#diff -u s3/mccsv-s3-mc_search-00003-2024-06-23_2024-06-27.txt.base b2/mccsv-b2-mc_search-00003-2024-06-23_2024-06-27.txt.base

### from RSS:

# B2 has one more archive, use B2
#diff -u s3/mcrss-s3-mc_search-00002-2024-06-20_2024-06-22.txt.base b2/mcrss-b2-mc_search-00002-2024-06-20_2024-06-22.txt.base

# B2 has 18 more, use B2
#diff -u s3/mcrss-s3-mc_search-00003-2024-06-23_2024-08-16.txt.base b2/mcrss-b2-mc_search-00003-2024-06-23_2024-08-16.txt.base
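
For context, the .base files above could be produced by stripping the scheme and bucket from each listed URL so the S3 and B2 listings can be diffed key-by-key. A sketch of that step, with placeholder file names:

# Sketch: turn a listing of s3://bucket/key or b2://bucket/key URLs into a
# ".base" file containing only the object keys, for diffing across providers.
# File names below are placeholders.
from urllib.parse import urlparse

def to_base(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            url = line.strip()
            if not url:
                continue
            # drop "s3://<bucket>" / "b2://<bucket>", keep just the object key
            fout.write(urlparse(url).path.lstrip("/") + "\n")

to_base("s3/container-0c501ed61cf4-2024-06-05.txt",
        "s3/container-0c501ed61cf4-2024-06-05.txt.base")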

@philbudne
Contributor

philbudne commented Feb 27, 2025

On the "how long will it take" front, looking at the archives we need to load after the deletes:

pbudne@ifill:~$ b2 ls --recursive mediacloud-indexer-archive 2024/12 | grep mchist2022 | wc -l
12327

It took 7.6 seconds to fetch one:

pbudne@ifill:~$ time b2 download-file b2://mediacloud-indexer-archive/2024/12/31/mchist2022-20241231210622-4180-9d4f17f02953.warc.gz  zzz
File name:           2024/12/31/mchist2022-20241231210622-4180-9d4f17f02953.warc.gz
...
Output file path:    /nfs/ang/users/pbudne/zzz
File size:           200174489
Content type:        binary/octet-stream
...
200M/200M [00:06<00:00, 29.3MB/s]

real     0m7.603s
user    0m1.914s
sys	   0m0.643s

With just one arch-queuer running:

12327 archives x 8 sec = 98616 seconds = 1643 minutes = 27 hours
so do-able without pre-downloading the archives....
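
Restating that back-of-envelope estimate with the queuer count as a parameter (assuming ~8 s per archive per queuer and roughly linear scaling, network permitting):

# Back-of-envelope load-time estimate, restating the arithmetic above.
def estimated_hours(archives: int, secs_per_archive: float, queuers: int) -> float:
    return archives * secs_per_archive / queuers / 3600

print(estimated_hours(12327, 8, 1))  # ~27.4 hours with a single arch-queuer
print(estimated_hours(12327, 8, 3))  # ~9.1 hours if three queuers scale cleanly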

@pgulley
Member Author

pgulley commented Feb 27, 2025

That looks good to me, let's hit go.

@m453h
Contributor

m453h commented Feb 28, 2025

Thanks a lot @philbudne for the thorough review.

I had two questions before starting the actual process:

  1. Would processing the WARC files to get the URLs for deletion before Saturday have any negative impact on our end? (I was thinking we could use Saturday to just remove the stories from ES.)

  2. Should the script be executed on ifill, and how many queuers should we start? (It looks doable with 1 queuer, but starting with >= 3 queuers might save us a couple of hours.)

@philbudne
Contributor

philbudne commented Feb 28, 2025 via email

@pgulley
Member Author

pgulley commented Feb 28, 2025

Looks like the right process to me! Titrating the number of processes as we get a sense of the network impact is smart.

@m453h
Contributor

m453h commented Mar 2, 2025

I tested deleting URLs from a processed WARC file (mccsv-20240612203625-26-2009dc0222f3.warc.gz.txt) containing 5,000 URLs. The process took 16 seconds, using batch deletions of 1,000 URLs at a time with random delays of 0.5 to 3 seconds between batches.

Considering all extracted WARC files, we have a total of 35,172,667 URLs. Based on this, the process will take approximately 112,553 seconds, or about 31.26 hours.
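
For reference, a minimal sketch of that batching pattern (not the actual script from #348; the index pattern, URL field name, endpoint, and use of the elasticsearch-py 8.x client are all assumptions here):

# Sketch of batched URL deletion with random inter-batch delays, as described
# above. Index pattern, field name, and endpoint are placeholders.
import random
import time

from elasticsearch import Elasticsearch

BATCH_SIZE = 1000

def delete_urls(es: Elasticsearch, index: str, urls: list[str]) -> None:
    for start in range(0, len(urls), BATCH_SIZE):
        batch = urls[start:start + BATCH_SIZE]
        es.delete_by_query(
            index=index,
            query={"terms": {"url": batch}},
            conflicts="proceed",  # don't abort on version conflicts
        )
        time.sleep(random.uniform(0.5, 3.0))  # spread the load on the cluster

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
with open("mccsv-20240612203625-26-2009dc0222f3.warc.gz.txt") as f:
    urls = [line.strip() for line in f if line.strip()]
delete_urls(es, "mc_search-*", urls)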
