Index Repair Processes (Backfill-ectomy, fix canonical_url) #368
Here are my notes from looking at the listings in @m453h's delete_list.tar.gz
On the "how long will it take" front, looking at the archives we need to load after the deletes:
It took 7.6 seconds to fetch one:
With just one arch-queuer running: 12327 archives x 8 sec = 98616 seconds ≈ 1643 minutes ≈ 27 hours
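For anyone re-running the estimate with different numbers (more queuers, a different per-archive time), the arithmetic is just:

```python
# Back-of-the-envelope ETA for re-queuing the archives with one arch-queuer.
# 12327 archives and ~8 s/archive are the figures from the comment above;
# divide by the number of parallel queuers if more than one is run.
archives = 12_327
seconds_per_archive = 8  # observed ~7.6 s, rounded up
queuers = 1

total_seconds = archives * seconds_per_archive / queuers
print(f"{total_seconds:.0f} s = {total_seconds / 60:.0f} min = {total_seconds / 3600:.1f} h")
# -> 98616 s = 1644 min = 27.4 h
```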
That looks good to me, let's hit go.
Thanks a lot @philbudne for the thorough review. I had two questions before starting the actual process:
Yes, collecting the URLs first is a good idea!
If the listing progresses too slowly, collecting first gives us the
chance to let it run, and do all the deletions once the URL files are done.
bernstein.angwin is a server we've reserved for doing batch/background
work for historical ingest, and is currently idle.
The only things to watch are:
1. bernstein doesn't have a huge root partition/disk (this is primarily
a concern when running pipelines under docker, because rabbitmq and
worker-data volumes are there and can fill the disk).
2. The UMass network connection is (ISTR) 40Gbit/s, and we have, at
times, used more than half of it (but should try not to)!
You can view bernstein's network usage in the "Server stats"
dashboard, reported in Bytes/minute; my math says 20Gbit/s would show
up as roughly 150 GByte/minute.
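For reference, converting a sustained bit rate into the Bytes/minute units that dashboard uses:

```python
# How half of the 40 Gbit/s UMass link shows up in a Bytes/minute graph.
bits_per_second = 20e9
bytes_per_minute = bits_per_second / 8 * 60
print(f"{bytes_per_minute:.2e} Bytes/minute")  # 1.50e+11, i.e. ~150 GByte/minute
```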
Other than those constraints, bernstein has 198GB of memory and 16 CPU
cores (hyperthreading is enabled, so the kernel sees 32 CPUs).
I'd start with url-lister, see what the network usage is, and add
processes slowly.
Querying the sqlite3 "tracker" file is helpful to see that things are
going as planned. This will show the most recent activity:
echo "select name, status, datetime(ts, 'unixepoch') from files order by ts desc;" | sqlite3 .../file-tracker.sqlite3
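To watch overall progress rather than just the latest rows, something like this can summarize the tracker (assuming only the files table and columns used in the query above; pass the tracker path on the command line):

```python
import sqlite3
import sys

# Summarize the url-lister tracker DB: how many files are in each status,
# and when the most recent one in that status was touched.
# Usage: python tracker_summary.py .../file-tracker.sqlite3
db = sqlite3.connect(sys.argv[1])
for status, count, latest in db.execute(
    "select status, count(*), datetime(max(ts), 'unixepoch')"
    " from files group by status order by count(*) desc"
):
    print(f"{status:>12}  {count:8d}  latest: {latest}")
db.close()
```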
To confirm: the goal is that for each remote WARC file processed, we
get a separate file with the URLs in that WARC. We'll want to keep
the output files "for the record" even after we erase the stories from
ES.
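For reference, the per-WARC URL listing is conceptually just this; a minimal sketch using warcio, assuming the URL we want is the WARC-Target-URI of each response record (the actual url-lister code may differ):

```python
import sys
from warcio.archiveiterator import ArchiveIterator

# Write one URL per line for every response record in a (gzipped) WARC file.
# Usage: python list_urls.py mccsv-....warc.gz > mccsv-....warc.gz.txt
with open(sys.argv[1], "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            if url:
                print(url)
```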
Looks like the right process to me! Titrating the number of processes as we get a sense of the network impact is smart.
I tested deleting URLs from a processed WARC file (mccsv-20240612203625-26-2009dc0222f3.warc.gz.txt) containing 5,000 URLs. The process took 16 seconds, using batch deletions of 1,000 URLs at a time with random delays of 0.5 to 3 seconds between batches. Considering all extracted WARC files, we have a total of 35,172,667 URLs. Based on this, the process will take approximately 112,553 seconds, or about 31.26 hours.
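For the record, a sketch of that deletion loop, assuming the elasticsearch Python client; the index name, URL field, and endpoint below are placeholders, and the real script may differ:

```python
import random
import sys
import time

from elasticsearch import Elasticsearch

BATCH_SIZE = 1000
INDEX = "mc_search-*"   # placeholder; use the real index/alias
URL_FIELD = "url"       # placeholder; whichever field holds the story URL

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint


def delete_urls(path: str) -> None:
    """Delete stories whose URL appears in the given listing file, 1000 at a time."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i : i + BATCH_SIZE]
        # 8.x client shown; with the 7.x client, pass body={"query": {...}} instead.
        es.delete_by_query(
            index=INDEX,
            query={"terms": {URL_FIELD: batch}},
            conflicts="proceed",
        )
        # random pause between batches to spread the load, as in the test above
        time.sleep(random.uniform(0.5, 3.0))


if __name__ == "__main__":
    for listing in sys.argv[1:]:
        delete_urls(listing)
```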
This is a meta issue for two kinds of invasive surgery we've prepared scripts for, to be run on our production ES index. Reorganizing into a single issue for the actual work:
1. Backfill-ectomy: remove the backfilled stories slated for deletion (the delete_list WARC/URL listings discussed above) from the index.
2. Fix Canonical Domain (#345): We indexed a small number of stories with the wrong canonical_url field. We have a script to address this (from #348); it just needs to be run.