
Improve processing efficiency in check_parked.sh #409

Open
wants to merge 6 commits into base: main

Conversation

@friendly-bits commented Jan 12, 2025

Hi, here are 2 fairly simple modifications which should improve efficiency:

  • avoid creating a subshell in a loop (creating subshells is slow, so it's better to avoid them in loops; see the sketch below)
  • replace the slow builtin 'while read -r' loop + calls to 'sed -i' used to strip subdomains with an awk script
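To illustrate the first point, here's a minimal sketch (not the actual code from check_parked.sh) of doing the same per-line work with parameter expansion instead of a command substitution, which forks a subshell on every iteration:

```sh
# Slow: each $(...) forks a subshell, once per line
while read -r domain; do
    tld="$(printf '%s' "$domain" | sed 's/.*\.//')"
    printf '%s\n' "$tld"
done < domains.tmp

# Faster: pure parameter expansion, no subshell per iteration
while read -r domain; do
    tld="${domain##*.}"
    printf '%s\n' "$tld"
done < domains.tmp
```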

Functionally this should work the same as before, just somewhat faster.

@jarelllama (Owner)

Thanks! I also got to check why my test workflow isn't running for this pull request. It tests the functionality of the parked check.

@friendly-bits (Author) commented Jan 12, 2025

Unfortunately, I have zero experience with GitHub Actions, so I couldn't test the script. I did test the awk code in isolation and it seemed fine.

The script now checks for gawk's presence - possible that it's missing?
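For reference, the kind of presence check I mean is just something like (a sketch; the actual check in the script may look different):

```sh
# Abort early if the gawk binary is not in PATH
command -v gawk > /dev/null || { printf 'gawk is required but not installed.\n' >&2; exit 1; }
```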

@friendly-bits (Author) commented Jan 12, 2025

I managed to enable and run the action "Check for parked domains" in my fork and it seems to be running with correct output:

Run bash scripts/check_parked.sh part1
[info] Processing file domains.tmp
[start] Analyzing 151944 entries for parked domains
[info] Found parked domain: deeepend-linkednrepos-ecrufkvroie.pages.dev
[info] Found parked domain: www.zwwwcsg.freewebhostmost.com
[info] Found parked domain: balcaovirtual-cgd.com

etc

@jarelllama (Owner)

Try running the test-functions action. You can see the test results for removing parked domains and adding parked domains.

@friendly-bits (Author)

Unfortunately, I can't do it right now. While the "Check for parked domains" action was running, I got this notification from GitHub: "GitHub Actions is currently disabled for your account. Please reach out to GitHub Support for assistance." Reach out I did, and I'm waiting for a reply. Not sure what caused this, but for now I can't run any actions.

@jarelllama (Owner)

@friendly-bits for 0877484 I don't think looping through config/subdomains.txt is slow considering the file only has 18 subdomains to loop through. I admire your solution, although I'll have to take some time to understand it haha...

This subdomain removal is also done in my other scripts (retrieve_domains.sh, validate_domains.sh, etc) so I'll keep your solution in mind if I want to improve subdomain removal for all scripts.

@jarelllama (Owner)

Let me know what you think about the subdomain removal, and do review my commit in 40bfa64.

@friendly-bits (Author)

> review my commit in 40bfa64

I believe that functionally it's the same as my commit e2c8eab, except you changed the variable name from lines_cnt to lines. Looks fine, although I do prefer lines_cnt for readability: a variable named lines might be mistaken for storing the actual lines, while lines_cnt clearly stores a line count. These things don't matter as long as you don't have to maintain the code some undefined time from now; if you do, more specific variable names will make maintenance easier and help avoid bugs.

As an unrelated side note, I think this script relies too much on global variables, which degrades code readability and invites logic bugs. Where possible, pass values as arguments when calling a function rather than storing them in a global variable and reading it inside the function.
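A minimal sketch of what I mean (count_lines is a hypothetical helper, not code from the script):

```sh
# Implicit: the function reads a global and writes its result to another global,
# so its dependencies are hidden from the caller
count_lines() {
    lines_cnt="$(wc -l < "$file_to_count")"
}
file_to_count=parked.tmp
count_lines
echo "$lines_cnt"

# Explicit: input is an argument, the result is the function's output;
# no hidden state to keep track of
count_lines() {
    wc -l < "$1"
}
lines_cnt="$(count_lines parked.tmp)"
echo "$lines_cnt"
```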

I'll reply about subdomain removal next.

@friendly-bits (Author) commented Jan 13, 2025

> I don't think looping through config/subdomains.txt is slow considering the file only has 18 subdomains to loop through.

Let's take a moment to analyze what the current code is doing (a simplified sketch follows the list).

  1. It reads the file config/subdomains.txt line-by-line in a 'while .. do' loop, which is slow. If you were then processing each line with shell builtins, the low line count in that file would probably make this faster than calling a binary (e.g. awk) once, because calling a binary has an overhead. However, that is not the case here.
  2. It then calls a binary (sed), which has an overhead multiplied by the number of calls (in this case 18).
  3. sed performs I/O on the file parked.tmp twice: once for reading and once for writing. This likely has the largest overhead here, and it is also multiplied by 18.
  4. sed processes the contents of the file with a regex, which of course has overhead, again multiplied by 18.
  5. Finally, sort performs I/O again, this time not in a loop.
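For reference, the pattern I'm describing looks roughly like this (a simplified sketch; the exact code in check_parked.sh differs in details):

```sh
# One sed call, and one full read + write of parked.tmp, per subdomain
while read -r subdomain; do
    sed -i "s/^${subdomain}\.//" parked.tmp
done < config/subdomains.txt
sort -u parked.tmp -o parked.tmp
```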

Now let's analyze the gawk-based solution (sketched below the list).

  1. It calls the awk binary once, which has overhead.
  2. It reads the file config/subdomains.txt in one go and stores each line as a key in an array. While this is also a loop, awk is orders of magnitude faster than shell at this task.
  3. It performs I/O on the file parked.tmp once. The output is piped into sort and then redirected to the file parked-removed-subdomains.tmp, so that's a second binary call and a second I/O. The pipe also has overhead (about the same as creating a subshell), but it should be significantly smaller than the I/O overhead. The mv command also has some overhead, but it consists mostly of the binary call overhead (renaming the file is fast).
  4. awk processes the contents of parked.tmp once, running a loop on each line that compares the first .-separated segment to all keys in the array; if it finds a match, the subdomain is removed from the string. This is not as fast as sed's processing, but it's still pretty fast if implemented correctly (which it is). Most importantly, it is only done once rather than 18 times.
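In sketch form (filenames from the steps above; implementation details approximated):

```sh
gawk '
    # First file: store each subdomain as an array key
    NR == FNR { subdomains[$0]; next }
    # Second file: strip the leading subdomain if the first label matches a key
    {
        split($0, parts, ".")
        for (s in subdomains)
            if (parts[1] == s) {
                $0 = substr($0, length(s) + 2)
                break
            }
        print
    }
' config/subdomains.txt parked.tmp | sort -u > parked-removed-subdomains.tmp
# Renaming is cheap; the mv overhead is mostly the binary call itself
mv parked-removed-subdomains.tmp parked.tmp
```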

While sed is generally faster than gawk for comparable tasks, 18 calls to sed are much slower than 1 call to awk, and processing the same data with sed 18 times is much slower than processing it once with awk.

Now, to put things in proportion: most likely the total resulting overheads are dwarfed by the main source of slowness, which is fetching data over the internet for each domain on a large list of domains. So in the grand scheme of things, the efficiency of remove_parked probably barely matters in this particular case. Still, IMO better code is better.

Hope all of this makes sense.

@jarelllama (Owner)

Thanks for the clear explanation, @friendly-bits. I'll look more into the code later.

@friendly-bits (Author) commented Jan 13, 2025

Sorry, I made a mess with the merge (probably because I didn't account for the commit you added, which I didn't have locally).

After explaining the awk script, I suddenly realized that there is no need for it to have a loop at all. It can simply check whether the first .-separated segment of the domain matches one of the keys of the array containing the subdomains (I also renamed the array to subdom for better readability). In awk, array keys are hashed, so looking up a key is very fast.
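In sketch form, using the subdom name (again approximated, not the exact committed code):

```sh
gawk '
    NR == FNR { subdom[$0]; next }
    {
        split($0, parts, ".")
        # Array keys are hashed, so this membership test is effectively O(1);
        # no loop over the subdomains is needed
        if (parts[1] in subdom)
            $0 = substr($0, length(parts[1]) + 2)
        print
    }
' config/subdomains.txt parked.tmp | sort -u > parked-removed-subdomains.tmp
mv parked-removed-subdomains.tmp parked.tmp
```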

@friendly-bits (Author)

P.S. I love optimizing useful things, so feel free to ping me whenever you need help with this.

@jarelllama (Owner)

@friendly-bits if you're using VS Code, do you know of any syntax highlighting extensions for awk? Also, is there a reason for using gawk instead of mawk, which I'm using in the other scripts?

@friendly-bits (Author)

> do you know of any syntax highlighting extensions for awk?

Unfortunately, no.

> is there a reason for using gawk instead of mawk, which I'm using in the other scripts?

Only because I know that gawk is fast, but I never tested mawk's performance. The Busybox awk is very slow in comparison. mawk is probably fine.
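If you want to compare, the quickest way is just to time both interpreters on the same script and data (strip_subdomains.awk here is a hypothetical file holding the awk code):

```sh
# Discard output; we only care about the elapsed time reported by `time`
time gawk -f strip_subdomains.awk config/subdomains.txt parked.tmp > /dev/null
time mawk -f strip_subdomains.awk config/subdomains.txt parked.tmp > /dev/null
```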

@friendly-bits
Copy link
Author

For a more detailed reply: normally I just copy-paste the awk command and run it on some test data. When there's a syntax error, I find that gawk is quite good at pointing to it. Initially I had a bit of trouble with awk syntax, but no longer; now I find the syntax quite simple and logical.
