
Improve processing efficiency in check_parked.sh #409

Open
wants to merge 6 commits into base: main

Conversation

@friendly-bits commented Jan 12, 2025

Hi, here are 2 fairly simple modifications which should improve efficiency:

  • avoid creating a subshell in a loop (creating subshells is slow, so it's better to avoid them in loops; see the sketch below)
  • replace the slow builtin 'while read -r' loop + calls to 'sed -i' used to strip subdomains with an awk script
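To illustrate the first point, here's a minimal sketch (not the actual code from check_parked.sh) of doing the same per-line work with parameter expansion instead of a command substitution, which forks a subshell on every iteration:

```sh
# Slow: each $(...) forks a subshell, once per line
while read -r domain; do
    tld="$(printf '%s' "$domain" | sed 's/.*\.//')"
    printf '%s\n' "$tld"
done < domains.tmp

# Faster: pure parameter expansion, no subshell per iteration
while read -r domain; do
    tld="${domain##*.}"
    printf '%s\n' "$tld"
done < domains.tmp
```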

Functionally this should work the same as before, just somewhat faster.

@jarelllama (Owner)

Thanks! I also got to check why my test workflow isn't running for this pull request. It tests the functionality of the parked check.

@friendly-bits (Author) commented Jan 12, 2025

Unfortunately, I have zero experience with GitHub Actions, so I couldn't test the script. I did test the awk code in isolation and it seemed fine.

The script now checks for gawk's presence - possible that it's missing?
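For reference, the kind of presence check I mean is just something like (a sketch; the actual check in the script may look different):

```sh
# Abort early if the gawk binary is not in PATH
command -v gawk > /dev/null || { printf 'gawk is required but not installed.\n' >&2; exit 1; }
```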

@friendly-bits (Author) commented Jan 12, 2025

I managed to enable and run the action "Check for parked domains" in my fork and it seems to be running with correct output:

Run bash scripts/check_parked.sh part1
[info] Processing file domains.tmp
[start] Analyzing 151944 entries for parked domains
[info] Found parked domain: deeepend-linkednrepos-ecrufkvroie.pages.dev
[info] Found parked domain: www.zwwwcsg.freewebhostmost.com
[info] Found parked domain: balcaovirtual-cgd.com

etc

@jarelllama (Owner)

Try running the test-functions action. You can see the test results for removing parked domains and adding parked domains.

@friendly-bits (Author)

Unfortunately, I can't do it right now. While the "Check for parked domains" action was running, I got this notification from GitHub: "GitHub Actions is currently disabled for your account. Please reach out to GitHub Support for assistance." Reach out I did, and I'm waiting for a reply. Not sure what caused this, but for now I can't run any actions.

@jarelllama (Owner)

@friendly-bits for 0877484 I don't think looping through config/subdomains.txt is slow considering the file only has 18 subdomains to loop through. I admire your solution, although I'll have to take some time to understand it haha...

This subdomain removal is also done in my other scripts (retrieve_domains.sh, validate_domains.sh, etc) so I'll keep your solution in mind if I want to improve subdomain removal for all scripts.

@jarelllama (Owner)

Let me know what you think about the subdomain removal, and do review my commit in 40bfa64.

@friendly-bits (Author)

> review my commit in 40bfa64

I believe that functionally it's the same as my commit e2c8eab, except you changed the variable name from lines_cnt to lines. Looks fine, although I do prefer lines_cnt for readability: a variable named lines might be mistaken for storing the actual lines, while lines_cnt clearly stores a line count. These things don't matter as long as you don't have to maintain the code some undefined time from now; if you do, more specific variable names will make maintenance easier and help avoid bugs.

As an unrelated side note, I think this script relies too much on global variables, which degrades code readability and invites logic bugs. Where possible, pass values as arguments when calling a function rather than storing them in a global variable and reading it inside the function.
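A minimal sketch of what I mean (count_lines is a hypothetical helper, not code from the script):

```sh
# Implicit: the function reads a global and writes its result to another global,
# so its dependencies are hidden from the caller
count_lines() {
    lines_cnt="$(wc -l < "$file_to_count")"
}
file_to_count=parked.tmp
count_lines
echo "$lines_cnt"

# Explicit: input is an argument, the result is the function's output;
# no hidden state to keep track of
count_lines() {
    wc -l < "$1"
}
lines_cnt="$(count_lines parked.tmp)"
echo "$lines_cnt"
```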

I'll reply about subdomain removal next.

@friendly-bits (Author) commented Jan 13, 2025

> I don't think looping through config/subdomains.txt is slow considering the file only has 18 subdomains to loop through.

Let's take a moment to analyze what the current code is doing (a simplified sketch follows the list).

  1. It reads the file config/subdomains.txt line-by-line in a 'while .. do' loop, which is slow. If you were then processing each line with shell builtins, the low line count in that file would probably make this faster than calling a binary (e.g. awk) once, because calling a binary has an overhead. However, that is not the case here.
  2. It then calls a binary (sed), which has an overhead multiplied by the number of calls (in this case 18).
  3. sed performs I/O on the file parked.tmp twice: once for reading and once for writing. This likely has the largest overhead here, and it is also multiplied by 18.
  4. sed processes the contents of the file with a regex, which of course has overhead, again multiplied by 18.
  5. Finally, sort performs I/O again, this time not in a loop.
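For reference, the pattern I'm describing looks roughly like this (a simplified sketch; the exact code in check_parked.sh differs in details):

```sh
# One sed call, and one full read + write of parked.tmp, per subdomain
while read -r subdomain; do
    sed -i "s/^${subdomain}\.//" parked.tmp
done < config/subdomains.txt
sort -u parked.tmp -o parked.tmp
```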

Now let's analyze the gawk-based solution (sketched below the list).

  1. It calls the awk binary once, which has overhead.
  2. It reads the file config/subdomains.txt in one go and stores each line as a key in an array. While this is also a loop, awk is orders of magnitude faster than shell at this task.
  3. It performs I/O on the file parked.tmp once. The output is piped into sort and then redirected to the file parked-removed-subdomains.tmp, so that's a second binary call and a second I/O. The pipe also has overhead (about the same as creating a subshell), but it should be significantly smaller than the I/O overhead. The mv command also has some overhead, but it consists mostly of the binary call overhead (renaming the file is fast).
  4. awk processes the contents of parked.tmp once, running a loop on each line that compares the first .-separated segment to all keys in the array; if it finds a match, the subdomain is removed from the string. This is not as fast as sed's processing, but it's still pretty fast if implemented correctly (which it is). Most importantly, it is only done once rather than 18 times.
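In sketch form (filenames from the steps above; implementation details approximated):

```sh
gawk '
    # First file: store each subdomain as an array key
    NR == FNR { subdomains[$0]; next }
    # Second file: strip the leading subdomain if the first label matches a key
    {
        split($0, parts, ".")
        for (s in subdomains)
            if (parts[1] == s) {
                $0 = substr($0, length(s) + 2)
                break
            }
        print
    }
' config/subdomains.txt parked.tmp | sort -u > parked-removed-subdomains.tmp
# Renaming is cheap; the mv overhead is mostly the binary call itself
mv parked-removed-subdomains.tmp parked.tmp
```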

While sed is generally faster than gawk for comparable tasks, 18 calls to sed are much slower than 1 call to awk, and processing the same data with sed 18 times is much slower than processing it once with awk.

Now, to put things in proportion: most likely the total resulting overheads are dwarfed by the main source of slowness, which is fetching data over the internet for each domain on a large list of domains. So in the grand scheme of things, the efficiency of remove_parked probably barely matters in this particular case. Still, IMO better code is better.

Hope all of this makes sense.

@jarelllama (Owner)

Thanks for the clear explanation, @friendly-bits. I'll look more into the code later.

@friendly-bits (Author) commented Jan 13, 2025

Sorry, I made a mess with the merge (probably because I didn't account for the commit you added, which I didn't have locally).

After explaining the awk script, I suddenly realized that there is no need for it to have a loop at all. It can simply check whether the first .-separated segment of the domain matches one of the keys of the array containing the subdomains (I also renamed the array to subdom for better readability). In awk, array keys are hashed, so looking up a key is very fast.
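In sketch form, using the subdom name (again approximated, not the exact committed code):

```sh
gawk '
    NR == FNR { subdom[$0]; next }
    {
        split($0, parts, ".")
        # Array keys are hashed, so this membership test is effectively O(1);
        # no loop over the subdomains is needed
        if (parts[1] in subdom)
            $0 = substr($0, length(parts[1]) + 2)
        print
    }
' config/subdomains.txt parked.tmp | sort -u > parked-removed-subdomains.tmp
mv parked-removed-subdomains.tmp parked.tmp
```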

@friendly-bits (Author)

P.S. I love optimizing useful things, so feel free to ping me whenever you need help with this.

@jarelllama (Owner)

@friendly-bits if you're using VS Code, do you know of any syntax highlighting extensions for awk? Also, is there a reason for using gawk instead of mawk, which I'm using in the other scripts?

@friendly-bits (Author)

> do you know of any syntax highlighting extensions for awk?

Unfortunately, no.

> is there a reason for using gawk instead of mawk, which I'm using in the other scripts?

Only because I know that gawk is fast, but I never tested mawk's performance. The Busybox awk is very slow in comparison. mawk is probably fine.
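If you want to compare, the quickest way is just to time both interpreters on the same script and data (strip_subdomains.awk here is a hypothetical file holding the awk code):

```sh
# Discard output; we only care about the elapsed time reported by `time`
time gawk -f strip_subdomains.awk config/subdomains.txt parked.tmp > /dev/null
time mawk -f strip_subdomains.awk config/subdomains.txt parked.tmp > /dev/null
```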

@friendly-bits
Copy link
Author

For a more detailed reply: normally I just copy-paste the awk command and run it on some test data. When there's a syntax error, I find that gawk is quite good at pointing to it. Initially I had a bit of trouble with awk syntax, but no longer; now I find the syntax quite simple and logical.
