Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/update example books #423

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

M3ssman
Copy link
Contributor

@M3ssman M3ssman commented Apr 25, 2024

Contains some modifications to the example section, altogether with some minor updates.

Targets some current difficulties in the actual implementation:

  • docker-compose turned into docker compose
  • needed to reduce process pool size, actual setup always freezes machine
  • if only one single book missing process starts from scratch again, hence individual checks

If you find it useful, feel to merge 🙂

Copy link
Member

@jbaiter jbaiter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

I'm not sure what you mean by if only one single book missing process starts from scratch again, do you mean the script should only index files that are missing on disk/in the index?

If so, I think the best way would be to add a --only-missing flag that before indexing checks which docs are already in the index and then skips those.

- ~16GiB of free storage
2. Launch the containers: `docker-compose up -d`
1. Clear local data dir if exists: `rm -vr data`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is neccessary and will cause unneccessary traffic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your absolutely right! I just noticed this can do real harm. If the directory has been deleted and the containers create them from scratch it is going ill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerning if only one single book missing ... : The current flow is like this: one book is missing, then tar requests starts again. I tried the example in a place with slow net speed, lots of connection errs and required several tries.
Before altering this I wasn't able to get just the books in place.

example/README.md Outdated Show resolved Hide resolved
2. Launch the containers: `docker-compose up -d`
1. Clear local data dir if exists: `rm -vr data`
2. Launch containers: `docker compose up -d`
- Check permissions for subdirectory `./data` since there goes the load
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Check permissions for subdirectory `./data` since there goes the load
- Make sure you have write permissions for the subdirectory `./data` since this is where the data will be downloaded to

- If only interested in books: `./ingest_books_only.py`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a --books-only flag to the ingest.py script instead of creating a second one that shares much of the code with the existing script.



# turn on/off diagnostic information
LOG_LEVEL = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the logging module from the python stdlib instead of the if LOG_LEVEL > ... + print pattern, makes for much less verbose/nested code.



if __name__ == '__main__':
N_CORES = os.cpu_count() // 6 if os.cpu_count() is not None else 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if this wasn't hardcoded and instead exposed as a --num-workers (just a suggestion) flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay to evaluate those flags with stdlib argparse ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, what was the rationale behind generate_batches generator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or to send a batch of documents in one request to the index?

Copy link
Member

@jbaiter jbaiter Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, what was the rationale behind generate_batches generator?

Increasing parallelism and thus reduce indexing time. By ingesting multiple documents at the same time we can make use of more than one thread for data processing in Solr.

@M3ssman M3ssman requested a review from jbaiter April 26, 2024 11:14
@jbaiter
Copy link
Member

jbaiter commented Apr 26, 2024

First of all, thanks for incorporating the feedback :-)
However, that's a pretty large diff, that touches almost every part of the script, for a rather small set of changes.
Can you:

  • keep the existing style and general structure of the script
  • keep the changes you need localized to only the places that need them?

And I'd only use the logging API for debugging stuff, the stdout output was intentionally kept rather brief and readable. Spamming a log line for every iteration is kind of excessive.



# turn on/off diagnostic information
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# turn on/off diagnostic information

return True
return False
def main_ingest(the_args):
_n_worker = the_args.num_workers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't have to mark variables as private in a function body, they're private to the function anyway.

if not vol_path.exists():
return True
return False
def main_ingest(the_args):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function shouldn't have to know about argparse, just pass the parameters directly, there are not that many 🙃

for fut in as_completed(futs):
fut.result()
print("\n")
PARSER = ArgumentParser(description='ingest example data into SOLR')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those aren't really constants either, even if they're in the global scope.

Comment on lines 335 to 337
LOGGER = Logger(LOGGER_NAME, _calculate_log_level(ARGS.log_level))
STDOUT_HANDLER = StreamHandler(sys.stdout)
STDOUT_HANDLER.setFormatter(Formatter("%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably stick to logging.basicConfig and not even create a dedicated logger, since this is just a small utility script:

Suggested change
LOGGER = Logger(LOGGER_NAME, _calculate_log_level(ARGS.log_level))
STDOUT_HANDLER = StreamHandler(sys.stdout)
STDOUT_HANDLER.setFormatter(Formatter("%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"))
logging.basicConfig(level=logging.DEBUG if args.debug else logging.WARN)

then you can just call logging.debug in your code.

fut.result()
print("\n")
PARSER = ArgumentParser(description='ingest example data into SOLR')
PARSER.add_argument('--log-level', help='like "debug", "info", "error" (default:info)', required=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only use logging for debug statements, I think we only need a --debug flag to toggle debug logging on.

@M3ssman
Copy link
Contributor Author

M3ssman commented Apr 26, 2024

Sorry, I didn't realize your latest remarks at first sight!
But anyway, it's all only a proposal. If you can't go with, it's okay and just leave it out.

@M3ssman M3ssman force-pushed the feat/update-example-books branch from 5aca1ab to b6f11f0 Compare April 27, 2024 08:24
@M3ssman M3ssman requested a review from jbaiter April 29, 2024 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants