Feat/update example books #423
base: main
Thanks a lot!
I'm not sure what you mean by "if only one single book missing process starts from scratch again". Do you mean the script should only index files that are missing on disk or in the index?
If so, I think the best way would be to add a `--only-missing` flag that checks, before indexing, which docs are already in the index and then skips those.
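A minimal sketch of what such a check could look like, assuming the set of already indexed ids has been fetched beforehand. The helper name `select_missing` and the ids are hypothetical and not part of `ingest.py`:

```python
from typing import Iterable, List, Set


def select_missing(doc_ids: Iterable[str], indexed_ids: Set[str]) -> List[str]:
    """Return the ids that are not in the index yet, keeping the input order."""
    return [doc_id for doc_id in doc_ids if doc_id not in indexed_ids]


# Hypothetical ids for illustration. In the real script, `already_indexed`
# would come from a query against the index (not shown here).
all_ids = ["book_a", "book_b", "book_c"]
already_indexed = {"book_a", "book_c"}
to_ingest = select_missing(all_ids, already_indexed)
print(to_ingest)
```

With this in place, only the documents in `to_ingest` would be downloaded and indexed on a retry.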
example/README.md
Outdated
- ~16GiB of free storage
2. Launch the containers: `docker-compose up -d`
1. Clear local data dir if exists: `rm -vr data`
I don't think this is necessary, and it will cause unnecessary traffic.
You're absolutely right! I just noticed this can do real harm. If the directory has been deleted and the containers create it from scratch, things go wrong.
Concerning "if only one single book missing ...": the current flow is like this: if a single book is missing, the tar request starts again. I tried the example in a place with slow network speed and lots of connection errors, and it required several tries.
Before altering this, I wasn't able to get even the books in place.
example/README.md
Outdated
2. Launch the containers: `docker-compose up -d`
1. Clear local data dir if exists: `rm -vr data`
2. Launch containers: `docker compose up -d`
- Check permissions for subdirectory `./data` since there goes the load
- Check permissions for subdirectory `./data` since there goes the load
- Make sure you have write permissions for the subdirectory `./data` since this is where the data will be downloaded to
example/README.md
Outdated
- If only interested in books: `./ingest_books_only.py`
I'd prefer a `--books-only` flag to the `ingest.py` script instead of creating a second one that shares much of the code with the existing script.
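Roughly what I have in mind, as a sketch only. The source names and the `sources_to_ingest` helper are made up for illustration and do not reflect the actual structure of `ingest.py`:

```python
from argparse import ArgumentParser
from typing import List


def build_parser() -> ArgumentParser:
    parser = ArgumentParser(description="ingest example data into Solr")
    # store_true makes the flag default to False when it is not given
    parser.add_argument("--books-only", action="store_true",
                        help="only ingest the books, skip everything else")
    return parser


def sources_to_ingest(books_only: bool) -> List[str]:
    # Hypothetical source names, the real script may organize this differently
    return ["books"] if books_only else ["books", "other_source"]


args = build_parser().parse_args(["--books-only"])
print(sources_to_ingest(args.books_only))
```

That way both modes share the same download and indexing code.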
example/ingest_books_only.py
Outdated
# turn on/off diagnostic information
LOG_LEVEL = 0
Please use the `logging` module from the Python stdlib instead of the `if LOG_LEVEL > ...` + `print` pattern; it makes for much less verbose and nested code.
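To illustrate the difference, here is a small sketch. The logger name `ingest` and the `report_progress` function are placeholders, not names from the script:

```python
import logging

# A named logger, the name "ingest" is only an example
log = logging.getLogger("ingest")


def report_progress(done: int, total: int) -> None:
    # Instead of:
    #     if LOG_LEVEL > 0:
    #         print(f"ingested {done} of {total} documents")
    # a single call is enough. The logging module decides whether the
    # message is emitted, based on the configured level.
    log.debug("ingested %d of %d documents", done, total)


report_progress(1, 3)
```

The message arguments are only formatted when the record is actually emitted, so the call stays cheap when debug output is off.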
example/ingest_books_only.py
Outdated
if __name__ == '__main__':
    N_CORES = os.cpu_count() // 6 if os.cpu_count() is not None else 1
I'd prefer if this wasn't hardcoded and instead exposed as a `--num-workers` flag (just a suggestion).
Is it okay to evaluate those flags with stdlib `argparse`?
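For illustration, a `--num-workers` flag handled with stdlib `argparse` could look roughly like this. The helper `default_workers` is hypothetical. Note that `os.cpu_count() // 6` evaluates to 0 on machines with fewer than six cores, so this sketch clamps the default to at least one worker:

```python
import os
from argparse import ArgumentParser


def default_workers() -> int:
    # os.cpu_count() may return None, and integer division by 6 can yield 0
    return max(1, (os.cpu_count() or 1) // 6)


parser = ArgumentParser(description="ingest example data into Solr")
parser.add_argument("--num-workers", type=int, default=default_workers(),
                    help="number of parallel workers (default: %(default)s)")

# An explicit argument list is passed here so the sketch does not depend
# on the real command line
args = parser.parse_args(["--num-workers", "4"])
print(args.num_workers)
```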
Btw, what was the rationale behind the `generate_batches` generator?
Or to send a batch of documents in one request to the index?
> Btw, what was the rationale behind the `generate_batches` generator?

Increasing parallelism and thus reducing indexing time. By ingesting multiple documents at the same time, we can make use of more than one thread for data processing in Solr.
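The idea can be sketched as follows. This is not the actual code from `ingest.py`: the signature of `generate_batches`, the `index_batch` placeholder and the use of a thread pool are assumptions made for the example:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice
from typing import Iterable, Iterator, List


def generate_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def index_batch(batch: List[str]) -> int:
    # Placeholder: a real implementation would send the batch to the index
    # in one request. Here we only report how many documents were handled.
    return len(batch)


def ingest(items: Iterable[str], batch_size: int, num_workers: int) -> int:
    # Each batch is submitted as its own task so several can be in flight
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futs = [pool.submit(index_batch, b)
                for b in generate_batches(items, batch_size)]
        return sum(fut.result() for fut in as_completed(futs))


print(ingest([f"doc{i}" for i in range(10)], batch_size=3, num_workers=2))
```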
First of all, thanks for incorporating the feedback :-)
And I'd only use the |
example/ingest.py
Outdated
# turn on/off diagnostic information
# turn on/off diagnostic information
example/ingest.py
Outdated
    return True
return False
def main_ingest(the_args):
    _n_worker = the_args.num_workers
You don't have to mark variables as private in a function body; they're private to the function anyway.
example/ingest.py
Outdated
if not vol_path.exists():
    return True
return False
def main_ingest(the_args):
this function shouldn't have to know about argparse, just pass the parameters directly, there are not that many 🙃
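Something along these lines, as a sketch. The parameter names, the placeholder body and the `cli` wrapper are assumptions for illustration, not the actual code:

```python
from argparse import ArgumentParser
from typing import List


def main_ingest(num_workers: int, books_only: bool = False) -> str:
    # Placeholder body: the function only sees plain values and knows
    # nothing about how they were obtained
    return f"workers={num_workers} books_only={books_only}"


def cli(argv: List[str]) -> str:
    parser = ArgumentParser(description="ingest example data into Solr")
    parser.add_argument("--num-workers", type=int, default=1)
    parser.add_argument("--books-only", action="store_true")
    args = parser.parse_args(argv)
    # Argument parsing stays at the edge, the values are passed on explicitly
    return main_ingest(num_workers=args.num_workers, books_only=args.books_only)


print(cli(["--num-workers", "3", "--books-only"]))
```

This also makes the function easy to call from a test or another script without building a namespace object first.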
example/ingest.py
Outdated
for fut in as_completed(futs):
    fut.result()
print("\n")
PARSER = ArgumentParser(description='ingest example data into SOLR')
those aren't really constants either, even if they're in the global scope.
example/ingest.py
Outdated
LOGGER = Logger(LOGGER_NAME, _calculate_log_level(ARGS.log_level))
STDOUT_HANDLER = StreamHandler(sys.stdout)
STDOUT_HANDLER.setFormatter(Formatter("%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"))
I'd probably stick to `logging.basicConfig` and not even create a dedicated logger, since this is just a small utility script:
LOGGER = Logger(LOGGER_NAME, _calculate_log_level(ARGS.log_level))
STDOUT_HANDLER = StreamHandler(sys.stdout)
STDOUT_HANDLER.setFormatter(Formatter("%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"))
logging.basicConfig(level=logging.DEBUG if args.debug else logging.WARN)
Then you can just call `logging.debug` in your code.
example/ingest.py
Outdated
fut.result()
print("\n")
PARSER = ArgumentParser(description='ingest example data into SOLR')
PARSER.add_argument('--log-level', help='like "debug", "info", "error" (default:info)', required=False,
If we only use logging for debug statements, I think we only need a `--debug` flag to toggle debug logging on.
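A possible wiring, as a sketch only. The helper name `log_level_from_args` is made up for this example:

```python
import logging
from argparse import ArgumentParser
from typing import List


def log_level_from_args(argv: List[str]) -> int:
    parser = ArgumentParser(description="ingest example data into Solr")
    parser.add_argument("--debug", action="store_true",
                        help="enable debug logging")
    args = parser.parse_args(argv)
    # Without --debug only warnings and errors are shown
    return logging.DEBUG if args.debug else logging.WARNING


# An explicit argument list is used so the sketch does not read sys.argv
logging.basicConfig(level=log_level_from_args(["--debug"]))
logging.debug("debug logging is enabled")
```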
Sorry, I didn't notice your latest remarks at first sight!
5aca1ab to b6f11f0
Contains some modifications to the example section, together with some minor updates.
Targets some difficulties in the current implementation:
- `docker-compose` turned into `docker compose`
If you find it useful, feel free to merge 🙂