fix: don't load NDJSON data into memory all at once #344

Draft · mikix wants to merge 1 commit into main

Conversation

mikix (Contributor) commented Jan 30, 2025

This commit changes how --load-ndjson-dir works: instead of us loading all of the NDJSON data into memory at once and then handing it to DuckDB, we now tell DuckDB to load the data from the files on disk itself.

This allows querying larger-than-memory data sets.

That said, the SQL itself can still be a memory bottleneck if a query requires loading too much data into memory.
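For a sense of what "let DuckDB read from disk" looks like, here is a minimal sketch using DuckDB's read_ndjson_auto table function. This is not the PR's actual code; the directory layout and view name are illustrative:

import duckdb

con = duckdb.connect()  # in-memory database; pass a path to persist

# Let DuckDB stream and parse the NDJSON itself, instead of us reading it
# into Python memory first. The view reads lazily, so the raw files can
# be larger than RAM.
con.execute("""
    CREATE VIEW patient AS
    SELECT * FROM read_ndjson_auto('ndjson-dir/patient/*.ndjson')
""")

# Queries pull rows from disk as needed; only the query's working set
# (joins, sorts, aggregates) has to fit in memory -- hence the caveat
# above about the SQL itself still being a potential bottleneck.
print(con.execute("SELECT COUNT(*) FROM patient").fetchone())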

Checklist

  • Consider if documentation in docs/ needs to be updated
    • If you've changed the structure of a table, you may need to run generate-md
    • If you've added/removed core study fields that are not in US Core, update our list of those in core-study-details.md
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration in manifest.toml

-def operational_errors(self) -> tuple[Exception]:
+def operational_errors(self) -> tuple[type[Exception], ...]:
mikix (Contributor, Author) commented:

Unrelated, but I noticed that this method had incorrect typing (I think my fault).
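
To spell out the difference (the example values below are mine, not from the codebase): tuple[Exception] means a one-element tuple holding an Exception instance, while tuple[type[Exception], ...] means a variable-length tuple of exception classes, which is what callers such as an except clause actually need:

# tuple[Exception] declares a one-element tuple of Exception *instances*:
one_instance: tuple[Exception] = (ValueError("oops"),)

# tuple[type[Exception], ...] declares a variable-length tuple of exception
# *classes*, matching how the return value is used:
error_classes: tuple[type[Exception], ...] = (ValueError, KeyError)

try:
    raise KeyError("example")
except error_classes:
    pass  # caught: `except` expects a tuple of classes, not instances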


return all_tables


def _handle_load_ndjson_dir(args: dict[str, str], backend: base.DatabaseBackend) -> None:
mikix (Contributor, Author) commented Jan 30, 2025

This method is mostly a refactor; the only new functionality is a progress bar while scanning the NDJSON for its schema. (Which in my 26 GB test folder takes four minutes.)

I felt the --load-ndjson-dir logic was spread across enough if/else branches that it was hard to reason about. Moving it all into one method helped, especially now that there is extra code for the progress bar.
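
The actual implementation isn't shown here, but as a rough sketch of the schema-scanning progress bar (rich's track() helper and the _peek_schema() stub are assumptions of mine, not the project's API):

import json
from pathlib import Path

from rich.progress import track

def _peek_schema(path: Path) -> set[str]:
    # Stand-in for real schema detection: look at the first record's keys.
    with path.open() as f:
        first_line = f.readline()
    return set(json.loads(first_line)) if first_line.strip() else set()

def scan_ndjson_dir(ndjson_dir: str) -> dict[str, set[str]]:
    # Wrap the slow per-file scan in a progress bar, since it can take
    # minutes on large folders.
    files = sorted(Path(ndjson_dir).rglob("*.ndjson"))
    return {
        str(path): _peek_schema(path)
        for path in track(files, description="Scanning NDJSON schemas...")
    }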
