fix: don't load NDJSON data into memory all at once #344

Draft · mikix wants to merge 1 commit into main

Conversation

mikix (Contributor) commented Jan 30, 2025

This commit changes how --load-ndjson-dir works: instead of us loading all of the NDJSON data into memory at once and then handing it to DuckDB, we now tell DuckDB to load the data from the files on disk itself.

This allows querying larger-than-memory data sets.

That said, the SQL itself can still be a memory bottleneck if a query requires loading too much data into memory.
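For a sense of what "let DuckDB read from disk" looks like, here is a minimal sketch using DuckDB's read_ndjson_auto table function. This is not the PR's actual code; the directory layout and view name are illustrative:

import duckdb

con = duckdb.connect()  # in-memory database; pass a path to persist

# Let DuckDB stream and parse the NDJSON itself, instead of us reading it
# into Python memory first. The view reads lazily, so the raw files can
# be larger than RAM.
con.execute("""
    CREATE VIEW patient AS
    SELECT * FROM read_ndjson_auto('ndjson-dir/patient/*.ndjson')
""")

# Queries pull rows from disk as needed; only the query's working set
# (joins, sorts, aggregates) has to fit in memory -- hence the caveat
# above about the SQL itself still being a potential bottleneck.
print(con.execute("SELECT COUNT(*) FROM patient").fetchone())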

Checklist

  • Consider if documentation in docs/ needs to be updated
    • If you've changed the structure of a table, you may need to run generate-md
    • If you've added/removed core study fields that are not in US Core, update our list of those in core-study-details.md
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration in manifest.toml

-def operational_errors(self) -> tuple[Exception]:
+def operational_errors(self) -> tuple[type[Exception], ...]:
mikix (Contributor, Author) commented:

Unrelated, but I noticed that this method had incorrect typing (I think my fault).
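
To spell out the difference (the example values below are mine, not from the codebase): tuple[Exception] means a one-element tuple holding an Exception instance, while tuple[type[Exception], ...] means a variable-length tuple of exception classes, which is what callers such as an except clause actually need:

# tuple[Exception] declares a one-element tuple of Exception *instances*:
one_instance: tuple[Exception] = (ValueError("oops"),)

# tuple[type[Exception], ...] declares a variable-length tuple of exception
# *classes*, matching how the return value is used:
error_classes: tuple[type[Exception], ...] = (ValueError, KeyError)

try:
    raise KeyError("example")
except error_classes:
    pass  # caught: `except` expects a tuple of classes, not instances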


return all_tables


def _handle_load_ndjson_dir(args: dict[str, str], backend: base.DatabaseBackend) -> None:
mikix (Contributor, Author) commented Jan 30, 2025

This method is mostly a refactor; the only new functionality is a progress bar while scanning the NDJSON for its schema. (Which in my 26 GB test folder takes four minutes.)

I felt the --load-ndjson-dir logic was spread across enough if/else branches that it was hard to reason about. Moving it all into one method helped, especially now that there is extra code for the progress bar.
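
The actual implementation isn't shown here, but as a rough sketch of the schema-scanning progress bar (rich's track() helper and the _peek_schema() stub are assumptions of mine, not the project's API):

import json
from pathlib import Path

from rich.progress import track

def _peek_schema(path: Path) -> set[str]:
    # Stand-in for real schema detection: look at the first record's keys.
    with path.open() as f:
        first_line = f.readline()
    return set(json.loads(first_line)) if first_line.strip() else set()

def scan_ndjson_dir(ndjson_dir: str) -> dict[str, set[str]]:
    # Wrap the slow per-file scan in a progress bar, since it can take
    # minutes on large folders.
    files = sorted(Path(ndjson_dir).rglob("*.ndjson"))
    return {
        str(path): _peek_schema(path)
        for path in track(files, description="Scanning NDJSON schemas...")
    }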
