I needed to inline some external attachments inside already-exported DocumentReferences.
bulk-data-client has some support for doing that, but it doesn't recover from any download errors (which we do see with Cerner for certain notes), does not download in parallel, and wants to download all mimetypes even if it isn't going to inline them. Plus, it's designed to work in the context of the original bulk export, not after the fact.
So I made a tiny little script to inline the attachments of existing ndjson files. And for clarity, when I say "inline" I mean "download the actual note pointed to by a DocumentReference url and put it back inside the DocumentReference in the data field as base64-encoded text". If you're archiving exported ndjson like I am, this is important so that you always have a local copy of the note.
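To make "inline" concrete, here is a minimal sketch of that one step. It is not the script itself: `inline_attachment`, `inline_docref`, and the `fetch` callable (anything that takes a url and returns the downloaded bytes) are hypothetical names for illustration, and authentication, retries, and parallel downloads are left out.

```python
import base64
from typing import Callable


def inline_attachment(attachment: dict, fetch: Callable[[str], bytes]) -> bool:
    """Download attachment["url"] and store it base64-encoded in attachment["data"].

    Returns True if the attachment was changed. The url field is left in place.
    """
    url = attachment.get("url")
    if not url or "data" in attachment:
        return False  # nothing to download, or already inlined
    payload = fetch(url)  # hypothetical downloader supplied by the caller
    attachment["data"] = base64.standard_b64encode(payload).decode("ascii")
    return True


def inline_docref(docref: dict, fetch: Callable[[str], bytes]) -> int:
    """Inline every attachment of one DocumentReference; returns how many changed."""
    changed = 0
    for content in docref.get("content", []):
        attachment = content.get("attachment")
        if attachment and inline_attachment(attachment, fetch):
            changed += 1
    return changed
```

Passing the downloader in as a callable keeps the sketch independent of any particular FHIR client, so it can be tried out with a stub before pointing it at a real server.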
Notes
This only inlines html & text attachments. That's easy to edit at the top of the script if you have different needs, though. (And the contentType field must be present, or we skip the record, since we don't know what mimetype it has.)
This relies heavily on some cumulus-etl code, so make sure that package is visible to your Python (or run the script inside the cumulus-etl docker image).
This will make a second copy of the data in an output folder, rather than modify the input files.
This does not delete the url field, but simply adds a data field and records the encoding.
This is licensed as Apache 2.0.
Run like ./inline.py INPUT_DIR OUTPUT_DIR --authentication-arguments (see --help).
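For anyone who wants to see how the notes above fit together, here is a rough standalone sketch of the file handling. It does not use cumulus-etl and is not the actual script: `INLINE_MIMETYPES`, `wanted`, `inline_file`, `inline_dir`, and the `fetch` callable are all hypothetical, and treating "html & text" as `text/html` and `text/plain` is my assumption. Authentication and the encoding bookkeeping are left out.

```python
import base64
import json
from pathlib import Path
from typing import Callable

# Assumed reading of "html & text"; the real script's list may differ.
INLINE_MIMETYPES = {"text/html", "text/plain"}


def wanted(attachment: dict) -> bool:
    """True if this attachment should be downloaded and inlined."""
    content_type = attachment.get("contentType")
    if not content_type:
        return False  # unknown mimetype, so skip it
    # Drop any parameters such as "; charset=utf-8" before comparing.
    mimetype = content_type.split(";", 1)[0].strip().lower()
    return (
        mimetype in INLINE_MIMETYPES
        and "url" in attachment
        and "data" not in attachment
    )


def inline_file(src: Path, dst: Path, fetch: Callable[[str], bytes]) -> None:
    """Copy one ndjson file to dst, inlining wanted attachments along the way."""
    with src.open(encoding="utf8") as fin, dst.open("w", encoding="utf8") as fout:
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines
            resource = json.loads(line)
            if resource.get("resourceType") == "DocumentReference":
                for content in resource.get("content", []):
                    attachment = content.get("attachment", {})
                    if wanted(attachment):
                        payload = fetch(attachment["url"])
                        # Add the data field; the url field is kept as-is.
                        attachment["data"] = base64.standard_b64encode(payload).decode("ascii")
            fout.write(json.dumps(resource) + "\n")


def inline_dir(input_dir: str, output_dir: str, fetch: Callable[[str], bytes]) -> None:
    """Write an inlined copy of every ndjson file; the inputs are never modified."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(input_dir).glob("*.ndjson")):
        inline_file(src, out / src.name, fetch)
```

Writing to a separate output folder means a failed run can simply be retried against the untouched input files.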