Script to inline DocumentReference URLs #14

mikix · 2023-11-01T17:36:25Z

mikix
Nov 1, 2023
Maintainer

Problem

I needed to inline some external attachments inside already-exported DocumentReferences.

bulk-data-client has some support for doing that, but it doesn't recover from any download errors (which we do see with Cerner for certain notes), does not download in parallel, and wants to download all mimetypes even if it isn't going to inline them. Plus, it's designed to work in the context of the original bulk export, not after the fact.

So I made a tiny little script to inline the attachments of existing ndjson files. And for clarity, when I say "inline" I mean "download the actual note pointed to by a DocumentReference url and put it back inside the DocumentReference in the data field as base64-encoded text". If you're archiving exported ndjson like I am, this is important so that you always have a local copy of the note.
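To make "inline" concrete, here is a minimal sketch of what happens to a single attachment, with no network access involved. The helper name `inline_payload`, the example URL, and the note text are all made up for illustration; the real script further down gets the bytes and the charset from the server response instead.

```python
import base64


def inline_payload(attachment: dict, payload: bytes, charset: str = "utf-8") -> dict:
    """Return a copy of a FHIR attachment with `payload` stored in its data field."""
    # Drop any parameters (like charset) to get the bare mimetype
    mimetype = attachment["contentType"].split(";", 1)[0].strip()

    inlined = dict(attachment)  # shallow copy; the url field is kept as-is
    inlined["data"] = base64.standard_b64encode(payload).decode("ascii")
    inlined["contentType"] = f"{mimetype}; charset={charset}"
    return inlined


# Hypothetical attachment, as it might look right after a bulk export
before = {"contentType": "text/plain", "url": "https://example.com/Binary/123"}
after = inline_payload(before, b"Patient seen today.")
```

After this, `after` still has the original `url`, plus a `data` field holding the base64 text and a `contentType` that records the charset.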

Notes

This only inlines html & text attachments. That's easy to edit at the top if you have different needs though. (And the contentType field must be present, or we skip the record since we don't know what mimetype it has.)
This relies heavily on some cumulus-etl code. So make sure that package is visible to your python (or run the script inside the cumulus-etl docker image).
This will make a second copy of the data in an output folder, rather than modify the input files.
This does not delete the url field, but simply adds a data field and records the encoding in the contentType field.
This is licensed as Apache 2.0
Run like ./inline.py INPUT_DIR OUTPUT_DIR --authentication-arguments (see --help)
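Since the output keeps both the url and the new data field, reading a note back later only needs the standard library. The following is a rough sketch under that assumption, not part of the script itself: the function names `decode_attachment` and `iter_inlined_notes` are hypothetical, and it falls back to utf-8 when no charset was recorded.

```python
import base64
import json
from email.message import Message
from typing import Iterator, Optional


def decode_attachment(attachment: dict) -> Optional[str]:
    """Return the inlined note text, or None if nothing was inlined."""
    if "data" not in attachment:
        return None
    # Let the stdlib parse the charset parameter out of the contentType
    header = Message()
    header["Content-Type"] = attachment.get("contentType", "text/plain")
    charset = header.get_content_charset() or "utf-8"  # assumed fallback
    return base64.standard_b64decode(attachment["data"]).decode(charset)


def iter_inlined_notes(ndjson_text: str) -> Iterator[tuple]:
    """Yield (DocumentReference id, note text) for every inlined attachment."""
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        docref = json.loads(line)
        for content in docref.get("content", []):
            text = decode_attachment(content.get("attachment", {}))
            if text is not None:
                yield docref.get("id"), text
```

Feed it the contents of one of the output ndjson files, and attachments that were skipped (wrong mimetype, failed download) are simply not yielded.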

Script

#!/usr/bin/env python3

import argparse
import asyncio
import base64
import os

from cumulus_etl import cli_utils, common, fhir, store


# These will be inlined
MIMETYPES = {
    "application/xhtml+xml",
    "text/html",
    "text/plain",
}


def define_etl_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

    parser.add_argument("dir_input", metavar="/path/to/input")
    parser.add_argument("dir_output", metavar="/path/to/output")

    cli_utils.add_auth(parser)

    return parser


async def inline_one_docref(client: fhir.FhirClient, docref: dict) -> tuple[int, dict]:
    attachments = [content["attachment"] for content in docref["content"]]
    count = 0

    for attachment in attachments:
        # Only fetch attachments that declare a mimetype, have a url, and aren't inlined yet
        if "contentType" in attachment and "url" in attachment and "data" not in attachment:
            # Drop any parameters (like charset) to get the bare mimetype
            mimetype = attachment["contentType"].split(";", 1)[0].strip()
            if mimetype in MIMETYPES:
                try:
                    response = await client.request("GET", attachment["url"], headers={"Accept": mimetype})
                except Exception:
                    # Skip this attachment but keep going, so one bad note doesn't stop the run
                    print(f"Failed to inline {mimetype} for DocRef {docref['id']}")
                    continue
                attachment["data"] = base64.standard_b64encode(response.content).decode("ascii")
                # Record the encoding next to the mimetype, so the data can be decoded later
                attachment["contentType"] = f"{mimetype}; charset={response.encoding}"
                count += 1

    return count, docref


async def inline_one_file(client: fhir.FhirClient, input_path: str, output_path: str) -> int:
    docrefs = await asyncio.gather(*[
        inline_one_docref(client, docref)
        for docref in common.read_ndjson(input_path)
    ])

    total_count = 0
    with common.NdjsonWriter(output_path) as writer:
        for count, docref in docrefs:
            total_count += count
            writer.write(docref)

    return total_count


async def main():
    parser = define_etl_parser()
    args = parser.parse_args()

    root_input = store.Root(args.dir_input)
    root_output = store.Root(args.dir_output, create=True)

    client = fhir.create_fhir_client_for_cli(args, root_input, ["DocumentReference"])
    async with client:
        print("Inlining…")
        count = 0
        for input_path in common.ls_resources(root_input, "DocumentReference"):
            basename = os.path.basename(input_path)
            output_path = root_output.joinpath(basename)
            count += await inline_one_file(client, input_path, output_path)

    print(f"⭐ Successfully inlined {count} DocRef attachments! ⭐")


if __name__ == "__main__":
    asyncio.run(main())
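One thing to be aware of: `inline_one_file` hands every DocumentReference in a file to `asyncio.gather` at once, so a large file means a lot of simultaneous requests. If your server objects to that, a semaphore can cap how many are in flight. This is only a sketch of that idea; `gather_limited` and the default limit of 5 are made-up names and numbers, not anything from cumulus-etl.

```python
import asyncio


async def gather_limited(coroutines, limit: int = 5) -> list:
    """Like asyncio.gather, but with at most `limit` coroutines running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        # Wait for a free slot before starting the wrapped coroutine
        async with semaphore:
            return await coro

    # Results come back in the same order as the input, just like gather
    return await asyncio.gather(*(run(coro) for coro in coroutines))
```

To try it, the `asyncio.gather(*[...])` call in `inline_one_file` could become `gather_limited([...])` with the same list of `inline_one_docref` calls.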
