Parsing a string column containing JSON values into a typed array #6522
Comments
It turned out the sync reader was not being exercised by basic read tests. Enabling it exposed a broken json parsing algo that had already been fixed in the default reader. Factor out the json parsing to a shared function that both engines can use. While we're at it, factor out sync reader logic that both parquet and json readers can use. Update the basic read unit tests to use both readers.

Fixes #372

Relevant upstream feature request: apache/arrow-rs#6522

Co-authored-by: Nick Lanham <[email protected]>
Sorry this one managed to slip through. Adding `num_buffered_rows` and `has_partial_record` seems perfectly reasonable to me.
take
Hi, do we need to duplicate the changes that are implemented in https://github.com/delta-io/delta-kernel-rs/pull/373/files?
Good question. We're happy to tweak the delta-kernel-rs code to match a new arrow-rs API, as long as the new API covers the use case. I tried to factor that out in the "details" sections of this issue description. If you refer to the `parse_json_impl` method in that PR, it corresponds to my comment in this issue's description.
Seems like the low-level support can go in independently of a decision to expose a new public method.
I prototyped this last month for polars, could share, it's a lot. One big issue, though, is that the struct field isn't suited for json, because a struct needs a schema and assumes json documents are homogeneous. If one has flat mappings with homogeneous fields, then structs make sense, or flat lists of a homogeneous value type, which work with the list type.

However, for arbitrary json, like mappings with heterogeneous keys, nested lists, or list values in mappings, offset arrays don't make sense for deeply nested paths. Also, heterogeneous flat leaf values with no keys are valid json.

To make robust support for json in arrow, the best datatype to build on is string. Alas, the normal string type does not cut it, because we need to know from schemas when a string array is normal text and when it is an array of json strings. If both normal text and json are "string", then the user needs to keep a separate schema outside the one from arrow. That might work for one's own codebase, but not for someone else's.

Therefore I suggest adding a new datatype to Arrow which is identical to the string datatype except it is named "json", to facilitate different handling of that kind of string (with serde).
https://arrow.apache.org/docs/format/CanonicalExtensions.html#json can be used.
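A minimal sketch of tagging a `Utf8` field with that canonical `arrow.json` extension through field metadata (the helper name `json_string_field` is made up for illustration; recent arrow-rs releases may also expose a dedicated extension-type API):

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

// Mark a string column as the canonical `arrow.json` extension type by
// setting the standard extension-name metadata key on the field.
fn json_string_field(name: &str, nullable: bool) -> Field {
    let metadata = HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.json".to_string(),
    )]);
    Field::new(name, DataType::Utf8, nullable).with_metadata(metadata)
}

fn main() {
    let field = json_string_field("payload", true);
    assert_eq!(
        field.metadata().get("ARROW:extension:name").map(String::as_str),
        Some("arrow.json")
    );
}
```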
Perhaps the real concern is about the "bonus" request?
If it's not immediately and obviously desirable to add that method, we should just drop the idea, or split it off as a separate issue. But hopefully we can still do the simple part that unblocks safe parsing of JSON strings?
This seems a bit surprising, given that the feature request is to define new `Decoder` methods.
This is definitely a general problem when parsing arbitrary JSON data, but IMO solving it is out of scope for the main part of this feature request, especially given that arrow-rs/json already has public API methods that parse JSON data with a homogeneous schema into a struct. Spark and other systems have the same. It's just that the existing arrow-rs support is a pain to use if the JSON bytes come from a `StringArray`.

Such capability would be extremely useful for the common case where the schema is in fact homogeneous and not too deep. Meanwhile, for the deeply nested and strongly heterogeneous cases:
Might I suggest taking a look at the new "variant" data type that Spark added last year, and which will likely become an official Parquet data type soon? It's specifically designed to handle deeply nested and strongly heterogeneous data as efficiently as possible. It looks like there's already a general tracking issue for arrow (apache/arrow#42069), and people are already exploring adding that support to arrow-rs/parquet (#6736).
Well, you know how it is: a day or two of struggle writing new code saves 15-30 minutes reading the instructions. I'm a major noob with arrow internals and didn't know all that stuff existed. Polars doesn't expose the JSON type @mbrobbel mentioned (which is perfect for my use case), and some dude instantly closed my issue about it with a one-liner comment telling me to use struct, so I wrote functions to convert vectors of serde_json values into polars columns.

As far as I know, the arrow json reader is for cases where you have a json for each row of a table; but I have a column of jsons and wanted to keep them all in a single column.

The biggest pain by an exponential margin was concatenating the resulting arrays. How do we concatenate arrays? My spaghetti works but I can't say I'm proud of it. Oh, of course, now I see https://docs.rs/arrow/latest/arrow/compute/fn.concat.html

Anyway, arrow is a cool library, I learned a lot, I've mostly avoided it until now. Happy to post a gist, it's about 800 lines; I didn't include the imports, it uses serde_json and polars, hope it helps. It needs translation from polars to arrow and you can absolutely make a better version. Totally agree json -> struct support is a good add. https://gist.github.com/bionicles/f7dd0eac5d3ed44c919a3b7a5c44d285
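For anyone landing here with the same question, a toy sketch of that `concat` kernel in use (made-up data, assuming the `arrow` crate):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::compute::concat;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Two chunks that should end up as one contiguous column.
    let a: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let b: ArrayRef = Arc::new(Int64Array::from(vec![4, 5]));

    // `concat` takes a slice of `&dyn Array` and returns a freshly allocated array.
    let combined = concat(&[a.as_ref(), b.as_ref()])?;
    assert_eq!(combined.len(), 5);
    Ok(())
}
```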
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have a nullable `StringArray` column that contains JSON object literals. I need to JSON-parse the column into a `StructArray` of values following some schema, and NULL input values should become NULL output values.

This can almost be implemented using `arrow_json::reader::ReaderBuilder::build_decoder` and then feeding in the bytes of each string. But the decoder has no concept of record separators in the input stream. Thus, invalid inputs such as blank strings (`""`), truncated records (`"{\"a\":1"`), or multiple objects (`"{\"a\": 1} {\"a\": 2}"`) will confuse the decoding process. If we're lucky, it will produce the wrong number of records, but an adversarial input could easily seem to produce the correct number of records even though no single input string represented a valid JSON object.

Thus, if I want such safety, I'm forced to parse each string as its own `RecordBatch` (which can then be validated independently), and then concatenate them all. Ugly, error-prone, and inefficient (example code, has panics instead of full error handling):
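A minimal sketch of that workaround, assuming the `arrow` crate with its `json` feature; the helper name `parse_json_per_row` is made up for illustration, and it panics rather than returning errors, as noted above:

```rust
use std::sync::Arc;

use arrow::array::{new_null_array, Array, ArrayRef, StringArray, StructArray};
use arrow::compute::concat;
use arrow::datatypes::{DataType, Fields, Schema};
use arrow::json::reader::ReaderBuilder;
use arrow::record_batch::RecordBatch;

/// Parse every input row as a standalone JSON object: one fresh Decoder and
/// one single-row RecordBatch per row, NULL inputs mapped to NULL structs,
/// everything concatenated back into one column at the end.
fn parse_json_per_row(input: &StringArray, fields: Fields) -> ArrayRef {
    let schema = Arc::new(Schema::new(fields.clone()));
    let struct_type = DataType::Struct(fields);

    let rows: Vec<ArrayRef> = (0..input.len())
        .map(|i| {
            if input.is_null(i) {
                // NULL in -> one NULL struct row out.
                return new_null_array(&struct_type, 1);
            }
            // A fresh decoder per row, so a malformed string cannot smear
            // into its neighbours.
            let mut decoder = ReaderBuilder::new(schema.clone())
                .build_decoder()
                .expect("invalid schema");
            decoder
                .decode(input.value(i).as_bytes())
                .expect("decode failed");
            let batch: RecordBatch = decoder
                .flush()
                .expect("flush failed")
                .expect("input produced no record");
            // Validate each mini-batch independently.
            assert_eq!(batch.num_rows(), 1, "expected exactly one JSON object");
            Arc::new(StructArray::from(batch)) as ArrayRef
        })
        .collect();

    // Concatenate all the single-row structs into one array.
    let refs: Vec<&dyn Array> = rows.iter().map(|a| a.as_ref()).collect();
    concat(&refs).expect("concat failed")
}
```

Every row pays for a fresh `Decoder`, a one-row `RecordBatch`, and a final copy during `concat`, which is where the inefficiency comes from.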
Describe the solution you'd like
Ideally, the JSON `Decoder` could define public methods that say how many buffered rows the decoder has, and whether the decoder is currently at a record boundary or not. This is essentially a side-effect-free version of the same check that `Tape::finish` already performs when `Decoder::flush` is called.

That way, the above implementation becomes a bit simpler and a lot more efficient:
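A sketch of what that could look like, assuming the `Decoder` grew the proposed `num_buffered_rows` and `has_partial_record` methods (these are the requested additions, not existing arrow-json API; mapping NULL inputs to a JSON `null` is likewise an assumption):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray, StructArray};
use arrow::datatypes::{Fields, Schema};
use arrow::error::ArrowError;
use arrow::json::reader::ReaderBuilder;

/// Parse a column of JSON object strings into a single StructArray using one
/// shared Decoder, validating after every string that exactly one complete
/// record was buffered.
fn parse_json(input: &StringArray, fields: Fields) -> Result<ArrayRef, ArrowError> {
    let schema = Arc::new(Schema::new(fields));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;

    for i in 0..input.len() {
        // Assumption: feeding a top-level JSON `null` yields a NULL struct row.
        let json = if input.is_null(i) { "null" } else { input.value(i) };
        decoder.decode(json.as_bytes())?;

        // Proposed API (does not exist yet): confirm this string contributed
        // exactly one complete record and left no partial record behind.
        if decoder.num_buffered_rows() != i + 1 || decoder.has_partial_record() {
            return Err(ArrowError::JsonError(format!(
                "input row {i} is not a single complete JSON value"
            )));
        }
    }

    let batch = decoder
        .flush()?
        .ok_or_else(|| ArrowError::JsonError("no rows decoded".to_string()))?;
    Ok(Arc::new(StructArray::from(batch)) as ArrayRef)
}
```

One decoder buffers all the rows and a single `flush` builds the output, so the per-row `RecordBatch` allocations and the final `concat` from the workaround above disappear.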
It would be even nicer if the `parse_json` method could just become part of either arrow-json or arrow-compute, if parsing JSON strings to structs is deemed a general operation that deserves its own API call.

Describe alternatives you've considered
Tried shoving each string manually into a `Decoder` to produce a single `RecordBatch`, but the above-mentioned safety issues made it very brittle (wrong row counts, incorrect values, etc.). Currently using the ugly/slow solution mentioned earlier, which creates and validates one `RecordBatch` per row before concatenating them all into a single `RecordBatch`.