
Parsing a string column containing JSON values into a typed array #6522

Open

scovich opened this issue Oct 7, 2024 · 9 comments
Labels: enhancement (Any new improvement worthy of an entry in the changelog), good first issue (Good for newcomers), help wanted

Comments

scovich commented Oct 7, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have a nullable StringArray column that contains JSON object literals.

I need to JSON parse the column into a StructArray of values following some schema, and NULL input values should become NULL output values.

This can almost be implemented using arrow_json::reader::ReaderBuilder::build_decoder and then feeding in the bytes of each string. But the decoder has no concept of record separators in the input stream. Thus, invalid inputs such as blank strings (""), truncated records ("{\"a\":1"), or multiple objects ("{\"a\": 1} {\"a\": 2}") will confuse the decoding process. If we're lucky, it will produce the wrong number of records, but an adversarial input could easily seem to produce the correct number of records even though no single input string represented a valid JSON object. Thus, if I want such safety, I'm forced to parse each string as its own RecordBatch (which can then be validated independently), and then concatenate them all. Ugly, error-prone, and inefficient:

use std::io::{BufRead, BufReader};

use arrow_array::{RecordBatch, StringArray};
use arrow_json::ReaderBuilder;
use arrow_schema::{ArrowError, SchemaRef};
use arrow_select::concat::concat_batches;

pub fn parse_json(
    json_strings: StringArray,
    schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    // Use batch size of 1 to force one record per string input
    let mut decoder = ReaderBuilder::new(schema.clone())
        .with_batch_size(1)
        .build_decoder()?;

    // Feed a single string into the decoder and flush it to a record batch
    let mut parse_one = |json_string: Option<&str>| -> Result<RecordBatch, ArrowError> {
        // NOTE: null input becomes empty object (all fields null)
        let s = json_string.unwrap_or("{}");
        let mut reader = BufReader::new(s.as_bytes());
        let buf = reader.fill_buf()?;
        let read = buf.len();
        let decoded = decoder.decode(buf)?;
        assert_eq!(decoded, read);
        Ok(decoder.flush()?.unwrap())
    };
    let output: Vec<_> = json_strings
        .iter()
        .map(parse_one)
        .collect::<Result<_, _>>()?;
    concat_batches(&schema, output.iter())
}

(example code, has panics instead of full error handling)
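To make the hazard concrete, here is a minimal sketch (hypothetical one-field schema and made-up input) showing a multi-object string sailing through the raw Decoder without complaint when nothing checks record boundaries:

use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn demo_missing_record_boundaries() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;

    // One *input string* that happens to contain two JSON objects
    let input = br#"{"a": 1} {"a": 2}"#;
    let consumed = decoder.decode(input)?;
    assert_eq!(consumed, input.len()); // every byte accepted, no error

    // flush reports two rows for what the caller thought was a single value
    let batch = decoder.flush()?.unwrap();
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}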

Describe the solution you'd like

Ideally, the JSON Decoder could define public methods that report how many buffered rows the decoder has, and whether the decoder is currently at a record boundary or not. This is essentially a side-effect-free version of the same check that Tape::finish already performs when Decoder::flush is called:

impl TapeDecoder {
    ...
    /// The number of buffered rows this decoder has, including any row in progress if [`has_partial_row`] returns true.
    pub fn num_buffered_rows(&self) -> usize {
        self.cur_row
    }

    /// True if the decoder is part way through decoding a record. If so, calling [`finish`] would return an error.
    pub fn has_partial_row(&self) -> bool {
        !self.stack.is_empty()
    }
}

and

impl Decoder {
    ...
    /// The number of records currently buffered in this decoder.
    /// A successful call to [`flush`] would produce this many rows.
    pub fn num_buffered_records(&self) -> usize {
        self.tape_decoder.num_buffered_rows()
    }

    /// True if the decoder is part way through decoding a record.
    /// Calling [`flush`] on a partial record would return an error.
    pub fn has_partial_record(&self) -> bool {
        self.tape_decoder.has_partial_row()
    }
}

That way, the above implementation becomes a bit simpler and a lot more efficient:

// (Same imports as the first example above.)
pub fn parse_json(
    json_strings: StringArray,
    schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    let mut decoder = ReaderBuilder::new(schema.clone())
        .with_batch_size(json_strings.len())
        .build_decoder()?;

    // Feed a single string into the decoder and verify it buffered exactly `expected_rows` records
    let mut parse_one = |json_string: &str, expected_rows: usize| -> Result<(), ArrowError> {
        let mut reader = BufReader::new(json_string.as_bytes());
        let buf = reader.fill_buf()?;
        let read = buf.len();
        let decoded = decoder.decode(buf)?;
        assert_eq!(decoded, read);
        assert!(!decoder.has_partial_record());
        assert_eq!(decoder.num_buffered_records(), expected_rows);
        Ok(())
    };

    // Make sure each string produces exactly one record.
    for (i, json_string) in json_strings.iter().enumerate() {
        // NOTE: None becomes a NULL object
        parse_one(json_string.unwrap_or("null"), i + 1)?;
    }
    decoder.flush().transpose().unwrap()
}
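
For illustration, here's a hypothetical use of the parse_json above (the column name, schema, and values are made up):

use std::sync::Arc;

use arrow_array::{cast::AsArray, types::Int64Type, StringArray};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn example() -> Result<(), ArrowError> {
    // One JSON object literal per input row
    let json_strings = StringArray::from(vec![Some(r#"{"a": 1}"#), Some(r#"{"a": 2}"#)]);
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));

    let batch = parse_json(json_strings, schema)?;
    assert_eq!(batch.num_rows(), 2);
    assert_eq!(batch.column(0).as_primitive::<Int64Type>().value(1), 2);
    Ok(())
}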

It would be even nicer if the parse_json method could just become part of either arrow-json or arrow-compute, if parsing JSON strings to structs is deemed a general operation that deserves its own API call.

Describe alternatives you've considered

Tried shoving each string manually into a Decoder to produce a single RecordBatch, but the above-mentioned safety issues made it very brittle (wrong row counts, incorrect values, etc.). Currently using the ugly/slow solution mentioned earlier, which creates and validates one RecordBatch per row before concatenating them all into a single RecordBatch.

scovich added the enhancement label on Oct 7, 2024
scovich added a commit to delta-io/delta-kernel-rs that referenced this issue Oct 7, 2024
It turned out the sync reader was not being exercised by basic read
tests. Enabling it exposed a broken json parsing algo that had already
been fixed in the default reader.

Factor out the json parsing to a shared function that both engines can
use.

While we're at it, factor out sync reader logic that both parquet and
json readers can use.

Update the basic read unit tests to use both readers.

Fixes #372

Relevant upstream feature request:
apache/arrow-rs#6522

---------

Co-authored-by: Nick Lanham <[email protected]>
tustvold (Contributor)

Sorry this one managed to slip through. Adding num_buffered_rows and has_partial_record seems perfectly reasonable to me.

jatin510

take

jatin510

Hi @scovich @tustvold, I am currently looking into this.

Do we need to duplicate the changes implemented in https://github.com/delta-io/delta-kernel-rs/pull/373/files in arrow-rs, along with the num_buffered_rows and has_partial_record functions?

scovich (Author) commented Nov 18, 2024

Do we need to duplicate the changes implemented in https://github.com/delta-io/delta-kernel-rs/pull/373/files in arrow-rs, along with the num_buffered_rows and has_partial_record functions?

Good question. We're happy to tweak the delta-kernel-rs code to match a new arrow-rs API, as long as the new API covers the use case. I tried to factor that out in the "details" sections of this issue description.

If you're referring to the parse_json_impl method in that PR, it corresponds to my comment in this issue's description:

It would be even nicer if the parse_json method could just become part of either arrow-json or arrow-compute, if parsing JSON strings to structs is deemed a general operation that deserves its own API call.

Seems like the low-level support can go in independently of a decision to expose a new public parse_json method in arrow-compute or arrow-json?

bionicles commented Feb 3, 2025

I prototyped this last month for polars and could share it, but it's a lot of code. One big issue, though, is that the struct field isn't suited for JSON, because a struct needs a schema and assumes the JSON documents are homogeneous.

If one has flat mappings with homogeneous fields, then structs make sense; flat lists of a homogeneous value type likewise work with the list type.

However, for arbitrary JSON, like mappings with heterogeneous keys, nested lists, or list values inside mappings, offset arrays don't make sense for deeply nested paths. Also, heterogeneous flat leaf values with no keys are valid JSON.

To make robust support for JSON in Arrow, the best datatype to build on is string.

Alas, the normal string type does not cut it, because we need to know from the schema when a string array is plain text and when it is an array of JSON strings. If both plain text and JSON are "string", then the user needs to keep a separate schema outside the one from Arrow. That might work for one's own codebase, but not for someone else's.

Therefore I suggest adding a new datatype to Arrow that is identical to the string datatype except it is named "json", to facilitate different handling of that kind of string (with serde).

mbrobbel (Contributor) commented Feb 3, 2025

Therefore I suggest adding a new datatype to Arrow that is identical to the string datatype except it is named "json", to facilitate different handling of that kind of string (with serde)

https://arrow.apache.org/docs/format/CanonicalExtensions.html#json can be used.
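
For illustration, a minimal sketch of tagging a string field that way through the extension metadata key defined in the Arrow format spec (the field name is made up; recent arrow-rs releases may also offer typed helpers for the canonical extension types):

use std::collections::HashMap;

use arrow_schema::{DataType, Field};

// Mark a utf8 column as carrying JSON, per the canonical extension type spec.
// "ARROW:extension:name" and "arrow.json" come from the Arrow format docs;
// the field name "payload" is just an example.
fn json_string_field() -> Field {
    Field::new("payload", DataType::Utf8, true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.json".to_string(),
    )]))
}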

scovich (Author) commented Feb 4, 2025

Perhaps the real concern is about the "bonus" request?

It would be even nicer if the parse_json method could just become part of either arrow-json or arrow-compute, if parsing JSON strings into structs is deemed a general operation that deserves its own API call.

If it's not immediately and obviously desirable to add that method, we should just drop the idea, or split it off as a separate issue.

But hopefully we can still do the simple part that unblocks safe parsing of JSON strings?

scovich (Author) commented Feb 4, 2025

I prototyped this last month for polars, could share, it's a lot

This seems a bit surprising, given that the feature request is to define new num_buffered_rows and has_partial_record methods that publicly expose information the JSON parser already tracks internally?

One big issue, though, is that the struct field isn't suited for JSON, because a struct needs a schema and assumes the JSON documents are homogeneous.

This is definitely a general problem when parsing arbitrary JSON data, but IMO solving it is out of scope for the main part of this feature request, especially given that arrow-rs/json already has public API methods that parse JSON data with a homogeneous schema into a struct. Spark and other systems have the same. It's just that the existing arrow-rs support is a pain to use if the JSON bytes come from a StringArray instead of a file. Thus, the basic ask is super simple: expose the utility methods that allow a user to map from StringArray to StructArray using the exact same existing capability (with all the same limitations) that arrow_json::reader already provides. The only difference is the source of the raw JSON bytes.

Such capability would be extremely useful for the common case where the schema is in fact homogeneous and not too deep.
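
(For reference, the existing file-oriented path looks roughly like this; the schema and the newline-delimited JSON file path are hypothetical:)

use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn read_ndjson() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    // The reader consumes newline-delimited JSON records from any BufRead source
    let file = File::open("data.ndjson")?;
    let reader = ReaderBuilder::new(schema).build(BufReader::new(file))?;
    for batch in reader {
        println!("decoded {} rows", batch?.num_rows());
    }
    Ok(())
}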

Meanwhile:

for arbitrary JSON, like mappings with heterogeneous keys, nested lists, or list values inside mappings, offset arrays don't make sense for deeply nested paths. Also, heterogeneous flat leaf values with no keys are valid JSON.

and

I suggest adding a new datatype to Arrow that is identical to the string datatype except it is named "json", to facilitate different handling of that kind of string (with serde)

Might I suggest taking a look at the new "variant" data type that Spark added last year, and which will likely become an official Parquet data type soon?

It's specifically designed to handle deeply nested and strongly heterogenous data as efficiently as possible. It looks like there's already a general tracking issue for arrow (apache/arrow#42069), and people are already exploring adding that support to arrow-rs/parquet (#6736).

bionicles

well, you know how it is, a day or two of struggle writing new code saves 15-30 minutes reading the instructions

I'm a major noob with arrow internals and didn't know all that stuff existed. polars doesn't expose the JSON type @mbrobbel mentioned (which is perfect for my use case), and some dude instantly closed my issue about it with a one-liner telling me to use struct, so I wrote functions to convert vectors of serde_json values into polars columns.

As far as I know, the arrow JSON reader is for cases where you have one JSON document per row of a table; but I have a column of JSON documents and wanted to keep them all in a single column.

The biggest pain by an exponential margin was concatenating the Box<dyn Array> values, because I didn't know how they worked. I assumed offsets were cumulative, then didn't realize there was no need to push the initial zero, then didn't realize I couldn't skip the zeroth array, then realized I had to try_push subarray lengths onto the offsets instead of pushing offsets onto offsets, and then hit various recursion bugs.

[code screenshots omitted]

How do we concatenate arrays? My spaghetti works but I can't say I'm proud of it

[code screenshot omitted]

Oh, of course now I see https://docs.rs/arrow/latest/arrow/compute/fn.concat.html
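
(For anyone else hitting this: a tiny sketch of that concat function on two same-typed arrays; the values are just examples.)

use arrow::array::{cast::AsArray, ArrayRef, StringArray};
use arrow::compute::concat;
use arrow::error::ArrowError;

fn concat_example() -> Result<(), ArrowError> {
    let a = StringArray::from(vec!["x", "y"]);
    let b = StringArray::from(vec!["z"]);
    // concat handles offsets, nulls, and nesting for arrays of the same type
    let combined: ArrayRef = concat(&[&a, &b])?;
    assert_eq!(combined.len(), 3);
    assert_eq!(combined.as_string::<i32>().value(2), "z");
    Ok(())
}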

Anyway, arrow is a cool library and I learned a lot. I've mostly avoided Box<dyn T> for performance, but dyn seems like a more extensible approach to type heterogeneity than enums, and Arrow seems pretty performant.

Happy to post a gist; it's about 800 lines, doesn't include the imports, and uses serde_json and polars. Hope it helps. It needs translation from polars to arrow and you can absolutely make a better version. Totally agree that JSON -> struct support is a good addition.

https://gist.github.com/bionicles/f7dd0eac5d3ed44c919a3b7a5c44d285
