Get results as records - an array of objects. #84

Open · hello-world-bfree opened this issue Dec 29, 2024 · 6 comments

hello-world-bfree commented Dec 29, 2024

This would be similar to how node-postgres returns results by default. It doesn't need to be the default here; it only needs to be available. The wide variety of DuckDB data types makes this a more difficult proposition than it might seem. Ideally the values would be JS native types ready for use and presentation, rather than DuckDBValue types.

jraymakers (Contributor) commented Dec 29, 2024

There are getRows and getColumns methods on DuckDBDataChunk, DuckDBResult, and DuckDBResultReader; each returns an array of arrays of values (representing either rows or columns). If I understand correctly, this issue is about adding a getRowsAsObjects method that returns an array of objects, each with keys corresponding to the column names. Is that correct?
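
A minimal sketch of what such a method could do, built from the existing array-of-arrays output (the function name and shapes here are hypothetical, not the eventual API):

```ts
// Hypothetical sketch: zip column names with each row from getRows().
function rowsToObjects<T>(rows: T[][], names: string[]): Record<string, T>[] {
  return rows.map((row) => {
    const obj: Record<string, T> = {};
    names.forEach((name, i) => { obj[name] = row[i]; });
    return obj;
  });
}
```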

A related idea is to add a getColumnsAsObject method, which returns a (single) object, with keys corresponding to the column names, with values that are arrays of values, one array for each column. Is that useful?
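
And the column-oriented counterpart, again only as a sketch:

```ts
// Hypothetical sketch: one object keyed by column name, one array per column.
function columnsToObject<T>(columns: T[][], names: string[]): Record<string, T[]> {
  const obj: Record<string, T[]> = {};
  names.forEach((name, i) => { obj[name] = columns[i]; });
  return obj;
}
```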

In both of the above cases, there is a complication that column names in results are not necessarily unique. Non-unique column names can happen fairly often, for example when performing joins. DuckDB has a pattern for making column names unique by appending a number preceded by an underscore (e.g. _1, _2); it uses this when creating tables from queries, because tables must have unique column names. I'd probably use this same pattern to produce unique column keys in result objects when needed. That means this way of getting result data could return slightly different column names. Would this be okay for your use case?
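
For illustration, the deduplication might look like the following (a simplified sketch, not DuckDB's actual implementation, which also has to avoid collisions with pre-existing names like my_col_1):

```ts
// Sketch of the underscore-number pattern: the first occurrence keeps its
// name; later duplicates become name_1, name_2, ...
function dedupeColumnNames(names: string[]): string[] {
  const seen = new Map<string, number>();
  return names.map((name) => {
    const count = seen.get(name) ?? 0;
    seen.set(name, count + 1);
    return count === 0 ? name : `${name}_${count}`;
  });
}

// dedupeColumnNames(['id', 'name', 'id']) -> ['id', 'name', 'id_1']
```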

Currently, the data values returned by getRows and getColumns are of type DuckDBValue, which includes both JS primitives and special-purpose objects for the more complex types (e.g. timestamps, decimals, lists, structs, etc.). For any of the above methods, options could be provided to transform the data values into other forms, such as ensuring they are all built-in JS types (primitives, arrays, objects, etc.), or ensuring the representation is serializable to JSON. (These two are similar but distinct, because some types, such as bigint, are JS built-ins but are not serializable to JSON.) Some of these transformations might be lossy. It seems this is also part of what you'd like - is that correct?
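
To make the built-in vs. JSON-serializable distinction concrete:

```ts
// bigint is a JS built-in, but JSON.stringify rejects it:
JSON.stringify({ big: 123n });
// -> TypeError: Do not know how to serialize a BigInt

// A replacer can make it serializable, at the cost of changing the type:
JSON.stringify({ big: 123n }, (_key, value) =>
  typeof value === 'bigint' ? value.toString() : value,
);
// -> '{"big":"123"}'
```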

For the above, are you more interested in JS built-ins, or JSON serializability? For example, there are a few reasonable transformations for TIMESTAMP values: the raw epoch value (in microseconds) as a bigint, a string in ISO 8601 format, or a JS Date object. (The conversion to JS Date would be lossy in both precision and range. Only the string would be JSON-serializable.) Another example is DECIMAL values, which could be returned as objects including the width, scale, and numeric value (the latter as a JS bigint), or (with loss of precision) converted to a JS number. My inclination is to provide enough options to select between these reasonable alternatives, but I'm curious to know which you would use.
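
For example, starting from a raw TIMESTAMP epoch value in microseconds, the three alternatives look like this (example value hypothetical):

```ts
const micros = 1735102800000000n; // hypothetical raw TIMESTAMP value (microseconds)

// 1. Raw epoch micros: lossless, a JS built-in, but not JSON-serializable.
const raw = micros;

// 2. JS Date: lossy (truncated to milliseconds, limited range).
const date = new Date(Number(micros / 1000n)); // 2024-12-25T05:00:00.000Z

// 3. ISO 8601 string: JSON-serializable.
const iso = date.toISOString();
```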

@jraymakers jraymakers added the enhancement New feature or request label Dec 29, 2024
@jraymakers jraymakers self-assigned this Dec 29, 2024
hello-world-bfree (Author) commented

> If I understand correctly, this issue is about adding a getRowsAsObjects method that returns an array of objects, each with keys corresponding to the column names. Is that correct?

Yes, sir! What I've been doing so far is getting back the DuckDBResultReader, procuring the columnNames and columnTypes, assigning a function to each corresponding index for how to parse that particular data type (usually .toString(), since I don't immediately need the more complex types for my use case), and then working through the .getRows() results, applying the corresponding parsing function to each value.
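
Roughly, that looks like this (method names as described above; exact signatures may differ, and the "parser" here just stringifies anything non-primitive rather than dispatching on the column type):

```ts
import { DuckDBResultReader, DuckDBValue } from '@duckdb/node-api';

// Sketch of the per-column-parser approach: one parsing function per
// column index, applied to every value from getRows().
function readerToRecords(reader: DuckDBResultReader): Record<string, unknown>[] {
  const names = reader.columnNames();
  const parsers = reader.columnTypes().map(
    () => (value: DuckDBValue) =>
      value !== null && typeof value === 'object' ? value.toString() : value,
  );
  return reader.getRows().map((row) => {
    const record: Record<string, unknown> = {};
    row.forEach((value, i) => { record[names[i]] = parsers[i](value); });
    return record;
  });
}
```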

For my use case, most values don't actually need parsing. It's the date, timestamp, etc. types that I'm having to address. You make a great point about JS primitives vs. JSON serializability. My use case would lean more towards JSON serializable. In future, more complicated data pulls, I'd love to have the complex types unfurl into nested JSON-serializable objects, i.e., lists, arrays, and structs represented in the record as nested values:

```js
[{
    my_array: [
        1,
        2
    ],
    my_struct: {
        one: 'one',
        two: 'two',
        three: 3,
        is_bool: true
    }
}]
```

> I'd probably use this same pattern to produce unique column keys in result objects when needed. That means this way of getting result data could return slightly different column names. Would this be okay for your use case?

That makes a ton of sense to me!

> For the above, are you more interested in JS built-ins, or JSON serializability? For example, there are a few reasonable transformations for TIMESTAMP values: the raw epoch value (in microseconds) as a bigint, a string in ISO 8601 format, or a JS Date object. (The conversion to JS Date would be lossy in both precision and range. Only the string would be JSON-serializable.)

TIMESTAMP as a string makes a lot of sense, especially in the full ISO 8601 format. DECIMAL I'm much less confident about; it would definitely be best to have customizable options there.

SIDE NOTE: I did notice an unexpected result in node-neo when running .toString() on a timestamptz data type. If I supply the query select '2024-12-25'::timestamptz as my_tz then I get back 2024-12-25 05:00:00 in node-neo whereas I get back 2024-12-25 00:00:00-05 in the CLI. I'd expect the latter.

> A related idea is to add a getColumnsAsObject method, which returns a (single) object, with keys corresponding to the column names, with values that are arrays of values, one array for each column. Is that useful?

Could definitely be useful! I don't personally have an immediate use case but I could definitely picture some.

jraymakers (Contributor) commented

Thanks for the context! It helps me understand how to best support this.

Also thanks for the reminder about the TIMESTAMPTZ toString behavior. I have a TODO in the code for this, but I should file an issue to track it. I have code to display the timezone offset, but the problem is getting the correct timezone to pass to that logic. Ideally it would use DuckDB's current TimeZone setting, but that setting is connection-specific and requires running a SQL query to get. Having a toString method of a value depend on that kind of data is challenging. I could allow passing in the timezone to toString, but that just puts the burden of getting it on to the caller. So I'm still not sure the best way to handle this.

freakynit (Contributor) commented

@hello-world-bfree

I was stuck with the same problem. Had to implement it all by hand.
Here is the schema and rows parser, if it helps: https://gist.github.com/freakynit/a539f55cf20baf607c970eb7ad6aaf68
Usage is shown in a comment at the bottom of the gist.

Do review it before relying on it, as I wrote it in a very short time.

cc @jraymakers: see if this kind of conversion could be included in a utility API in the node package itself.

jraymakers (Contributor) commented

Yes; I consider this issue to be tracking support for an easy way to construct a JSON-serializable representation of a result.

It's a good point that there should be a way to do that for the column names & types as well. That shouldn't be hard to include.
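
One possible JSON-serializable shape covering names, types, and rows (purely illustrative; the eventual API may differ):

```ts
interface SerializedResult {
  columnNames: string[];   // deduplicated where needed
  columnTypes: string[];   // type names rendered as strings
  rows: Record<string, unknown>[];
}
```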

hello-world-bfree (Author) commented

Thanks @freakynit!

@jraymakers - Testing out passing the timestamptz string value to Date actually works nicely; i.e., new Date('2024-12-25 00:00:00-05') produces 2024-12-25T05:00:00.000Z. With that being the case, I'm not nearly as worried about potential issues there.

The more I go through the process of creating a result parser, the more I appreciate the potential decision points. I was originally concerned about the timestamp types since they provide micros instead of millis, but as mentioned above, passing the .toString() values to Date when needed alleviates a lot of those concerns. I'm not quite sure what makes the most sense for DuckDBIntervalValue or the more fine-grained timestamp values, e.g., DuckDBTimestampNanosecondsValue, but .toDouble() and the string versions of the granular timestamps seem more than adequate.

I do think I'll also have a function to create an array-of-arrays version of serializable results, so may I humbly suggest that as a utility as well. Honestly, I have an almost immediate use case for all 3 possible results discussed - a serializable array of records, a serializable array of arrays, and the standard array of arrays of DuckDB-typed values - right out of the gate.

Thanks again for all your help!
