Skip to content

feat(extensions): add unsigned integer extension types (u8, u16, u32, u64)#953

Open
kadinrabo wants to merge 24 commits intosubstrait-io:mainfrom
kadinrabo:feat/unsigned-extension-types
Open

feat(extensions): add unsigned integer extension types (u8, u16, u32, u64)#953
kadinrabo wants to merge 24 commits intosubstrait-io:mainfrom
kadinrabo:feat/unsigned-extension-types

Conversation

@kadinrabo
Copy link
Copy Markdown
Contributor

@kadinrabo kadinrabo commented Jan 29, 2026

Description

Adds unsigned integer types (u8, u16, u32, u64) as first-class extension types with arithmetic function support and test coverage.

  • Self-contained unsigned_integers.yaml with type definitions (string structure encoding) and arithmetic function overloads (add, subtract, multiply, divide, modulus, sum, min, max)
  • functions_arithmetic.yaml is untouched
  • Test cases in tests/cases/arithmetic_unsigned/, following the arithmetic_decimal convention
  • Generic udtArg grammar rule for parsing UDT literals in test cases
  • Test framework updated to scan all extension YAML files for function definitions

Closes #944 and follows up community agreement from Substrait Meeting Notes on 28 Jan 2026 that type variations are not appropriate for unsigned integers due to differing semantics.


This change is Reviewable

@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from e4584ca to daccb86 Compare January 30, 2026 15:36
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from daccb86 to d19c87a Compare January 30, 2026 15:38
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 6d28b59 to 4af53f8 Compare January 30, 2026 20:24
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 4af53f8 to 7d6f7f9 Compare January 30, 2026 20:53
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo marked this pull request as ready for review January 30, 2026 21:06
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo marked this pull request as draft January 30, 2026 21:07
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch 3 times, most recently from 8650358 to 10baf81 Compare February 4, 2026 15:53
@kadinrabo kadinrabo marked this pull request as ready for review February 4, 2026 15:59
Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this 🙂

Left a few more comments

-
name: "divide"
description: >
Divide x by y. Partial values are truncated (rounded towards 0).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if y is zero?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the divide description from

-
name: "divide"
description: >
Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0).
The `on_division_by_zero` option governs behavior in cases where y is 0. If the option is IEEE then
the IEEE754 standard is followed: all values except +/-infinity return NaN and +/-infinity are unchanged.
If the option is LIMIT then the result is +/-infinity in all cases.
If either x or y are NaN then behavior will be governed by `on_domain_error`.
If x and y are both +/-infinity, behavior will be governed by `on_domain_error`.
impls:
here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go with something like this, thoughts?

Divide x by y. Partial values are truncated (i.e. rounded towards 0).
The `on_division_by_zero` option governs behavior in cases where y is 0.                                                                                                                                                       
If either x or y are out of range, behavior will be governed by `on_domain_error`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have null cases added to these tests? Might be overkill but can't hurt.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep parity with signed tests here, which don't have null cases for add/subtract/multiply either. Open to adding them if you feel strongly, but might be better as a follow up

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with not doing this for now.

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall to me.

The one thing I would like to see is that for the test files, let's map over not just the basic tests but also the overflow, null handling, etc tests that are applicable to usigned integers. I think the floating exception tests are the only ones that don't make apply.

-
name: "divide"
description: >
Divide x by y. Partial values are truncated (rounded towards 0).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the divide description from

-
name: "divide"
description: >
Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0).
The `on_division_by_zero` option governs behavior in cases where y is 0. If the option is IEEE then
the IEEE754 standard is followed: all values except +/-infinity return NaN and +/-infinity are unchanged.
If the option is LIMIT then the result is +/-infinity in all cases.
If either x or y are NaN then behavior will be governed by `on_domain_error`.
If x and y are both +/-infinity, behavior will be governed by `on_domain_error`.
impls:
here.

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment on the definition of divide, and I think we should pull modulus out of this entirely because the existing definition could use a once over, but otherwise looks good to me.

The core set of functions you've added, along with the tests files, set a good example for how to add both future functions for unsigned integers, and also for adding new types.

division_type:
values: [ TRUNCATE, FLOOR ]
overflow:
values: [ SILENT, SATURATE, ERROR ]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can... modulus even overflow?

value: u!u8
options:
division_type:
values: [ TRUNCATE, FLOOR ]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is division_type an option we need here. The signed modulus operator defines 2 tests for them:

# division_type: Examples demonstrating truncate and floor division types
modulus(8::i8, -3::i8) [division_type:TRUNCATE] = 2::i8
modulus(8::i8, -3::i8) [division_type:FLOOR] = -1::i8
. I'm not sure this is applicable for unsigned ints.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with not doing this for now.

value: u!u8
options:
overflow:
values: [ SILENT, SATURATE, ERROR ]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With signed integers, the overflow cases involve a sign change.

# overflow: Examples demonstrating overflow behavior
divide(-9223372036854775808::i64, -1::i64) [overflow:ERROR] = <!ERROR>
divide(-128::i8, -1::i8) [overflow:SATURATE] = 127::i8

With unsigned integers, this case isn't applicable, and in general overflow shouldn't be possible so we can drop this option fully for all of the impls.

@vbarua vbarua added the PMC Ready PRs ready for review by PMCs label Mar 6, 2026
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, thanks for working on this Kadin.

Looking at this has definitely made me notice some stuff in the existing arithmetic extension that could use improvements as well 🧹

Comment on lines +435 to +436
# Type is "u!" + identifier, e.g., "u!u8"
type_str = "u!" + ctx.Identifier().getText().lower()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this handle YAML dependency reference or that is out of scope in the test?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By yaml dependency reference, are you talking about this feature for extensions?

dependencies:
# For reusing type classes and type variations from other extension files.
# The keys are namespace identifiers that you can then use as dot-separated
# prefix for type class and type variation names in functions and the base
# type class for variations. The values must be extension URNs, following
# the same format and conventions as those used in the proto plans.
type: object
patternProperties:
"^[a-zA-Z_\\$][a-zA-Z0-9_\\$]*$":
type: string

I don't think the test cases support dependency references, yet

substrait/tests/README.md

Lines 92 to 101 in 00bc3c2

### Spec
```
doc := <version>
<include>
(<dependency>)*
((<test_group>)?(<test_case>)+\n)+
version := ### SUBSTRAIT_SCALAR_TEST: <test_library_version>
include := ### SUBSTRAIT_INCLUDE: <uri>
dependency := ### SUBSTRAIT_DEPENDENCY: <uri>

Copy link
Copy Markdown
Contributor

@yongchul yongchul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Mar 25, 2026

@yongchul what does

+0 using string for representation ...

from you mean here? Is that an approval with no vote?

@jacques-n
Copy link
Copy Markdown
Contributor

jacques-n commented Mar 30, 2026

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend)

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Mar 31, 2026

I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend)

There's a bit of tension here between a good representation for the test cases, and a good representation for actual systems.

For tests, I find it hard to read them when I have to convert the i8 into u8 values in my head. Like this is what a test for overflow would look like:

add(-1::u!u8, 1:u!u8)

What I want is to be able to use the string format in the tests, which makes them readable

add('255':u!u8, '1':u!u8)

and use the signed type encoding for actual systems. Right now we're piggy-backing off of the struct representation defined for the udt to define literals of udt's in the tests. I'm mulling some changes around this, either to allow for multiple struct encodings OR to define test encodings separately.

@jacques-n
Copy link
Copy Markdown
Contributor

There's a bit of tension here between a good representation for the test cases...

Agree that -1 would suck for test cases. I don't have great ideas on how to resolve the mismatch. One other option I could see is to use the next integer size up for the smaller ones and decimal for the biggest one. then representation isn't ugly and always safe (e.g. a u8 always fits in a u16 so a system that doesn't understand u8 but understands u16 could work with it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling of unsigned numeric types

5 participants