Commit 2bce568

phillipleblanc, alamb, himadripal, Blizzara, findepi authored
Support converting large dates (i.e. +10999-12-31) from string to Date32 (#7074)
* Support converting large dates (i.e. +10999-12-31) from string to Date32
* Fix lint
* Update arrow-cast/src/parse.rs
  Co-authored-by: Andrew Lamb <[email protected]>
* fix: issue introduced in #6833 - less than equal check for scale in decimal conversion (#7070)
  * fix <= check for scale in decimal conversion
  * Update arrow-cast/src/cast/mod.rs name change
    Co-authored-by: Arttu <[email protected]>
  * remove incorrect comment
  Co-authored-by: Arttu <[email protected]>
* minor: re-export `OffsetBufferBuilder` in `arrow` crate (#7077)
* Add another decimal cast edge test case (#7078)
  * Add another decimal cast edge test case
    Before 1019f5b this test would fail, as the cast produced 1. 0 is an edge case worth explicitly testing for.
  * typo/fmt
    Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
  Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
* Support both 0x01 and 0x02 as type for list of booleans in thrift metadata (#7052)
  * Support both 0x01 and 0x02 as type for list of booleans
  * Also support 0 for false inside boolean collections
  * Use hex notation in tests
* Fix LocalFileSystem with range request that ends beyond end of file (#6751)
  * Fix LocalFileSystem with range request that ends beyond end of file
  * fix windows
  * add comment
  * Seek error
  * fix seek check
  * remove windows flag
  * Get file length from file metadata
* Introduce `UnsafeFlag` to manage disabling `ArrayData` validation (#7027)
  * Introduce UnsafeFlag to manage disabling validation
  * fix docs
* Refactor arrow-ipc: Rename `ArrayReader` to `RecordBatchDecoder` (#7028)
  * Rename `ArrayReader` to `RecordBatchDecoder`
  * Remove alias for `self`
* Minor: Update release schedule (#7086)
  * Minor: Update release schedule
  * realism
* Refactor some decimal-related code and tests (#7062)
  * Refactor some decimal-related code and tests in preparation for adding Decimal32 and Decimal64 support
  * Fixed symbol
  * Apply PR feedback
  * Fixed format problem
  * Fixed logical merge conflicts
  * PR feedback
* Refactor arrow-ipc: Move `create_*_array` methods into `RecordBatchDecoder` (#7029)
  * Move `create_primitive_array` into RecordBatchReader
  * Move `create_list_array` into RecordBatchReader
  * Move `create_dictionary_array` into RecordBatchReader
* Print Parquet BasicTypeInfo id when present (#7094)
  * Print Parquet BasicTypeInfo id when present
  * Improve print_schema documentation
  * tiny cleanup
* Add a custom implementation `LocalFileSystem::list_with_offset` (#7019)
  * Initial change from Daniel.
  * Upgrade unit test to be more generic.
  * Add comments on why we have filter
  * Cleanup unit tests.
  * Update object_store/src/local.rs
    Co-authored-by: Adam Reeve <[email protected]>
  * Add changes suggested by Adam.
  * Cleanup match error.
  * Apply formatting changes suggested by cargo +stable fmt --all.
  * Apply cosmetic changes suggested by clippy.
  * Upgrade test_path_with_offset to create temporary directory + files for testing rather than pointing to existing dir.
  Co-authored-by: Adam Reeve <[email protected]>
* fix: first none/empty list in `ListArray` panics in `cast_with_options` (#7065)
  * fix: first none in `ListArray` panics in `cast_with_options`
  * simplify
  * fix
  * Update arrow-cast/src/cast/list.rs
    Co-authored-by: Jeffrey Vo <[email protected]>
  Co-authored-by: Jeffrey Vo <[email protected]>
* Benchmarks for Arrow IPC writer (#7090)
  * Add benchmarks for Arrow IPC writer
  * Add benchmarks for Arrow IPC writer
  * reuse target buffer
  * rename, etc
  * Add compression type
  * update
  Co-authored-by: Andy Grove <[email protected]>
* Minor: Clarify documentation on `NullBufferBuilder::allocated_size` (#7089)
  * Minor: Clarify documentation on NullBufferBuilder::allocated_size
  * add note about why allocations are 64 bytes
* Add more tests for edge cases
* Add negative test case for incorrectly formatted large dates

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Himadri Pal <[email protected]>
Co-authored-by: Arttu <[email protected]>
Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Co-authored-by: Jörn Horstmann <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: Curt Hagenlocher <[email protected]>
Co-authored-by: Devin Smith <[email protected]>
Co-authored-by: Corwin Joy <[email protected]>
Co-authored-by: Adam Reeve <[email protected]>
Co-authored-by: irenjj <[email protected]>
Co-authored-by: Jeffrey Vo <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
1 parent a85fc03 commit 2bce568

2 files changed, +68 -0 lines changed

arrow-cast/src/cast/mod.rs

Lines changed: 42 additions & 0 deletions
@@ -4229,6 +4229,48 @@ mod tests {
         }
     }
 
+    #[test]
+    fn test_cast_string_with_large_date_to_date32() {
+        let array = Arc::new(StringArray::from(vec![
+            Some("+10999-12-31"),
+            Some("-0010-02-28"),
+            Some("0010-02-28"),
+            Some("0000-01-01"),
+            Some("-0000-01-01"),
+            Some("-0001-01-01"),
+        ])) as ArrayRef;
+        let to_type = DataType::Date32;
+        let options = CastOptions {
+            safe: false,
+            format_options: FormatOptions::default(),
+        };
+        let b = cast_with_options(&array, &to_type, &options).unwrap();
+        let c = b.as_primitive::<Date32Type>();
+        assert_eq!(3298139, c.value(0)); // 10999-12-31
+        assert_eq!(-723122, c.value(1)); // -0010-02-28
+        assert_eq!(-715817, c.value(2)); // 0010-02-28
+        assert_eq!(c.value(3), c.value(4)); // Expect 0000-01-01 and -0000-01-01 to be parsed the same
+        assert_eq!(-719528, c.value(3)); // 0000-01-01
+        assert_eq!(-719528, c.value(4)); // -0000-01-01
+        assert_eq!(-719893, c.value(5)); // -0001-01-01
+    }
+
+    #[test]
+    fn test_cast_invalid_string_with_large_date_to_date32() {
+        // Large dates need to be prefixed with a + or - sign, otherwise they are not parsed correctly
+        let array = Arc::new(StringArray::from(vec![Some("10999-12-31")])) as ArrayRef;
+        let to_type = DataType::Date32;
+        let options = CastOptions {
+            safe: false,
+            format_options: FormatOptions::default(),
+        };
+        let err = cast_with_options(&array, &to_type, &options).unwrap_err();
+        assert_eq!(
+            err.to_string(),
+            "Cast error: Cannot cast string '10999-12-31' to value of Date32 type"
+        );
+    }
+
     #[test]
     fn test_cast_string_format_yyyymmdd_to_date32() {
         let a0 = Arc::new(StringViewArray::from(vec![
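
The expected values in the new test are Date32 values, i.e. days relative to 1970-01-01 in the proleptic Gregorian calendar. As a cross-check, the sketch below recomputes them with only the standard library. `days_from_civil` is a hypothetical helper written for illustration (a well-known civil-date-to-day-count formula), not a function from arrow-cast or chrono, and it assumes the month and day are already valid.

```rust
/// Hypothetical helper (not part of arrow-cast): days since 1970-01-01 for a
/// proleptic Gregorian date. Assumes `m` is in 1..=12 and `d` is valid for
/// that month; year 0 and negative years follow astronomical numbering.
fn days_from_civil(y: i64, m: u32, d: u32) -> i64 {
    // Treat January and February as the last months of the previous year,
    // so that the leap day falls at the end of the shifted year.
    let y = if m <= 2 { y - 1 } else { y };
    // A 400-year era always contains 146_097 days.
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = ((m + 9) % 12) as i64; // shifted month, March = 0
    let doy = (153 * mp + 2) / 5 + d as i64 - 1; // day of shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    // 719_468 is the number of days from 0000-03-01 to 1970-01-01.
    era * 146_097 + doe - 719_468
}
```

Calling `days_from_civil(10999, 12, 31)` gives 3298139 and `days_from_civil(0, 1, 1)` gives -719528, matching the assertions in `test_cast_string_with_large_date_to_date32`.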

arrow-cast/src/parse.rs

Lines changed: 26 additions & 0 deletions
@@ -595,6 +595,32 @@ const EPOCH_DAYS_FROM_CE: i32 = 719_163;
 const ERR_NANOSECONDS_NOT_SUPPORTED: &str = "The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804";
 
 fn parse_date(string: &str) -> Option<NaiveDate> {
+    // If the date has an extended (signed) year such as "+10999-12-31" or "-0012-05-06"
+    //
+    // According to [ISO 8601], years have:
+    //  Four digits or more for the year. Years in the range 0000 to 9999 will be pre-padded by
+    //  zero to ensure four digits. Years outside that range will have a prefixed positive or negative symbol.
+    //
+    // [ISO 8601]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE
+    if string.starts_with('+') || string.starts_with('-') {
+        // Skip the sign and look for the hyphen that terminates the year digits.
+        // According to ISO 8601 the unsigned part must be at least 4 digits.
+        let rest = &string[1..];
+        let hyphen = rest.find('-')?;
+        if hyphen < 4 {
+            return None;
+        }
+        // The year substring is the sign and the digits (but not the separator)
+        // e.g. for "+10999-12-31", hyphen is 5 and string[..6] is "+10999"
+        let year: i32 = string[..hyphen + 1].parse().ok()?;
+        // The remainder should begin with a '-' which we strip off, leaving the month-day part.
+        let remainder = string[hyphen + 1..].strip_prefix('-')?;
+        let mut parts = remainder.splitn(2, '-');
+        let month: u32 = parts.next()?.parse().ok()?;
+        let day: u32 = parts.next()?.parse().ok()?;
+        return NaiveDate::from_ymd_opt(year, month, day);
+    }
+
     if string.len() > 10 {
         // Try to parse as datetime and return just the date part
         return string_to_datetime(&Utc, string)
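
To make the index arithmetic in the signed-year branch easier to follow, here is a standalone sketch of the same splitting steps using only the standard library. `split_signed_date` is a hypothetical name introduced for illustration. Unlike `parse_date`, it returns a raw `(year, month, day)` tuple and does not check that the month and day form a real calendar date, which the actual change delegates to `NaiveDate::from_ymd_opt`.

```rust
/// Hypothetical helper (not part of arrow-cast): split a date with a signed,
/// extended year such as "+10999-12-31" into (year, month, day).
/// Returns None for unsigned input or when the year has fewer than 4 digits.
/// Month and day ranges are NOT validated here.
fn split_signed_date(s: &str) -> Option<(i32, u32, u32)> {
    if !(s.starts_with('+') || s.starts_with('-')) {
        return None;
    }
    // Search for the year-terminating hyphen after the one-byte sign,
    // so a leading '-' is not mistaken for the separator.
    let rest = &s[1..];
    let hyphen = rest.find('-')?;
    if hyphen < 4 {
        return None;
    }
    // `hyphen` is relative to `rest`, so the sign plus digits is s[..hyphen + 1].
    // i32's FromStr accepts a leading '+' or '-'.
    let year: i32 = s[..hyphen + 1].parse().ok()?;
    // Drop the separator, then split "MM-DD" at its single hyphen.
    let remainder = s[hyphen + 1..].strip_prefix('-')?;
    let (month, day) = remainder.split_once('-')?;
    Some((year, month.parse().ok()?, day.parse().ok()?))
}
```

With this sketch, `split_signed_date("+10999-12-31")` yields `Some((10999, 12, 31))`, while the unsigned `"10999-12-31"` and the short-year `"+999-01-01"` both yield `None`, mirroring the rejection exercised by the negative test.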
