Allow for reading improperly encoded UINT_8 and UINT_16 Parquet data #7055

Draft: wants to merge 1 commit into base: main
27 changes: 24 additions & 3 deletions parquet/src/arrow/array_reader/primitive_array.rs
@@ -23,11 +23,10 @@ use crate::column::page::PageIterator;
use crate::data_type::{DataType, Int96};
use crate::errors::{ParquetError, Result};
use crate::schema::types::ColumnDescPtr;
-use arrow_array::Decimal256Array;
use arrow_array::{
builder::TimestampNanosecondBufferBuilder, ArrayRef, BooleanArray, Decimal128Array,
-Float32Array, Float64Array, Int32Array, Int64Array, TimestampNanosecondArray, UInt32Array,
-UInt64Array,
+Decimal256Array, Float32Array, Float64Array, Int32Array, Int64Array, TimestampNanosecondArray,
+UInt16Array, UInt32Array, UInt64Array, UInt8Array,
};
use arrow_buffer::{i256, BooleanBuffer, Buffer};
use arrow_data::ArrayDataBuilder;
Expand Down Expand Up @@ -210,7 +209,29 @@ where
// These are:
// - date64: cast int32 to date32, then date32 to date64.
// - decimal: cast int32 to decimal, int64 to decimal
+//
+// Some Parquet writers do not properly write UINT_8 and UINT_16 types
+// (they will emit a negative 32-bit integer in some cases). To handle
+// these incorrect files, we need to do some explicit casting to unsigned,
+// rather than relying on num::cast::cast as used by arrow-cast (it will not
+// cast a negative INT32 to UINT8 or UINT16).
let array = match target_type {
+ArrowType::UInt8 if *(array.data_type()) == ArrowType::Int32 => {
+let array = array
+.as_any()
+.downcast_ref::<Int32Array>()
+.unwrap()
+.unary(|i| i as u8) as UInt8Array;
+Arc::new(array) as ArrayRef
+}
+ArrowType::UInt16 if *(array.data_type()) == ArrowType::Int32 => {
+let array = array
+.as_any()
+.downcast_ref::<Int32Array>()
+.unwrap()
+.unary(|i| i as u16) as UInt16Array;
+Arc::new(array) as ArrayRef
+}
ArrowType::Date64 if *(array.data_type()) == ArrowType::Int32 => {
// this is cheap as it internally reinterprets the data
let a = arrow_cast::cast(&array, &ArrowType::Date32)?;
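
The difference between the two casting strategies named in the added comment can be sketched with the standard library alone. This is an illustration only, not code from the PR: the helper names are hypothetical, `u8::try_from` stands in for the checked behaviour the comment attributes to `num::cast::cast`, and the sample values assume the faulty writer sign-extended the unsigned value (the comment says only that a negative 32-bit integer is emitted "in some cases").

```rust
// Hypothetical helpers for illustration; none of these names appear in the PR.

/// Checked conversion: yields None when the value does not fit in a u8.
/// Used here as a std-only stand-in for a checked numeric cast.
fn checked_to_u8(v: i32) -> Option<u8> {
    u8::try_from(v).ok()
}

/// Truncating conversion, the same `as u8` used in the diff's closure:
/// keeps the low 8 bits of the i32.
fn wrapping_to_u8(v: i32) -> u8 {
    v as u8
}

/// Truncating conversion to u16: keeps the low 16 bits of the i32.
fn wrapping_to_u16(v: i32) -> u16 {
    v as u16
}

fn main() {
    // Assumption: the writer stored 200 (a valid UINT_8) as a sign-extended i8.
    let bad_u8: i32 = 200u8 as i8 as i32; // -56
    // Assumption: the writer stored 60000 (a valid UINT_16) as a sign-extended i16.
    let bad_u16: i32 = 60000u16 as i16 as i32; // -5536

    // A checked cast rejects the negative value, so the original is lost.
    assert_eq!(checked_to_u8(bad_u8), None);

    // A truncating cast recovers the value the writer meant to store.
    assert_eq!(wrapping_to_u8(bad_u8), 200);
    assert_eq!(wrapping_to_u16(bad_u16), 60000);

    // Correctly encoded values pass through unchanged.
    assert_eq!(wrapping_to_u8(42), 42);

    println!("u8: {} -> {}", bad_u8, wrapping_to_u8(bad_u8));
    println!("u16: {} -> {}", bad_u16, wrapping_to_u16(bad_u16));
}
```

The trade-off is that a truncating cast never reports a value that is genuinely out of range; it silently keeps the low bits.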