
Commit 2c6c01c

alamb and tustvold authored
Document Arrow <--> Parquet schema conversion better (#7479)
* Document Arrow <--> Parquet schema conversion better
* Add a note about arrow metadata convention
* lint
* Fix links
* clarify what happens with provided schema
* More docs
* fmt
* more clarification
* More clarifications
* fmt
* Update parquet/src/arrow/arrow_reader/mod.rs

  Co-authored-by: Raphael Taylor-Davies <[email protected]>
* Update parquet/src/arrow/mod.rs

  Co-authored-by: Raphael Taylor-Davies <[email protected]>
* tweaks
* capitalization OCD
* capitalization OCD

---------

Co-authored-by: Raphael Taylor-Davies <[email protected]>
1 parent 9e91ef4 commit 2c6c01c

File tree (3 files changed, +86 -22 lines changed)


parquet/src/arrow/arrow_reader/mod.rs

+23-10
@@ -286,7 +286,10 @@ impl<T> ArrowReaderBuilder<T> {
 pub struct ArrowReaderOptions {
     /// Should the reader strip any user defined metadata from the Arrow schema
     skip_arrow_metadata: bool,
-    /// If provided used as the schema for the file, otherwise the schema is read from the file
+    /// If provided used as the schema hint when determining the Arrow schema,
+    /// otherwise the schema hint is read from the [ARROW_SCHEMA_META_KEY]
+    ///
+    /// [ARROW_SCHEMA_META_KEY]: crate::arrow::ARROW_SCHEMA_META_KEY
     supplied_schema: Option<SchemaRef>,
     /// If true, attempt to read `OffsetIndex` and `ColumnIndex`
     pub(crate) page_index: bool,
@@ -314,17 +317,27 @@ impl ArrowReaderOptions {
         }
     }

-    /// Provide a schema to use when reading the parquet file. If provided it
-    /// takes precedence over the schema inferred from the file or the schema defined
-    /// in the file's metadata. If the schema is not compatible with the file's
-    /// schema an error will be returned when constructing the builder.
+    /// Provide a schema hint to use when reading the Parquet file.
+    ///
+    /// If provided, this schema takes precedence over any arrow schema embedded
+    /// in the metadata (see the [`arrow`] documentation for more details).
+    ///
+    /// If the provided schema is not compatible with the data stored in the
+    /// parquet file schema, an error will be returned when constructing the
+    /// builder.
     ///
-    /// This option is only required if you want to cast columns to a different type.
-    /// For example, if you wanted to cast from an Int64 in the Parquet file to a Timestamp
-    /// in the Arrow schema.
+    /// This option is only required if you want to explicitly control the
+    /// conversion of Parquet types to Arrow types, such as casting a column to
+    /// a different type. For example, if you wanted to read an Int64 in
+    /// a Parquet file to a [`TimestampMicrosecondArray`] in the Arrow schema.
+    ///
+    /// [`arrow`]: crate::arrow
+    /// [`TimestampMicrosecondArray`]: arrow_array::TimestampMicrosecondArray
+    ///
+    /// # Notes
     ///
-    /// The supplied schema must have the same number of columns as the parquet schema and
-    /// the column names need to be the same.
+    /// The provided schema must have the same number of columns as the parquet schema and
+    /// the column names must be the same.
     ///
     /// # Example
     /// ```

parquet/src/arrow/mod.rs

+46-10
@@ -15,13 +15,48 @@
 // specific language governing permissions and limitations
 // under the License.

-//! API for reading/writing
-//! Arrow [RecordBatch](arrow_array::RecordBatch)es and
-//! [Array](arrow_array::Array)s to/from Parquet Files.
+//! API for reading/writing Arrow [`RecordBatch`]es and [`Array`]s to/from
+//! Parquet Files.
 //!
-//! See the [crate-level documentation](crate) for more details.
+//! See the [crate-level documentation](crate) for more details on other APIs
 //!
-//! # Example of writing Arrow record batch to Parquet file
+//! # Schema Conversion
+//!
+//! These APIs ensure that data in Arrow [`RecordBatch`]es written to Parquet are
+//! read back as [`RecordBatch`]es with the exact same types and values.
+//!
+//! Parquet and Arrow have different type systems, and there is not
+//! always a one to one mapping between the systems. For example, data
+//! stored as a Parquet [`BYTE_ARRAY`] can be read as either an Arrow
+//! [`BinaryViewArray`] or [`BinaryArray`].
+//!
+//! To recover the original Arrow types, the writers in this module add a "hint" to
+//! the metadata in the [`ARROW_SCHEMA_META_KEY`] key which records the original Arrow
+//! schema. The metadata hint follows the same convention as arrow-cpp based
+//! implementations such as `pyarrow`. The reader looks for the schema hint in the
+//! metadata to determine Arrow types, and if it is not present, infers the Arrow schema
+//! from the Parquet schema.
+//!
+//! In situations where the embedded Arrow schema is not compatible with the Parquet
+//! schema, the Parquet schema takes precedence and no error is raised.
+//! See [#1663](https://github.com/apache/arrow-rs/issues/1663)
+//!
+//! You can also control the type conversion process in more detail using:
+//!
+//! * [`ArrowSchemaConverter`] control the conversion of Arrow types to Parquet
+//!   types.
+//!
+//! * [`ArrowReaderOptions::with_schema`] to explicitly specify your own Arrow schema hint
+//!   to use when reading Parquet, overriding any metadata that may be present.
+//!
+//! [`RecordBatch`]: arrow_array::RecordBatch
+//! [`Array`]: arrow_array::Array
+//! [`BYTE_ARRAY`]: crate::basic::Type::BYTE_ARRAY
+//! [`BinaryViewArray`]: arrow_array::BinaryViewArray
+//! [`BinaryArray`]: arrow_array::BinaryArray
+//! [`ArrowReaderOptions::with_schema`]: arrow_reader::ArrowReaderOptions::with_schema
+//!
+//! # Example: Writing Arrow `RecordBatch` to Parquet file
 //!
 //!```rust
 //! # use arrow_array::{Int32Array, ArrayRef};
@@ -53,7 +88,7 @@
 //! writer.close().unwrap();
 //! ```
 //!
-//! # Example of reading parquet file into arrow record batch
+//! # Example: Reading Parquet file into Arrow `RecordBatch`
 //!
 //! ```rust
 //! # use std::fs::File;
@@ -93,11 +128,10 @@
 //! println!("Read {} records.", record_batch.num_rows());
 //! ```
 //!
-//! # Example of reading non-uniformly encrypted parquet file into arrow record batch
+//! # Example: Reading non-uniformly encrypted parquet file into arrow record batch
 //!
 //! Note: This requires the experimental `encryption` feature to be enabled at compile time.
 //!
-//!
 #![cfg_attr(feature = "encryption", doc = "```rust")]
 #![cfg_attr(not(feature = "encryption"), doc = "```ignore")]
 //! # use arrow_array::{Int32Array, ArrayRef};
@@ -168,7 +202,6 @@ pub use self::async_reader::ParquetRecordBatchStreamBuilder;
 pub use self::async_writer::AsyncArrowWriter;
 use crate::schema::types::{SchemaDescriptor, Type};
 use arrow_schema::{FieldRef, Schema};
-
 // continue to export deprecated methods until they are removed
 #[allow(deprecated)]
 pub use self::schema::arrow_to_parquet_schema;
@@ -178,7 +211,10 @@ pub use self::schema::{
     parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns, ArrowSchemaConverter, FieldLevels,
 };

-/// Schema metadata key used to store serialized Arrow IPC schema
+/// Schema metadata key used to store serialized Arrow schema
+///
+/// The Arrow schema is encoded using the Arrow IPC format, and then base64
+/// encoded. This is the same format used by arrow-cpp systems, such as pyarrow.
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

 /// The value of this metadata key, if present on [`Field::metadata`], will be used

parquet/src/arrow/schema/mod.rs

+17-2
@@ -105,6 +105,12 @@ pub struct FieldLevels {
 ///
 /// Columns not included within [`ProjectionMask`] will be ignored.
 ///
+/// The optional `hint` parameter is the desired Arrow schema. See the
+/// [`arrow`] module documentation for more information.
+///
+/// [`arrow`]: crate::arrow
+///
+/// # Notes:
 /// Where a field type in `hint` is compatible with the corresponding parquet type in `schema`, it
 /// will be used, otherwise the default arrow type for the given parquet column type will be used.
 ///
@@ -192,8 +198,12 @@ pub fn encode_arrow_schema(schema: &Schema) -> String {
     BASE64_STANDARD.encode(&len_prefix_schema)
 }

-/// Mutates writer metadata by storing the encoded Arrow schema.
+/// Mutates writer metadata by storing the encoded Arrow schema hint in
+/// [`ARROW_SCHEMA_META_KEY`].
+///
 /// If there is an existing Arrow schema metadata, it is replaced.
+///
+/// [`ARROW_SCHEMA_META_KEY`]: crate::arrow::ARROW_SCHEMA_META_KEY
 pub fn add_encoded_arrow_schema_to_metadata(schema: &Schema, props: &mut WriterProperties) {
     let encoded = encode_arrow_schema(schema);

@@ -224,7 +234,12 @@ pub fn add_encoded_arrow_schema_to_metadata(schema: &Schema, props: &mut WriterP

 /// Converter for Arrow schema to Parquet schema
 ///
-/// Example:
+/// See the documentation on the [`arrow`] module for background
+/// information on how Arrow schema is represented in Parquet.
+///
+/// [`arrow`]: crate::arrow
+///
+/// # Example:
 /// ```
 /// # use std::sync::Arc;
 /// # use arrow_schema::{Field, Schema, DataType};

0 commit comments