Commit 1dcdcd4

alamb and comphead authored
Minor: Document parquet_metadata function (#8852)
* Document parquet_metadata function

Co-authored-by: comphead <[email protected]>

---------

Co-authored-by: comphead <[email protected]>
1 parent a461c33 commit 1dcdcd4

1 file changed

+68
-1
lines changed

docs/source/user-guide/cli.md

Lines changed: 68 additions & 1 deletion
@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
191191
2 rows in set. Query took 0.007 seconds.
192192
```
193193
194-
## Creating external tables
194+
## Creating External Tables
195195
196196
It is also possible to create a table backed by files by explicitly
197197
via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported
@@ -425,6 +425,13 @@ Available commands inside DataFusion CLI are:
425425
> \h function
426426
```
427427
428+
## Supported SQL
429+
430+
In addition to the normal [SQL supported in DataFusion], `datafusion-cli`
431+
supports the following statements and commands:
432+
433+
[sql supported in datafusion]: sql/index.rst
434+
428435
- Show configuration options
429436
430437
`SHOW ALL [VERBOSE]`
@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
467474
> SET datafusion.execution.batch_size to 1024;
468475
```
469476
477+
- `parquet_metadata` table function
478+
479+
The `parquet_metadata` table function can be used to inspect detailed metadata
480+
about a parquet file such as statistics, sizes, and other information. This can
481+
be helpful to understand how parquet files are structured.
482+
483+
For example, to see information about the `"WatchID"` column in the
484+
`hits.parquet` file, you can use:
485+
486+
```sql
487+
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
488+
FROM parquet_metadata('hits.parquet')
489+
WHERE path_in_schema = '"WatchID"'
490+
LIMIT 3;
491+
492+
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
493+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
494+
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
495+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
496+
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
497+
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
498+
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
499+
3 rows in set. Query took 0.053 seconds.
500+
```
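Because the function returns one row per column chunk, each row group appears once for every column in the file. As a minimal sketch, assuming the same `hits.parquet` file as above, the following query collapses those repeats to list each row group once with its row count:

```sql
-- One row per row group: DISTINCT removes the per-column repeats
SELECT DISTINCT row_group_id, row_group_num_rows
FROM parquet_metadata('hits.parquet')
ORDER BY row_group_id;
```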
501+
502+
The returned table has one row for each column chunk
503+
in the file, with the following columns. Please refer to the [Parquet Documentation] for more information.
504+
505+
[parquet documentation]: https://parquet.apache.org/
506+
507+
| column_name | data_type | Description |
508+
| ----------------------- | --------- | --------------------------------------------------------------------------------------------------- |
509+
| filename | Utf8 | Name of the file |
510+
| row_group_id | Int64 | Row group index the column chunk belongs to |
511+
| row_group_num_rows | Int64 | Count of rows stored in the row group |
512+
| row_group_num_columns | Int64 | Total number of columns in the row group (same for all row groups) |
513+
| row_group_bytes | Int64 | Number of bytes used to store the row group (not including metadata) |
514+
| column_id | Int64 | ID of the column |
515+
| file_offset | Int64 | Offset within the file at which this column chunk's data begins |
516+
| num_values | Int64 | Total number of values in this column chunk |
517+
| path_in_schema | Utf8 | "Path" (column name) of the column chunk in the schema |
518+
| type | Utf8 | Parquet data type of the column chunk |
519+
| stats_min | Utf8 | The minimum value for this column chunk, if stored in the statistics, cast to a string |
520+
| stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string |
521+
| stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics |
522+
| stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics |
523+
| stats_min_value | Utf8 | Same as `stats_min` |
524+
| stats_max_value | Utf8 | Same as `stats_max` |
525+
| compression | Utf8 | Block level compression (e.g. `SNAPPY`) used for this column chunk |
526+
| encodings | Utf8 | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk |
527+
| index_page_offset | Int64 | Offset in the file of the [`page index`], if any |
528+
| dictionary_page_offset | Int64 | Offset in the file of the dictionary page, if any |
529+
| data_page_offset | Int64 | Offset in the file of the first data page, if any |
530+
| total_compressed_size | Int64 | Number of bytes of the column chunk's data after encoding and compression (what is stored in the file) |
531+
| total_uncompressed_size | Int64 | Number of bytes of the column chunk's data after encoding (before compression) |
532+
534+
535+
[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
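
The columns described above can also be aggregated. The following is a sketch only, assuming the same `hits.parquet` file used in the earlier example; it sums the size columns over all row groups to show which columns take the most space in the file:

```sql
-- Per-column storage footprint, summed over all row groups
SELECT
  path_in_schema,
  COUNT(*) AS column_chunks,                         -- one chunk per row group
  SUM(total_compressed_size) AS compressed_bytes,    -- bytes stored in the file
  SUM(total_uncompressed_size) AS uncompressed_bytes -- bytes before compression
FROM parquet_metadata('hits.parquet')
GROUP BY path_in_schema
ORDER BY compressed_bytes DESC
LIMIT 5;
```

Comparing `compressed_bytes` with `uncompressed_bytes` gives a rough sense of how well each column compresses.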
536+
470537
## Changing Configuration Options
471538
472539
All available configuration options can be seen using `SHOW ALL` as described above.
