@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
2 rows in set. Query took 0.007 seconds.
```

- ## Creating external tables
+ ## Creating External Tables

It is also possible to create a table backed by files explicitly
via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported
@@ -425,6 +425,13 @@ Available commands inside DataFusion CLI are:
> \h function
```

+ ## Supported SQL
+
+ In addition to the normal [SQL supported in DataFusion], `datafusion-cli`
+ supports the following statements and commands:
+
+ [sql supported in datafusion]: sql/index.rst
+

- Show configuration options

`SHOW ALL [VERBOSE]`
@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
> SET datafusion.execution.batch_size to 1024;
```

+ - `parquet_metadata` table function
+
+ The `parquet_metadata` table function can be used to inspect detailed metadata
+ about a parquet file such as statistics, sizes, and other information. This can
+ be helpful to understand how parquet files are structured.
+
+ For example, to see information about the `"WatchID"` column in the
+ `hits.parquet` file, you can use:
+
+ ```sql
+ SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
+ FROM parquet_metadata('hits.parquet')
+ WHERE path_in_schema = '"WatchID"'
+ LIMIT 3;
+
+ +----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+ | path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+ +----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+ | "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+ | "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+ | "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+ +----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+ 3 rows in set. Query took 0.053 seconds.
+ ```
+
+ The returned table has one row for each column chunk in the file, with the
+ following columns. Please refer to the [Parquet Documentation] for more information.
+
+ [parquet documentation]: https://parquet.apache.org/
+
+ | column_name | data_type | Description |
+ | ----------------------- | --------- | ----------- |
+ | filename | Utf8 | Name of the file |
+ | row_group_id | Int64 | Row group index the column chunk belongs to |
+ | row_group_num_rows | Int64 | Count of rows stored in the row group |
+ | row_group_num_columns | Int64 | Total number of columns in the row group (same for all row groups) |
+ | row_group_bytes | Int64 | Number of bytes used to store the row group (not including metadata) |
+ | column_id | Int64 | ID of the column |
+ | file_offset | Int64 | Offset within the file at which this column chunk's data begins |
+ | num_values | Int64 | Total number of values in this column chunk |
+ | path_in_schema | Utf8 | "Path" (column name) of the column chunk in the schema |
+ | type | Utf8 | Parquet data type of the column chunk |
+ | stats_min | Utf8 | The minimum value for this column chunk, if stored in the statistics, cast to a string |
+ | stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string |
+ | stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics |
+ | stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics |
+ | stats_min_value | Utf8 | Same as `stats_min` |
+ | stats_max_value | Utf8 | Same as `stats_max` |
+ | compression | Utf8 | Block level compression (e.g. `SNAPPY`) used for this column chunk |
+ | encodings | Utf8 | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk |
+ | index_page_offset | Int64 | Offset in the file of the [`page index`], if any |
+ | dictionary_page_offset | Int64 | Offset in the file of the dictionary page, if any |
+ | data_page_offset | Int64 | Offset in the file of the first data page, if any |
+ | total_compressed_size | Int64 | Size in bytes of the column chunk's data after encoding and compression (what is stored in the file) |
+ | total_uncompressed_size | Int64 | Size in bytes of the column chunk's data after encoding, before compression |
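As a further illustration of the columns listed above, the following sketch sums the compressed and uncompressed sizes of each column across all row groups, which can hint at which columns dominate the file. It assumes the same `hits.parquet` file as the earlier example; the aliases `compressed_bytes` and `uncompressed_bytes` are names chosen here for illustration only.

```sql
-- Hypothetical follow-up query: per-column storage footprint.
-- Each input row is one column chunk, so grouping by path_in_schema
-- adds up the chunks of a column across every row group.
SELECT path_in_schema,
       SUM(total_compressed_size)   AS compressed_bytes,
       SUM(total_uncompressed_size) AS uncompressed_bytes
FROM parquet_metadata('hits.parquet')
GROUP BY path_in_schema
ORDER BY compressed_bytes DESC
LIMIT 5;
```

The actual numbers depend on how the file was written, so no output is shown here.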
+
+ [`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
+

## Changing Configuration Options

All available configuration options can be seen using `SHOW ALL` as described above.
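
A minimal sketch of inspecting and then changing one option in a `datafusion-cli` session might look like the following. It reuses the `SHOW ALL` and `SET` statements mentioned in this guide and assumes that `SHOW <option>` for a single setting is available in your version; the value `1024` is only an example.

```sql
-- List every configuration option and its current value
SHOW ALL;

-- Look at a single option (assumed to be supported by your version)
SHOW datafusion.execution.batch_size;

-- Change the option for the current session, then check it again
SET datafusion.execution.batch_size TO 1024;
SHOW datafusion.execution.batch_size;
```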