Minor: Document parquet_metadata function (#8852)

alamb · comphead · web-flow · commit 1dcdcd431187 · 2024-01-14T11:35:39.000-08:00
* Document parquet_metadata function

Co-authored-by: comphead &lt;comphead@users.noreply.github.com&gt;

---------

Co-authored-by: comphead &lt;comphead@users.noreply.github.com&gt;
diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md
@@ -191,7 +191,7 @@ DataFusion CLI v16.0.0
 2 rows in set. Query took 0.007 seconds.
 ```
 
-## Creating external tables
+## Creating External Tables
 
 It is also possible to create a table backed by files by explicitly
 via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported
@@ -425,6 +425,13 @@ Available commands inside DataFusion CLI are:
 > \h function
 ```
 
+## Supported SQL
+
+In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also
+supports additional statements and commands:
+
+[sql supported in datafusion]: sql/index.rst
+
 - Show configuration options
 
 `SHOW ALL [VERBOSE]`
@@ -467,6 +474,66 @@ Available commands inside DataFusion CLI are:
 > SET datafusion.execution.batch_size to 1024;
 ```
 
+- `parquet_metadata` table function
+
+The `parquet_metadata` table function can be used to inspect detailed metadata
+about a parquet file such as statistics, sizes, and other information. This can
+be helpful to understand how parquet files are structured.
+
+For example, to see information about the `"WatchID"` column in the
+`hits.parquet` file, you can use:
+
+```sql
+SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
+FROM parquet_metadata('hits.parquet')
+WHERE path_in_schema = '"WatchID"'
+LIMIT 3;
+
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+3 rows in set. Query took 0.053 seconds.
+```
+
+The returned table has the following columns for each row for each column chunk
+in the file. Please refer to the [Parquet Documentation] for more information.
+
+[parquet documentation]: https://parquet.apache.org/
+
+| column_name             | data_type | Description                                                                                         |
+| ----------------------- | --------- | --------------------------------------------------------------------------------------------------- |
+| filename                | Utf8      | Name of the file                                                                                    |
+| row_group_id            | Int64     | Row group index the column chunk belongs to                                                         |
+| row_group_num_rows      | Int64     | Count of rows stored in the row group                                                               |
+| row_group_num_columns   | Int64     | Total number of columns in the row group (same for all row groups)                                  |
+| row_group_bytes         | Int64     | Number of bytes used to store the row group (not including metadata)                                |
+| column_id               | Int64     | ID of the column                                                                                    |
+| file_offset             | Int64     | Offset within the file that this column chunk's data begins                                         |
+| num_values              | Int64     | Total number of values in this column chunk                                                         |
+| path_in_schema          | Utf8      | "Path" (column name) of the column chunk in the schema                                              |
+| type                    | Utf8      | Parquet data type of the column chunk                                                               |
+| stats_min               | Utf8      | The minimum value for this column chunk, if stored in the statistics, cast to a string              |
+| stats_max               | Utf8      | The maximum value for this column chunk, if stored in the statistics, cast to a string              |
+| stats_null_count        | Int64     | Number of null values in this column chunk, if stored in the statistics                             |
+| stats_distinct_count    | Int64     | Number of distinct values in this column chunk, if stored in the statistics                         |
+| stats_min_value         | Utf8      | Same as `stats_min`                                                                                 |
+| stats_max_value         | Utf8      | Same as `stats_max`                                                                                 |
+| compression             | Utf8      | Block level compression (e.g. `SNAPPY`) used for this column chunk                                  |
+| encodings               | Utf8      | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk        |
+| index_page_offset       | Int64     | Offset in the file of the [`page index`], if any                                                    |
+| dictionary_page_offset  | Int64     | Offset in the file of the dictionary page, if any                                                   |
+| data_page_offset        | Int64     | Offset in the file of the first data page, if any                                                   |
+| total_compressed_size   | Int64     | Number of bytes the column chunk's data after encoding and compression (what is stored in the file) |
+| total_uncompressed_size | Int64     | Number of bytes the column chunk's data after encoding                                              |
+
++-------------------------+-----------+-------------+
+
+[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
+
 ## Changing Configuration Options
 
 All available configuration options can be seen using `SHOW ALL` as described above.