Skip to content

Commit

Permalink
Docs: Add section on pandas
Browse files Browse the repository at this point in the history
  • Loading branch information
Fokko committed Nov 10, 2023
1 parent 0c8b0b9 commit 66022a6
Showing 1 changed file with 40 additions and 1 deletion.
41 changes: 40 additions & 1 deletion mkdocs/docs/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -318,7 +318,7 @@ In this case it is up to the engine itself to filter the file itself. Below, `to
<!-- prettier-ignore-start -->

!!! note "Requirements"
This requires [PyArrow to be installed](index.md).
This requires [`pyarrow` to be installed](index.md).

<!-- prettier-ignore-end -->

Expand Down Expand Up @@ -346,6 +346,45 @@ tpep_dropoff_datetime: [[2021-04-01 00:47:59.000000,...,2021-05-01 00:14:47.0000

This will only pull in the files that that might contain matching rows.

### Pandas

<!-- prettier-ignore-start -->

!!! note "Requirements"
This requires [`pandas` to be installed](index.md).

<!-- prettier-ignore-end -->

PyIceberg makes it easy to filter out data from a huge table and pull it into a Pandas dataframe locally. This will only fetch Parquet files that that might contain matching data. This will reduce IO and therefore improve performance and reduce cost.

```python
table.scan(
row_filter="trip_distance >= 10.0",
selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
).to_pandas()
```

This will return a Pandas dataframe:

```
VendorID tpep_pickup_datetime tpep_dropoff_datetime
0 2 2021-04-01 00:28:05+00:00 2021-04-01 00:47:59+00:00
1 1 2021-04-01 00:39:01+00:00 2021-04-01 00:57:39+00:00
2 2 2021-04-01 00:14:42+00:00 2021-04-01 00:42:59+00:00
3 1 2021-04-01 00:17:17+00:00 2021-04-01 00:43:38+00:00
4 1 2021-04-01 00:24:04+00:00 2021-04-01 00:56:20+00:00
... ... ... ...
116976 2 2021-04-30 23:56:18+00:00 2021-05-01 00:29:13+00:00
116977 2 2021-04-30 23:07:41+00:00 2021-04-30 23:37:18+00:00
116978 2 2021-04-30 23:38:28+00:00 2021-05-01 00:12:04+00:00
116979 2 2021-04-30 23:33:00+00:00 2021-04-30 23:59:00+00:00
116980 2 2021-04-30 23:44:25+00:00 2021-05-01 00:14:47+00:00
[116981 rows x 3 columns]
```

It is recommended to use Pandas 2 or later, because it stores the data in an [Apache Arrow backend](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) which avoids copies of data.

### DuckDB

<!-- prettier-ignore-start -->
Expand Down

0 comments on commit 66022a6

Please sign in to comment.