Commit

docs: Add sqlcatalog and local fs warehouse (#361)
* add sqlcatalog and local fs warehouse

* make lint

* Apply suggestions from code review

Co-authored-by: Fokko Driesprong
kevinjqliu and Fokko authored Feb 4, 2024
1 parent a4856bc commit fa15877
Showing 1 changed file with 32 additions and 3 deletions.
35 changes: 32 additions & 3 deletions mkdocs/docs/index.md
@@ -62,6 +62,29 @@ You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fe

Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.

For the sake of demonstration, we'll configure the catalog to use the `SqlCatalog` implementation, which will store information in a local `sqlite` database. We'll also configure the catalog to store data files in the local filesystem instead of an object store. This setup should not be used in production because of its limited scalability.

Create a temporary location for Iceberg:

```shell
mkdir /tmp/warehouse
```

Open a Python 3 REPL to set up the catalog:

```python
from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
```
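
As a rough sketch, the two properties passed to `SqlCatalog` above can be derived from a single directory path. The helper name `local_catalog_properties` is hypothetical and not part of PyIceberg; it only uses the standard library and assumes a POSIX-style absolute path such as `/tmp/warehouse`.

```python
from pathlib import Path


def local_catalog_properties(warehouse_dir):
    """Build the `uri` and `warehouse` properties from one directory path.

    Hypothetical helper for illustration only; not part of PyIceberg.
    """
    path = Path(warehouse_dir).resolve()
    # Same effect as `mkdir -p`: does not fail if the directory already exists.
    path.mkdir(parents=True, exist_ok=True)
    return {
        # SQLite URL in the same form as the snippet above; with an absolute
        # POSIX path this yields four slashes after "sqlite:".
        "uri": f"sqlite:///{path}/pyiceberg_catalog.db",
        # Local filesystem location for data and metadata files.
        "warehouse": f"file://{path}",
    }
```

With this in place, `SqlCatalog("default", **local_catalog_properties("/tmp/warehouse"))` would mirror the call shown above.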

## Write a PyArrow dataframe

Let's take the Taxi dataset, and write this to an Iceberg table.
@@ -83,9 +106,7 @@ df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
Create a new Iceberg table:

```python
-from pyiceberg.catalog import load_catalog
-
-catalog = load_catalog("default")
+catalog.create_namespace("default")

table = catalog.create_table(
    "default.taxi_dataset",
@@ -158,6 +179,14 @@ df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
len(df)
```

### Explore Iceberg data and metadata files

Since the catalog was configured to use the local filesystem, we can explore how Iceberg saved data and metadata files from the above operations.

```shell
find /tmp/warehouse/
```
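
The same exploration can be sketched in Python by counting files per extension under the warehouse directory. The function name `summarize_warehouse` is hypothetical, and which extensions appear is an assumption about a typical layout (for example Parquet data files, JSON table metadata, Avro manifests, and the SQLite database file); the actual contents depend on the operations that were run.

```python
from collections import Counter
from pathlib import Path


def summarize_warehouse(root):
    """Count regular files under `root`, grouped by file extension.

    Hypothetical helper for illustration only; not part of PyIceberg.
    """
    counts = Counter()
    root = Path(root)
    if not root.is_dir():
        # Nothing to report if the warehouse directory does not exist.
        return counts
    for path in root.rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under a readable label.
            counts[path.suffix or "(no extension)"] += 1
    return counts
```

For instance, `sorted(summarize_warehouse("/tmp/warehouse").items())` gives a compact per-extension overview to compare against the `find` output.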

## More details

For the details, please check the [CLI](cli.md) or [Python API](api.md) page.
