Write support #41
@@ -175,6 +175,104 @@ static_table = StaticTable.from_metadata(

The static-table is considered read-only.

## Write support

With PyIceberg 0.6.0, write support is added through Arrow. Let's consider an Arrow Table:

```python
import pyarrow as pa

df = pa.Table.from_pylist(
    [
        {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
        {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
        {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)
```

Next, create a table based on the schema:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)

tbl = catalog.create_table("default.cities", schema=schema)
```

> **Review comment** (on `required=False`): Isn't …
>
> **Reply:** No, the default is the more strict `required=True`.

Now write the data to the table:

<!-- prettier-ignore-start -->

!!! note inline end "Fast append"
    PyIceberg defaults to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables quick writes, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a normal commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.

<!-- prettier-ignore-end -->

```python
tbl.append(df)

# or

tbl.overwrite(df)
```

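To see how the commit was made, the snapshot metadata can be inspected. The sketch below is not part of the original example; it assumes `current_snapshot()` and the snapshot `summary` behave as in PyIceberg 0.6.0.

```python
# Sketch (illustrative only): inspect the snapshot created by the append above.
snapshot = tbl.current_snapshot()
if snapshot is not None and snapshot.summary is not None:
    print(snapshot.summary.operation)  # the commit above is an append snapshot
```
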
The data is written to the table, and when the table is read using `tbl.scan().to_arrow()`:

```
pyarrow.Table
city: string
lat: double
long: double
----
city: [["Amsterdam","San Francisco","Drachten","Paris"]]
lat: [[52.371807,37.773972,53.11254,48.864716]]
long: [[4.896029,-122.431297,6.0989,2.349014]]
```

You can use either `append(df)` or `overwrite(df)` since there is no data yet. If we want to add more data, we can use `.append()` again:

```python
df = pa.Table.from_pylist(
    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}],
)

tbl.append(df)
```

When reading the table with `tbl.scan().to_arrow()`, you can see that `Groningen` is now also part of the table:

```
pyarrow.Table
city: string
lat: double
long: double
----
city: [["Amsterdam","San Francisco","Drachten","Paris"],["Groningen"]]
lat: [[52.371807,37.773972,53.11254,48.864716],[53.21917]]
long: [[4.896029,-122.431297,6.0989,2.349014],[6.56667]]
```

> **Review comment:** While working on this, I also checked the field-ids: …

The nested lists indicate the different Arrow buffers: the first write results in one buffer, and the second append in a separate buffer. This is expected, since the scan reads two Parquet files.

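A rough way to relate this to the underlying files from Python (a sketch; it assumes PyArrow's `num_chunks` on a column and PyIceberg's `plan_files()` on a scan behave as described here):

```python
# Sketch: relate the nested lists above to the underlying Parquet files.
arrow_table = tbl.scan().to_arrow()
print(arrow_table["city"].num_chunks)      # typically 2 here: one chunk per file read
print(len(list(tbl.scan().plan_files())))  # 2 data files are planned for the scan
```
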
<!-- prettier-ignore-start -->

!!! example "Under development"
    Writing using PyIceberg is still under development. Support for [partial overwrites](https://github.com/apache/iceberg-python/issues/268) and writing to [partitioned tables](https://github.com/apache/iceberg-python/issues/208) is planned and being worked on.

<!-- prettier-ignore-end -->

## Schema evolution

PyIceberg supports full schema evolution through the Python API. It takes care of setting the field-IDs and makes sure that only non-breaking changes are done (this can be overridden).

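For example, a minimal sketch of adding a column through `update_schema()` (assuming the `add_column` call and the `allow_incompatible_changes` flag of the current PyIceberg API):

```python
# Sketch: evolve the schema of the `cities` table from the write example above.
from pyiceberg.types import StringType

with tbl.update_schema() as update:
    update.add_column("country", StringType())  # PyIceberg assigns the new field-ID

# Breaking changes are rejected unless explicitly allowed, e.g.:
# with tbl.update_schema(allow_incompatible_changes=True) as update:
#     update.delete_column("long")
```
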
@@ -105,7 +105,11 @@
    OutputFile,
    OutputStream,
)
from pyiceberg.manifest import DataFile, FileFormat
from pyiceberg.manifest import (
    DataFile,
    DataFileContent,
    FileFormat,
)
from pyiceberg.schema import (
    PartnerAccessor,
    PreOrderSchemaVisitor,

@@ -119,8 +123,9 @@
    visit,
    visit_with_partner,
)
from pyiceberg.table import WriteTask
from pyiceberg.transforms import TruncateTransform
from pyiceberg.typedef import EMPTY_DICT, Properties
from pyiceberg.typedef import EMPTY_DICT, Properties, Record
from pyiceberg.types import (
    BinaryType,
    BooleanType,

@@ -1443,18 +1448,15 @@ def parquet_path_to_id_mapping(

def fill_parquet_file_metadata(
    df: DataFile,
    data_file: DataFile,
    parquet_metadata: pq.FileMetaData,
    file_size: int,
    stats_columns: Dict[int, StatisticsCollector],
    parquet_column_mapping: Dict[str, int],
) -> None:
    """
    Compute and fill the following fields of the DataFile object.

    - file_format
    - record_count
    - file_size_in_bytes
    - column_sizes
    - value_counts
    - null_value_counts

@@ -1464,11 +1466,8 @@ def fill_parquet_file_metadata(
    - split_offsets

    Args:
        df (DataFile): A DataFile object representing the Parquet file for which metadata is to be filled.
        data_file (DataFile): A DataFile object representing the Parquet file for which metadata is to be filled.
        parquet_metadata (pyarrow.parquet.FileMetaData): A pyarrow metadata object.
        file_size (int): The total compressed file size cannot be retrieved from the metadata and hence has to
            be passed here. Depending on the kind of file system and pyarrow library call used, different
            ways to obtain this value might be appropriate.
        stats_columns (Dict[int, StatisticsCollector]): The statistics gathering plan. It is required to
            set the mode for column metrics collection
    """

@@ -1565,13 +1564,56 @@ def fill_parquet_file_metadata(
            del upper_bounds[field_id]
            del null_value_counts[field_id]

    df.file_format = FileFormat.PARQUET
    df.record_count = parquet_metadata.num_rows
    df.file_size_in_bytes = file_size
    df.column_sizes = column_sizes
    df.value_counts = value_counts
    df.null_value_counts = null_value_counts
    df.nan_value_counts = nan_value_counts
    df.lower_bounds = lower_bounds
    df.upper_bounds = upper_bounds
    df.split_offsets = split_offsets
    data_file.record_count = parquet_metadata.num_rows
    data_file.column_sizes = column_sizes
    data_file.value_counts = value_counts
    data_file.null_value_counts = null_value_counts
    data_file.nan_value_counts = nan_value_counts
    data_file.lower_bounds = lower_bounds
    data_file.upper_bounds = upper_bounds
    data_file.split_offsets = split_offsets

def write_file(table: Table, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
    task = next(tasks)

    try:
        _ = next(tasks)
        # If there are more tasks, raise an exception
        raise NotImplementedError("Only unpartitioned writes are supported: https://github.com/apache/iceberg-python/issues/208")
    except StopIteration:
        pass

    file_path = f'{table.location()}/data/{task.generate_data_file_filename("parquet")}'
    file_schema = schema_to_pyarrow(table.schema())

> **Review comment:** Hi Fokko! I am working with @syun64 to test out the impending write feature. During the test, we realized the field-IDs are not being set in the written Parquet file. The field-IDs not written correctly in the Parquet file (current behavior) look like: … and the Parquet schema after using a different metadata key for the field-ID in the Arrow schema to write the Parquet file looks like: … We feel it is a peculiar issue with `pyarrow.parquet.ParquetWriter` where we need to define the field-IDs in the metadata of the `pyarrow.schema` conforming to a particular format like `"PARQUET:field_id"` instead of `"field_id"`.
>
> **Reply:** Thanks @jqin61 for testing this, as it is paramount that the field-IDs are written properly. I'm able to reproduce this locally: … After changing this to `"PARQUET:field_id"`: … Thanks for flagging this!
>
> **Review comment:** If I have … should I expect …?
>
> **Reply:** Hey @robtandy, thanks for chiming in here. I think the PyArrow-to-schema conversion should also include the field-id metadata. When you create a new table, it should re-assign the field-IDs if they are missing.

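The behaviour described in this thread can be reproduced with plain PyArrow; the sketch below is standalone and independent of the PR code. The key point is that the Parquet writer only picks up a field-ID stored under the `PARQUET:field_id` metadata key on the Arrow field.

```python
# Standalone sketch of the field-ID behaviour discussed above (plain PyArrow).
import pyarrow as pa
import pyarrow.parquet as pq

# Field-IDs must live in the field metadata under "PARQUET:field_id";
# a plain "field_id" key is ignored when writing Parquet.
schema = pa.schema(
    [
        pa.field("city", pa.string(), metadata={"PARQUET:field_id": "1"}),
        pa.field("lat", pa.float64(), metadata={"PARQUET:field_id": "2"}),
    ]
)
table = pa.table({"city": ["Amsterdam"], "lat": [52.371807]}, schema=schema)
pq.write_table(table, "/tmp/field_id_demo.parquet")

# Reading the schema back shows that the field-IDs survived the round trip.
print(pq.read_schema("/tmp/field_id_demo.parquet").field("city").metadata)
# {b'PARQUET:field_id': b'1'}
```
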
    collected_metrics: List[pq.FileMetaData] = []
    fo = table.io.new_output(file_path)
    with fo.create(overwrite=True) as fos:
        with pq.ParquetWriter(fos, schema=file_schema, version="1.0", metadata_collector=collected_metrics) as writer:
            writer.write_table(task.df)

    data_file = DataFile(
        content=DataFileContent.DATA,
        file_path=file_path,
        file_format=FileFormat.PARQUET,
        partition=Record(),
        file_size_in_bytes=len(fo),
        sort_order_id=task.sort_order_id,
        # Just copy these from the table for now
        spec_id=table.spec().spec_id,
        equality_ids=None,
        key_metadata=None,
    )

> **Review comment** (on `file_size_in_bytes=len(fo)`): This should also come from the write if possible so we don't have an S3 request here.
>
> **Review comment** (on `spec_id=table.spec().spec_id`): Since this is an unpartitioned write, we need to ensure that this is the unpartitioned spec in the table.
>
> **Reply:** We check if the partition spec is empty: …
>
> **Review comment:** Since … Wouldn't it be easy to just pass the spec ID and partition tuple (an empty …)?

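For reference, a minimal sketch of the kind of guard that reply refers to; the exact placement and wording in the PR may differ.

```python
# Hypothetical sketch of the "partition spec must be empty" check mentioned above;
# the actual code in the PR may look different.
if len(table.spec().fields) > 0:
    raise ValueError("Cannot write to partitioned tables")
```
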
    if len(collected_metrics) != 1:
        # One file has been written
        raise ValueError(f"Expected 1 entry, got: {collected_metrics}")

    fill_parquet_file_metadata(
        data_file=data_file,
        parquet_metadata=collected_metrics[0],
        stats_columns=compute_statistics_plan(table.schema(), table.properties),
        parquet_column_mapping=parquet_path_to_id_mapping(table.schema()),
    )
    return iter([data_file])

> **Review comment:** Thanks for this example! Made it really easy to test out.
>
> The example works great cut & pasted into a REPL. I also tested modifications to the dataframe schema passed to append and it does the right thing. I get a schema error for a few cases:
>
> - `long` as `string` instead of `double`
> - `country`
>
> Looks like Arrow requires that the schema matches, which is great.
>
> It would be nice to allow some type promotion in the future. I'm not sure whether Arrow would automatically write floats into double columns, for example. I would also like to make sure we have better error messages, not just "ValueError: Table schema does not match schema used to create file: ...". Those will be good follow-ups.

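For illustration, a hedged sketch of one of the mismatches described above, reusing `tbl` and `pa` from the documentation example; the exact error text may differ.

```python
# Sketch: append a frame whose `long` column is a string instead of a double.
df_bad = pa.Table.from_pylist(
    [{"city": "Utrecht", "lat": 52.0907, "long": "5.1214"}],  # `long` as a string
)

try:
    tbl.append(df_bad)
except ValueError as err:
    # e.g. "Table schema does not match schema used to create file: ..."
    print(err)
```
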
> **Reply:** Yes, I think this ties into the work that @syun64 is doing where we have to make sure that we map the fields correctly, and then I think we can add options to massage the Arrow schema into the Iceberg one (which should be leading).
>
> We can create a `visitorWithPartner` that will see if the promotions are possible. One that comes to mind directly is checking if there are any nulls: Arrow marks the schemas as nullable by default, even when there are no nulls.

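The nullability point is easy to see with plain PyArrow (a small standalone sketch):

```python
# Standalone sketch: PyArrow marks fields as nullable by default,
# even when the data contains no nulls at all.
import pyarrow as pa

df = pa.Table.from_pylist([{"city": "Amsterdam", "lat": 52.371807}])
print(df.schema.field("city").nullable)  # True, although...
print(df["city"].null_count)             # ...0: there are no actual nulls
```
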