Disable Spark Catalog caching for integration tests #501
Thanks @kevinjqliu, I think this change makes sense. I don't think there's ever a reason on the Python side to have Spark caching enabled. On the Iceberg Java side we already have tests that validate the caching catalog behavior when it's enabled/disabled, so we don't need to test that through PyIceberg (I think). I've triggered CI; if it passes, I'll go ahead and merge.
tests/integration/test_writes.py
@@ -355,6 +355,26 @@ def test_data_files(spark: SparkSession, session_catalog: Catalog, arrow_table_w
    assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]


@pytest.mark.integration
def test_multiple_spark_sql_queries(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
Nit: I think a better name would be `test_python_writes_with_spark_snapshot_reads` or something more specific than what it currently is. It's more verbose, but I think it captures the goal of the test better.
Makes sense, added!
Great idea @kevinjqliu! Thanks for adding this.
Sweet, thanks @kevinjqliu! I'm going to go ahead and merge this now.
Fix #482
While working on #444, I couldn't get the integration tests working because using `spark.sql` to count the number of data files returns the wrong result. It gets more bizarre: Python's Iceberg table state is not the same as Spark's Iceberg table state.

This PR adds a simple test to verify that the snapshot ID PyIceberg sees for a table is the same as the snapshot ID Spark sees for it.
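A minimal sketch of that kind of check, not the exact test added in this PR. It assumes `spark` behaves like a `SparkSession`, `tbl` behaves like a PyIceberg `Table`, and the table is reachable from Spark under `identifier`; the helper names are hypothetical. It reads the newest row of Iceberg's `snapshots` metadata table through Spark SQL and compares it with PyIceberg's current snapshot.

```python
from typing import Any


def spark_current_snapshot_id(spark: Any, identifier: str) -> int:
    """Return the latest snapshot id that Spark sees for an Iceberg table.

    `spark` is assumed to be a SparkSession (anything with a compatible
    `.sql(...).collect()` works); `identifier` is e.g. "default.my_table".
    """
    rows = spark.sql(
        f"SELECT snapshot_id FROM {identifier}.snapshots "
        "ORDER BY committed_at DESC LIMIT 1"
    ).collect()
    return rows[0]["snapshot_id"]


def assert_same_snapshot(spark: Any, tbl: Any, identifier: str) -> None:
    """Fail if PyIceberg and Spark disagree on the table's current snapshot.

    `tbl` is assumed to be a PyIceberg Table with at least one snapshot.
    """
    python_id = tbl.current_snapshot().snapshot_id
    spark_id = spark_current_snapshot_id(spark, identifier)
    assert python_id == spark_id, (
        f"PyIceberg sees snapshot {python_id}, Spark sees {spark_id}"
    )
```

With catalog caching enabled, a check like this would be expected to fail after a Python-side write, because Spark keeps answering from the metadata it loaded earlier.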
The culprit is the `spark.sql.catalog.catalog-name.cache-enabled` setting, which defaults to `True` and caches table metadata. Spark SQL calls will use the cached Iceberg metadata instead of the updated one.
https://iceberg.apache.org/docs/latest/spark-configuration/#catalog-configuration
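As a rough illustration of the fix, the setting can be turned off when the test `SparkSession` is built. The sketch below only constructs the config entry and applies it to a builder; the catalog name `integration` in the usage comment and both helper names are assumptions, not taken from this PR.

```python
from typing import Any, Dict


def disable_catalog_cache_conf(catalog_name: str) -> Dict[str, str]:
    """Build the Spark config entry that disables Iceberg catalog caching."""
    return {f"spark.sql.catalog.{catalog_name}.cache-enabled": "false"}


def apply_conf(builder: Any, conf: Dict[str, str]) -> Any:
    """Apply each key/value pair to a SparkSession builder via .config()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage (requires pyspark; "integration" is a hypothetical catalog name):
#   from pyspark.sql import SparkSession
#   builder = apply_conf(SparkSession.builder, disable_catalog_cache_conf("integration"))
#   spark = builder.getOrCreate()
```

Keeping the key construction in one small function makes it easy to disable caching for each catalog the test fixtures register.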