Creating Delete Vectors using Java API or Spark #11968

Closed
piyushdubey opened this issue Jan 15, 2025 · 5 comments · Fixed by #11986
Labels
question Further information is requested

Comments

@piyushdubey

Query engine: Spark

Question

Q1 - Is it possible to create Deletion Vectors in Apache Iceberg? Is a Deletion Vector file generated only for a specific delete type (positional vs. equality)?

Q2 - Looking for pointers on creating Delete Vectors in Apache Iceberg using Spark or the Java API. So far I have tried creating a table using Spark, inserting a few rows, and deleting a few, with "write.delete.vector.enabled" set to true for the table, in the hope that it would generate Deletion Vector(s). Am I missing any steps here?

Here's the code snippet - https://github.com/piyushdubey/dataformats/blob/main/src/main/java/net/piyushdubey/data/IcebergTableOperations.java

Appreciate any pointers on this!

@piyushdubey piyushdubey added the question Further information is requested label Jan 15, 2025
@nastra (Contributor) commented Jan 15, 2025

Not sure where you got the write.delete.vector.enabled property from, but that property doesn't exist.
The DV code hasn't been officially released yet. You could either wait for the 1.8.0 release or use a nightly snapshot.

Q1/Q2: DVs will automatically be produced when the table's format-version is set to 3 and when positional deletes are written. They are not produced for equality deletes. See #11561 for how they are produced within Spark.
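Concretely, instead of the non-existent write.delete.vector.enabled property, the relevant knobs are the table's format-version and a merge-on-read delete mode. A minimal Spark SQL sketch of upgrading an existing table (the table name is hypothetical):

```sql
-- Upgrade a hypothetical existing table to format version 3 so that
-- positional deletes are written as deletion vectors (Puffin files).
ALTER TABLE catalog.db.events SET TBLPROPERTIES (
  'format-version' = '3',
  'write.delete.mode' = 'merge-on-read'
);

-- A row-level DELETE now writes positional deletes, which v3 tables
-- persist as DVs rather than position delete files.
DELETE FROM catalog.db.events WHERE id = 42;
```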

DV work in general is being tracked by #11122.

Let me know if that helps or whether you have any other questions.

@Fokko (Contributor) commented Jan 15, 2025

I was trying out the Nightly Snapshot for PyIceberg (apache/iceberg-python#1516), and noticed that we don't produce any deletion vectors (yet):

spark-sql (default)>       CREATE OR REPLACE TABLE test_deletion_vectors (
                   >         dt     date,
                   >         number integer,
                   >         letter string
                   >       )
                   >       USING iceberg
                   >       TBLPROPERTIES (
                   >         'write.delete.mode'='merge-on-read',
                   >         'write.update.mode'='merge-on-read',
                   >         'write.merge.mode'='merge-on-read',
                   >         'format-version'='3'
                   >       );
Time taken: 0.112 seconds
spark-sql (default)> 
                   >     INSERT INTO test_deletion_vectors
                   >     VALUES
                   >         (CAST('2023-03-01' AS date), 1, 'a'),
                   >         (CAST('2023-03-02' AS date), 2, 'b'),
                   >         (CAST('2023-03-03' AS date), 3, 'c'),
                   >         (CAST('2023-03-04' AS date), 4, 'd'),
                   >         (CAST('2023-03-05' AS date), 5, 'e'),
                   >         (CAST('2023-03-06' AS date), 6, 'f'),
                   >         (CAST('2023-03-07' AS date), 7, 'g'),
                   >         (CAST('2023-03-08' AS date), 8, 'h'),
                   >         (CAST('2023-03-09' AS date), 9, 'i'),
                   >         (CAST('2023-03-10' AS date), 10, 'j'),
                   >         (CAST('2023-03-11' AS date), 11, 'k'),
                   >         (CAST('2023-03-12' AS date), 12, 'l');
Time taken: 1.422 seconds
spark-sql (default)> 
                   >   DELETE FROM test_deletion_vectors WHERE number = 9;

I would expect a Puffin file here:

(screenshot of the file listing omitted)

It is a V3 table:

spark-sql (default)> DESCRIBE TABLE EXTENDED test_deletion_vectors;
dt                  	date                	                    
number              	int                 	                    
letter              	string              	                    
                    	                    	                    
# Metadata Columns  	                    	                    
_spec_id            	int                 	                    
_partition          	struct<>            	                    
_file               	string              	                    
_pos                	bigint              	                    
_deleted            	boolean             	                    
                    	                    	                    
# Detailed Table Information	                    	                    
Name                	rest.default.test_deletion_vectors	                    
Type                	MANAGED             	                    
Location            	s3://warehouse/default/test_deletion_vectors	                    
Provider            	iceberg             	                    
Owner               	root                	                    
Table Properties    	[created-at=2025-01-15T10:13:15.609042430Z,current-snapshot-id=644093306277329092,format=iceberg/parquet,format-version=3,write.delete.mode=merge-on-read,write.merge.mode=merge-on-read,write.parquet.compression-codec=zstd,write.update.mode=merge-on-read]	                    
Time taken: 0.037 seconds, Fetched 18 row(s)

@amogh-jahagirdar (Contributor) commented Jan 17, 2025

@Fokko is this Spark 3.5? I'll have a backport up for 3.4 shortly, so if it's 3.4 or earlier you're probably not going to see those.

Edit: I see that it's Spark 3.5 ... that's strange.

@amogh-jahagirdar (Contributor) commented Jan 17, 2025

Ok, I did some local testing; I think the issue is really in the output file path. We are outputting DVs, but the suffix of the file is the same as for a configured V2 delete file, so for instance the file is called "foo.parquet" even though it's really a Puffin file with the expected DVs when you inspect it. Figuring out why we're outputting this suffix...
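One way to confirm the behavior Amogh describes, assuming local access to the delete file: Parquet files begin with the 4-byte magic `PAR1`, while Puffin files begin with `PFA1` (per the Iceberg Puffin spec), so inspecting the first bytes reveals the real format regardless of the file's extension. A minimal sketch (the file path is hypothetical):

```python
# Sketch: detect whether a delete file is really Parquet or Puffin by its
# leading magic bytes, ignoring the (possibly misleading) file extension.
# Parquet files start with b"PAR1"; Puffin files start with b"PFA1".

def detect_format(path: str) -> str:
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"PAR1":
        return "parquet"
    if magic == b"PFA1":
        return "puffin"
    return "unknown"

# Example (hypothetical path):
# detect_format("/warehouse/.../data/delete-foo.parquet")
# would return "puffin" for a mis-suffixed DV file.
```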

@Fokko (Contributor) commented Jan 20, 2025

@amogh-jahagirdar Thanks Amogh, I checked and it looks good now:

(screenshot omitted)
