Spark: Action to remove missing files #12106
base: main
Conversation
Thanks for working on this @wypoon! I went through for my own education and left some comments, mostly related to the introduced API.
```java
 * <p>When committing, these changes will be applied to the latest table snapshot. Commit conflicts
 * will be resolved by applying the changes to the new latest snapshot and reattempting the commit.
 */
public interface RemoveMissingFiles extends SnapshotUpdate<RemoveMissingFiles> {
```
Initially I didn't understand why you needed this new API and why you didn't use the existing DeleteFiles API. I think I get what the motivation is now. Is this because DeleteFiles can only remove data files and not delete files?
Even now understanding the motivation I feel confused about this new API for the following reasons:
- It partially overlaps with the DeleteFiles API, I mean the part where we remove data files.
- The name might be misleading: it says "MissingFiles" but in fact it does nothing special wrt the files being missing or not. It just removes files from the table metadata.
- Not sure how one could use this API without using the new Spark action. (see my comment below).
I probably lack enough experience in this area, but these might be some options we have here (just throwing out some ideas):
- If the purpose of this new API is to provide the ability to remove DeleteFiles, then we might want to revisit the existing DeleteFiles API to see if we can extend it. There probably is a reason why it doesn't support removing DeleteFiles, but it would be nice to understand.
- If the DeleteFiles API can't be changed for this purpose, then in order to use this API without the Spark action I think a single function taking a path parameter is enough. DeleteFiles has that too; we 'just' have to improve it to take care of delete files as well, if possible. With this approach the name of this API won't be correct, because it has nothing to do with missing files.
- This API could be smarter than just simply removing files from the table. Since it's called RemoveMissingFiles, it could also do the detection/collection of such missing files and then remove them. With this approach, a 'RecoverTable' class/interface name with a 'removeMissingContentFiles' function might be more descriptive (see the sketch after this list).
- With the above approach, later on we would have the API to add further recovery functions to recover from missing metadata.jsons, or missing manifest or snapshot files.
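Just to illustrate what I have in mind; all names below are invented here, nothing like this exists in the PR:

```java
// Hypothetical sketch only: a recovery-oriented interface instead of a
// generic file-removal one. Names are made up for illustration.
public interface RecoverTable extends SnapshotUpdate<RecoverTable> {

  /**
   * Detect content files (data and delete files) that are referenced by table
   * metadata but missing from storage, and remove them from the metadata.
   */
  RecoverTable removeMissingContentFiles();

  // Later: removeMissingManifests(), removeMissingSnapshots(), ...
}
```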
```java
 * @param file a DataFile to remove from the table
 * @return this for method chaining
 */
RemoveMissingFiles deleteFile(DataFile file);
```
I'm thinking of how this interface would be used if not from the introduced Spark action. For instance when some user sees that their table can't be loaded due to a missing file, how would they use this API to fix the table? (Assuming they don't have Spark but they can use the Java API)?
They know a file path from the error message, but should they then figure out if it's a data or delete file? It's probably possible but requires an extra manual step. And then they have to create a DataFile/DeleteFile object somehow so that they can call this API. This seems more problematic, and adds another manual step.
I think if users want to use this API to fix tables without using the Spark action, then the API should rather be something like this:

```java
RemoveMissingFiles deleteFile(CharSequence path);
```

Similar to what DeleteFiles does.
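For example, a user who only has the Java API could then do something like this (hypothetical usage; `catalog` is an existing `Catalog` instance, and the path-based overload is the suggestion above, not what the PR currently exposes):

```java
// Hypothetical usage with a path-based deleteFile(CharSequence) overload;
// table and file names are made up.
Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
table
    .newRemoveFiles()
    .deleteFile("s3://bucket/db/tbl/data/part-00000.parquet") // path taken from the error message
    .commit();
```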
The implementation of `DeleteFiles::deleteFile(CharSequence)` calls `MergingSnapshotProducer::delete(CharSequence)`:
```java
protected void delete(CharSequence path) {
  // this is an old call that never worked for delete files and can only be used to remove data
  // files.
  filterManager.delete(path);
}
```
It cannot be used to remove delete files, only data files. It is better not to call `delete(CharSequence)`, where you have no guarantees on what kind of file the path is for. It is better to call only `delete(DataFile)` and `delete(DeleteFile)`. Therefore it makes sense to have that as the API.
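Roughly, the typed overloads just delegate to the protected typed methods on the snapshot producer (a sketch of the shape; the actual code in the PR may differ slightly):

```java
// Sketch: the typed overloads delegate to MergingSnapshotProducer's
// protected typed delete methods.
@Override
public RemoveMissingFiles deleteFile(DataFile file) {
  delete(file); // MergingSnapshotProducer#delete(DataFile): data-file removal
  return this;
}

@Override
public RemoveMissingFiles deleteFile(DeleteFile file) {
  delete(file); // MergingSnapshotProducer#delete(DeleteFile): delete-file removal,
                // which delete(CharSequence) cannot express
  return this;
}
```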
```java
 *
 * @return this for method chaining
 */
RemoveMissingFiles validateFilesExist();
```
The purpose of this API is to drop missing files. I don't think it makes sense to validate existence of such files.
This does not validate if the files to be removed exist in storage; it validates that they exist in the current metadata. It does make sense to call this.
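For example (hypothetical usage; the file handles come from scanning the current metadata):

```java
// Hypothetical usage: with validateFilesExist(), the commit fails if a file
// passed to deleteFile(...) is not referenced by the current table metadata
// (for example, a concurrent operation already removed it), instead of
// committing a silent no-op.
table
    .newRemoveFiles()
    .deleteFile(missingDataFile)
    .deleteFile(missingDeleteFile)
    .validateFilesExist()
    .commit();
```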
```java
 *
 * @return a new {@link RemoveMissingFiles}
 */
default RemoveMissingFiles newRemoveFiles() {
```
The name `newRemoveFiles()` might be confusing for users because there is already a `newDelete()`, and I'm not sure how clear the difference is.
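To make the comparison concrete, this is how the two entry points would sit side by side (the second one assumes this PR's API; file handles and paths are placeholders):

```java
// Existing API: removes data files only, e.g. by path.
table
    .newDelete()
    .deleteFile("s3://bucket/db/tbl/data/part-00000.parquet")
    .commit();

// API added by this PR (as I understand it): removes data files and delete
// files, given the file handles.
table
    .newRemoveFiles()
    .deleteFile(dataFile)
    .deleteFile(deleteFile)
    .commit();
```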
```java
package org.apache.iceberg;

/** {@link RemoveMissingFiles} implementation. */
public class BaseRemoveFiles extends MergingSnapshotProducer<RemoveMissingFiles>
```
I find a huge overlap between this class and StreamingDelete. I'm wondering if we can somehow avoid code duplication. Inheritance maybe?
```java
  return DataOperations.DELETE;
}
```

```java
  return DataOperations.OVERWRITE;
```
Just for my understanding: Why is the operation an OVERWRITE if we delete DeleteFiles? Is it because there are going to be rows that were meant to be deleted that might re-appear again?
I checked the existing options and I don't think any of them applies comfortably to what we try to do here. Maybe introducing a new `RECOVER` or something?
Well, another dilemma here is whether this class is purely for removing files in general, or whether it has anything to do with missing files and table recovery.
```java
package org.apache.iceberg;

/** {@link RemoveMissingFiles} implementation. */
public class BaseRemoveFiles extends MergingSnapshotProducer<RemoveMissingFiles>
```
The interface is RemoveMissingFiles while this class is BaseRemoveFiles. In the name we lose the information that this has something to do with missing files.
I feel the reason for this is that the patch can't decide whether this new interface is for handling missing files and table recovery in general, or whether it is only introduced because the existing DeleteFiles API doesn't remove DeleteFiles.
```java
List<String> removedDeleteFiles = Lists.newArrayList();

for (DataFile f : dataFiles) {
  if (!fileIO.newInputFile(f.location()).exists()) {
```
These sequentially do an existence check on the storage for each of the data/delete files in the table. Not sure how Spark actions do these in general (lack of experience), but I saw in various parts of the code that such storage operations on high volumes are performed in parallel using an ExecutorService. Do you think it would make sense to add some parallelism here?
I'm wondering how a sequential approach performs in general on 10k, 100k, etc. files. If the time for these checks is negligible, then it's fine as it is. However, on object stores this might be slow, not sure.
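For reference, the driver-side pattern I have seen elsewhere in the codebase looks roughly like this (a sketch only; `dataFiles` and `fileIO` are assumed to be the same variables as in the quoted snippet, and the pool size is arbitrary):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.iceberg.util.Tasks;

// Driver-side parallel existence checks using Iceberg's Tasks utility.
ExecutorService pool = Executors.newFixedThreadPool(8);
List<String> missingDataFiles = new CopyOnWriteArrayList<>();
Tasks.foreach(dataFiles)
    .executeWith(pool)
    .run(f -> {
      if (!fileIO.newInputFile(f.location()).exists()) {
        missingDataFiles.add(f.location());
      }
    });
pool.shutdown();
```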
I agree that this can be slow.
`dataEntries` and `deleteEntries` are `DataFrame`s (`Dataset<Row>`s) computed on executors; ideally, I'd like to filter them to entries for files that exist in storage. I don't recall the exact issue I ran into, but I had tried something earlier along that line and ran into some Iceberg classes not being serializable, so I could not send the task to the executors. So I simply collected everything back to the driver and processed things there. Let me revisit that.
Calling the API to remove the missing files has to be done on the driver though.
I have reworked this. The file existence is checked in executors now. I didn't know about `org.apache.iceberg.SerializableTable` and its subclass `org.apache.iceberg.spark.source.SerializableTableWithSize` before.
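The rough shape of the executor-side check is something like this (a sketch, not the exact code in the PR; `sparkContext` is assumed to be a `JavaSparkContext`, and `dataEntries` is assumed to be projected so that a `file_path` column is available):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.source.SerializableTableWithSize;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Broadcast a serializable copy of the table so executors can use its FileIO,
// then keep only the entries whose file no longer exists in storage.
Broadcast<Table> tableBroadcast =
    sparkContext.broadcast(SerializableTableWithSize.copyOf(table));

Dataset<Row> missingDataEntries =
    dataEntries.filter(
        (FilterFunction<Row>)
            row -> {
              String path = row.getString(row.fieldIndex("file_path"));
              return !tableBroadcast.value().io().newInputFile(path).exists();
            });
```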
I am open to better naming for the new API.
and `StreamingDelete::operation()`:
Removing delete files will undelete deleted rows, i.e., add data. When data is both deleted and added, the data operation is OVERWRITE.
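So the operation choice is roughly the following (a sketch; the flag name here is made up for illustration):

```java
// Sketch: choose the snapshot operation based on what is being removed.
@Override
protected String operation() {
  if (removesDeleteFiles) {
    // Dropping a delete file re-exposes rows that it was masking, so the
    // snapshot both removes and (effectively) adds data -> OVERWRITE.
    return DataOperations.OVERWRITE;
  }
  // Only data files are removed -> a pure DELETE.
  return DataOperations.DELETE;
}
```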
... and add a negative test.
If data and/or delete files are inadvertently deleted from storage, an Iceberg table becomes unreadable. We provide a Spark action for "repairing" such a table by removing the missing files from the metadata.
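For illustration, invoking the action might look roughly like this (the action and method names below are assumed from the PR title and may not match the final API):

```java
import org.apache.iceberg.spark.actions.SparkActions;

// Hypothetical invocation of the new action against a loaded table.
SparkActions.get(spark)
    .removeMissingFiles(table)
    .execute();
```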