Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Delay reading from parquet file when creating table and column location #6590

Open
wants to merge 9 commits into
base: rc/v0.37.x
Choose a base branch
from

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam added feature request New feature or request parquet Related to the Parquet integration NoDocumentationNeeded ReleaseNotesNeeded Release notes are needed labels Jan 23, 2025
@malhotrashivam malhotrashivam added this to the 0.38.0 milestone Jan 23, 2025
@malhotrashivam malhotrashivam self-assigned this Jan 23, 2025
@malhotrashivam malhotrashivam changed the base branch from main to release/v0.37.4 January 23, 2025 21:35
@malhotrashivam malhotrashivam changed the title DH-18174: Delay reading from parquet file when creating table and column location feat: Delay reading from parquet file when creating table and column location Jan 23, 2025
@malhotrashivam malhotrashivam marked this pull request as draft January 23, 2025 21:36
@@ -90,21 +98,25 @@ public LivenessReferent asLivenessReferent() {
@Override
@NotNull
public final Object getStateLock() {
initialize();
Copy link
Contributor Author

@malhotrashivam malhotrashivam Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if these initialize calls should be put here or in ParquetTableLocation

  • Adding it here makes the entire delayed initialization logic more generic and extensive for other table locations too.
  • Putting it in ParquetTableLocation would require each public method of ParquetTableLocation to have an initialize call , which seems excessive. But then the initialize logic is ParquetTableLocation specific, so I do want to put it there and not touch this class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to initialize on the TableLocationState methods?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be invoked in any of the "final" methods that AbstractTableLocation defines?

I see it called in handleUpdate. Is it necessary there? If yes, then this sort of abstraction either needs to be in one of

  1. AbstractTableLocation
  2. new abstraction based off of TableLocation
  3. in ParquetTableLocation which implements TableLocation (but not AbstractTableLocation)

These uncertainties make me want to push the initialization logic into ParquetTableLocation, which I'm assuming should be possible even if it extends AbstractTableLocation?

@malhotrashivam malhotrashivam changed the base branch from release/v0.37.4 to rc/v0.37.x January 24, 2025 17:45
@@ -90,21 +98,25 @@ public LivenessReferent asLivenessReferent() {
@Override
@NotNull
public final Object getStateLock() {
initialize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to initialize on the TableLocationState methods?

@@ -90,21 +98,25 @@ public LivenessReferent asLivenessReferent() {
@Override
@NotNull
public final Object getStateLock() {
initialize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be invoked in any of the "final" methods that AbstractTableLocation defines?

I see it called in handleUpdate. Is it necessary there? If yes, then this sort of abstraction either needs to be in one of

  1. AbstractTableLocation
  2. new abstraction based off of TableLocation
  3. in ParquetTableLocation which implements TableLocation (but not AbstractTableLocation)

These uncertainties make me want to push the initialization logic into ParquetTableLocation, which I'm assuming should be possible even if it extends AbstractTableLocation?

Comment on lines -261 to +300
public ColumnChunkPageStore<ATTR>[] getPageStores(
private ColumnChunkPageStore<ATTR>[] getPageStores(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class is package private and these methods are public and not used anywhere else in the repo, so made them private.

@malhotrashivam malhotrashivam marked this pull request as ready for review January 27, 2025 20:27
@@ -94,12 +94,12 @@ public final Object getStateLock() {
}

@Override
public final RowSet getRowSet() {
public RowSet getRowSet() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are missing the need to override getLastModifiedTimeMillis in this regime.

That said, I think this is a case where we do want to provide a protected method for callers to override, but only for the TableLocationState methods. This allows us to leave these implementations final. Something like

    // ------------------------------------------------------------------------------------------------------------------
    // TableLocationState implementation
    // ------------------------------------------------------------------------------------------------------------------

    protected void initializeState() {
        
    }
    
    @Override
    @NotNull
    public final Object getStateLock() {
        initializeState();
        return state.getStateLock();
    }

    @Override
    public final RowSet getRowSet() {
        initializeState();
        return state.getRowSet();
    }

    @Override
    public final long getSize() {
        initializeState();
        return state.getSize();
    }

    @Override
    public final long getLastModifiedTimeMillis() {
        initializeState();
        return state.getLastModifiedTimeMillis();
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then it is just a shim into your normal initialize.

    @Override
    protected void initializeState() {
        initialize();
    }

Comment on lines +67 to +68
private boolean isInitialized;
private volatile boolean isInitializedVolatile;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why you are using a pattern that involves non-vol + vol. I doubt there is a big performance consideration; and if there was, I would want to investigate further.

this.isInitializedVolatile = false;
}

private void initialize() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be more explicit here since we have a two-stage initialization from what I can tell. I would call this initializeReaders, or something like that. And then, instead of calling it from a bunch of different places, we should only need to call it in the methods that actually wants to read columnChunkReaders - that is, exists and fetchValues.

if (isInitialized) {
return;
}
tl().initialize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it is our responsibility to call tl().initialize(); this should be handled internally by the implementation IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a crutch; tl().getRowGroupReaders() should be calling initialize if it needs to access rowGroupIndices.

@@ -58,7 +60,13 @@ final class ParquetColumnLocation<ATTR extends Values> extends AbstractColumnLoc
private static final int MAX_PAGE_CACHE_SIZE = Configuration.getInstance()
.getIntegerForClassWithDefault(ParquetColumnLocation.class, "maxPageCacheSize", 8192);

private final ParquetTableLocation parquetTableLocation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure it's worth re-storing just so you can get the specific type; casting in tl() seems good enough to me.

Comment on lines 291 to 293
if (tableInfo == null) {
return null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your code... but can tableInfo == null ever be true in this case?

Comment on lines +61 to +74
private ParquetFileReader parquetFileReader;
private int[] rowGroupIndices;

private final RowGroup[] rowGroups;
private final RegionedPageStore.Parameters regionParameters;
private final Map<String, String[]> parquetColumnNameToPath;
private RowGroup[] rowGroups;
private RegionedPageStore.Parameters regionParameters;
private Map<String, String[]> parquetColumnNameToPath;

private final TableInfo tableInfo;
private final Map<String, GroupingColumnInfo> groupingColumns;
private final List<DataIndexInfo> dataIndexes;
private final Map<String, ColumnTypeInfo> columnTypes;
private final List<SortColumn> sortingColumns;
private TableInfo tableInfo;
private Map<String, GroupingColumnInfo> groupingColumns;
private List<DataIndexInfo> dataIndexes;
private Map<String, ColumnTypeInfo> columnTypes;
private List<SortColumn> sortingColumns;

private final String version;
private String version;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tracing down this code more thoroughly... I want to make sure we group all the members that need initialization together. It's a bit obvious in this case, but probably worthwhile to be explicit with something like:

    private volatile boolean isInitialized;
    
    // Access to all the following variables must be guarded by initialize()    
    private ParquetFileReader parquetFileReader;
    private int[] rowGroupIndices;
...
    private String version;    
    // -----------------------------------------------------------------

private final RowGroup[] rowGroups;
private final RegionedPageStore.Parameters regionParameters;
private final Map<String, String[]> parquetColumnNameToPath;
private RowGroup[] rowGroups;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A deeper dive into this code has revealed some things that we don't really need to store IMO; for example, rowGroups is only accessed in computeIndex which is only called once during initialize.

Let's audit every single member variable to make sure:

  1. It's properly guarded by an initialize()
  2. It actually needs to exist as a member variable

Comment on lines +282 to +285
public final BasicDataIndex getDataIndex(@NotNull final String... columns) {
initialize();
return super.getDataIndex(columns);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a poorly scoped initialize; it is not clear what member variables this is guarding. Assuming the member variables are guarded correctly, anything that super.getDataIndex calls should already be properly guarded.

@@ -222,7 +222,7 @@ private BasicDataIndex getDataIndex() {

@Override
@Nullable
public final BasicDataIndex getDataIndex(@NotNull final String... columns) {
public BasicDataIndex getDataIndex(@NotNull final String... columns) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the implementation, the implementation should not need to override this to properly initialize itself. We should leave this final.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request NoDocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants