feat: Delay reading from parquet file when creating table and column location #6590
base: rc/v0.37.x
@@ -90,21 +98,25 @@ public LivenessReferent asLivenessReferent() {
@Override
@NotNull
public final Object getStateLock() {
initialize();
I am not sure whether these initialize calls should go here or in ParquetTableLocation.
- Adding them here makes the delayed-initialization logic more generic and extensible to other table locations.
- Putting them in ParquetTableLocation would require every public method of ParquetTableLocation to make an initialize call, which seems excessive. On the other hand, the initialize logic is specific to ParquetTableLocation, so I would prefer to put it there and not touch this class.
Do we need to initialize on the TableLocationState methods?
Does it need to be invoked in any of the "final" methods that AbstractTableLocation defines?
I see it called in handleUpdate. Is it necessary there? If yes, then this sort of abstraction needs to live in one of:
- AbstractTableLocation
- a new abstraction based on TableLocation
- ParquetTableLocation, implementing TableLocation directly (but not extending AbstractTableLocation)
These uncertainties make me want to push the initialization logic into ParquetTableLocation, which I assume should be possible even if it extends AbstractTableLocation?
...table/src/main/java/io/deephaven/engine/table/impl/locations/impl/AbstractTableLocation.java
public ColumnChunkPageStore<ATTR>[] getPageStores(
private ColumnChunkPageStore<ATTR>[] getPageStores(
The class is package-private, and these public methods are not used anywhere else in the repo, so I made them private.
@@ -94,12 +94,12 @@ public final Object getStateLock() {
}

@Override
public final RowSet getRowSet() {
public RowSet getRowSet() {
You are missing the need to override getLastModifiedTimeMillis in this regime.
That said, I think this is a case where we do want to provide a protected method for subclasses to override, but only for the TableLocationState methods. This allows us to leave these implementations final. Something like:
// ------------------------------------------------------------------------------------------------------------------
// TableLocationState implementation
// ------------------------------------------------------------------------------------------------------------------

protected void initializeState() {
}

@Override
@NotNull
public final Object getStateLock() {
    initializeState();
    return state.getStateLock();
}

@Override
public final RowSet getRowSet() {
    initializeState();
    return state.getRowSet();
}

@Override
public final long getSize() {
    initializeState();
    return state.getSize();
}

@Override
public final long getLastModifiedTimeMillis() {
    initializeState();
    return state.getLastModifiedTimeMillis();
}
And then it is just a shim into your normal initialize.
@Override
protected void initializeState() {
    initialize();
}
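A minimal, self-contained sketch of the hook pattern suggested above. BaseLocation and LazyLocation are hypothetical stand-ins, not the real AbstractTableLocation and ParquetTableLocation; only getSize is modeled, and the value 42 and the load counter are purely illustrative.

```java
/** Sketch only: BaseLocation and LazyLocation are hypothetical stand-ins, not Deephaven classes. */
final class InitializeStateHookSketch {

    abstract static class BaseLocation {
        private long size; // stand-in for the wrapped state object

        /** No-op hook; eagerly initialized subclasses simply do not override it. */
        protected void initializeState() {}

        /** Lets a subclass publish the state it loaded. */
        protected final void setSize(final long newSize) {
            size = newSize;
        }

        /** Stays final: the hook runs before the state is read. */
        public final long getSize() {
            initializeState();
            return size;
        }
    }

    static final class LazyLocation extends BaseLocation {
        private boolean initialized; // guarded by the lock held in initialize()
        private int loads;           // counts simulated metadata reads

        @Override
        protected void initializeState() {
            initialize();
        }

        private synchronized void initialize() {
            if (initialized) {
                return;
            }
            loads++;
            setSize(42L); // illustrative value standing in for file metadata
            initialized = true;
        }

        synchronized int loadCount() {
            return loads;
        }
    }

    public static void main(final String[] args) {
        final LazyLocation location = new LazyLocation();
        System.out.println(location.loadCount()); // 0: nothing is read at construction
        System.out.println(location.getSize());   // 42
        System.out.println(location.getSize());   // 42
        System.out.println(location.loadCount()); // 1: metadata was read only once
    }
}
```

The base class keeps its accessors final, so a subclass can only influence when state becomes available, not how it is returned.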
private boolean isInitialized;
private volatile boolean isInitializedVolatile;
I don't know why you are using a pattern that pairs a non-volatile flag with a volatile one. I doubt there is a significant performance consideration; if there were, I would want to investigate further.
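For comparison, a sketch of the conventional alternative: double-checked locking with a single volatile flag. LazyMetadata, its version field, and the placeholder string are assumptions made for illustration and are not taken from the PR.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: LazyMetadata is a hypothetical class, not part of the PR. */
final class SingleVolatileFlagSketch {

    static final class LazyMetadata {
        private volatile boolean isInitialized;
        // Written only inside initialize(), before isInitialized is set to true.
        private String version;
        final AtomicInteger loads = new AtomicInteger();

        private void initialize() {
            if (isInitialized) {
                return; // fast path: a single volatile read
            }
            synchronized (this) {
                if (isInitialized) {
                    return; // another thread finished while we waited for the lock
                }
                loads.incrementAndGet();
                version = "placeholder-version"; // illustrative value
                isInitialized = true; // volatile write publishes the field written above
            }
        }

        String getVersion() {
            initialize();
            return version;
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        final LazyMetadata metadata = new LazyMetadata();
        final CountDownLatch start = new CountDownLatch(1);
        final Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at roughly the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                metadata.getVersion();
            });
            threads[i].start();
        }
        start.countDown();
        for (final Thread thread : threads) {
            thread.join();
        }
        System.out.println("loads = " + metadata.loads.get());
    }
}
```

The second check inside the synchronized block is what keeps the load from running more than once when several threads pass the first check together.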
this.isInitializedVolatile = false;
}

private void initialize() {
I think we should be more explicit here, since from what I can tell we have a two-stage initialization. I would call this initializeReaders, or something like that. Then, instead of calling it from many different places, we should only need to call it in the methods that actually read columnChunkReaders, that is, exists and fetchValues.
if (isInitialized) {
return;
}
tl().initialize();
I'm not sure why it is our responsibility to call tl().initialize(); this should be handled internally by the implementation IMO.
I think this is a crutch; tl().getRowGroupReaders() should be calling initialize if it needs to access rowGroupIndices.
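A sketch of what a self-initializing accessor could look like, so that callers never invoke initialize themselves. TableLoc and ColumnLoc are hypothetical stand-ins, and the String array merely represents row-group readers; none of this is the actual ParquetTableLocation code.

```java
/** Sketch only: TableLoc and ColumnLoc are hypothetical stand-ins, not Deephaven classes. */
final class SelfInitializingAccessorSketch {

    static final class TableLoc {
        private boolean initialized;      // guarded by the lock held in initialize()
        private String[] rowGroupReaders; // guarded by initialize()
        private int loads;

        /** Private: callers cannot, and need not, trigger initialization explicitly. */
        private synchronized void initialize() {
            if (initialized) {
                return;
            }
            loads++;
            rowGroupReaders = new String[] {"rowGroup-0", "rowGroup-1"}; // illustrative data
            initialized = true;
        }

        /** The accessor guards its own state. */
        String[] getRowGroupReaders() {
            initialize();
            return rowGroupReaders;
        }

        synchronized int loadCount() {
            return loads;
        }
    }

    static final class ColumnLoc {
        private final TableLoc tableLocation;

        ColumnLoc(final TableLoc tableLocation) {
            this.tableLocation = tableLocation;
        }

        /** No initialize() call here; the table location handles it internally. */
        boolean exists() {
            return tableLocation.getRowGroupReaders().length > 0;
        }
    }

    public static void main(final String[] args) {
        final TableLoc table = new TableLoc();
        final ColumnLoc column = new ColumnLoc(table);
        System.out.println(table.loadCount()); // 0: creating the column location reads nothing
        System.out.println(column.exists());   // true
        System.out.println(table.loadCount()); // 1
    }
}
```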
@@ -58,7 +60,13 @@ final class ParquetColumnLocation<ATTR extends Values> extends AbstractColumnLoc
private static final int MAX_PAGE_CACHE_SIZE = Configuration.getInstance()
.getIntegerForClassWithDefault(ParquetColumnLocation.class, "maxPageCacheSize", 8192);

private final ParquetTableLocation parquetTableLocation;
Not sure it's worth re-storing just so you can get the specific type; casting in tl() seems good enough to me.
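A small sketch of the casting approach, assuming the constructor only accepts the specific table-location type so the downcast cannot fail. All class names here are hypothetical stand-ins rather than the real Deephaven types.

```java
/** Sketch only: every class here is a hypothetical stand-in, not a Deephaven type. */
final class CastingAccessorSketch {

    static class TableLocation {}

    static final class ParquetLikeTableLocation extends TableLocation {
        int rowGroupCount() {
            return 3; // illustrative value
        }
    }

    abstract static class AbstractColumnLocation {
        private final TableLocation tableLocation;

        AbstractColumnLocation(final TableLocation tableLocation) {
            this.tableLocation = tableLocation;
        }

        final TableLocation getTableLocation() {
            return tableLocation;
        }
    }

    static final class ParquetLikeColumnLocation extends AbstractColumnLocation {

        /** Accepting only the specific type makes the cast in tl() safe. */
        ParquetLikeColumnLocation(final ParquetLikeTableLocation tableLocation) {
            super(tableLocation);
        }

        /** Narrows the reference already held by the base class; no second field is stored. */
        private ParquetLikeTableLocation tl() {
            return (ParquetLikeTableLocation) getTableLocation();
        }

        int rowGroupCount() {
            return tl().rowGroupCount();
        }
    }

    public static void main(final String[] args) {
        final ParquetLikeColumnLocation column =
                new ParquetLikeColumnLocation(new ParquetLikeTableLocation());
        System.out.println(column.rowGroupCount()); // 3
    }
}
```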
if (tableInfo == null) {
return null;
}
Not your code... but can tableInfo == null ever be true in this case?
private ParquetFileReader parquetFileReader;
private int[] rowGroupIndices;

private final RowGroup[] rowGroups;
private final RegionedPageStore.Parameters regionParameters;
private final Map<String, String[]> parquetColumnNameToPath;
private RowGroup[] rowGroups;
private RegionedPageStore.Parameters regionParameters;
private Map<String, String[]> parquetColumnNameToPath;

private final TableInfo tableInfo;
private final Map<String, GroupingColumnInfo> groupingColumns;
private final List<DataIndexInfo> dataIndexes;
private final Map<String, ColumnTypeInfo> columnTypes;
private final List<SortColumn> sortingColumns;
private TableInfo tableInfo;
private Map<String, GroupingColumnInfo> groupingColumns;
private List<DataIndexInfo> dataIndexes;
private Map<String, ColumnTypeInfo> columnTypes;
private List<SortColumn> sortingColumns;

private final String version;
private String version;
I'm tracing down this code more thoroughly... I want to make sure we group all the members that need initialization together. It's a bit obvious in this case, but probably worthwhile to be explicit with something like:
private volatile boolean isInitialized;
// Access to all the following variables must be guarded by initialize()
private ParquetFileReader parquetFileReader;
private int[] rowGroupIndices;
...
private String version;
// -----------------------------------------------------------------
private final RowGroup[] rowGroups;
private final RegionedPageStore.Parameters regionParameters;
private final Map<String, String[]> parquetColumnNameToPath;
private RowGroup[] rowGroups;
A deeper dive into this code has revealed some things that we don't really need to store IMO; for example, rowGroups is only accessed in computeIndex, which is only called once during initialize.
Let's audit every single member variable to make sure:
- It's properly guarded by an initialize()
- It actually needs to exist as a member variable
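One way to read the second audit point, sketched with hypothetical names: data that is consumed only during initialization can stay a local variable, while only the derived result is kept as a guarded field. The row-group sizes below are made-up values, and computeTotalRows is an illustrative helper rather than the real computeIndex.

```java
/** Sketch only: Location and computeTotalRows are hypothetical, not the PR's code. */
final class LocalInsteadOfFieldSketch {

    static final class Location {
        private boolean initialized; // guarded by the lock held in initialize()
        // Access to the following field must be guarded by initialize().
        private long totalRows;

        private synchronized void initialize() {
            if (initialized) {
                return;
            }
            // Local, not a field: nothing outside initialization reads it.
            final long[] rowGroupSizes = {100L, 250L, 50L}; // made-up values
            totalRows = computeTotalRows(rowGroupSizes);
            initialized = true;
        }

        private static long computeTotalRows(final long[] rowGroupSizes) {
            long total = 0;
            for (final long size : rowGroupSizes) {
                total += size;
            }
            return total;
        }

        long getTotalRows() {
            initialize();
            return totalRows;
        }
    }

    public static void main(final String[] args) {
        System.out.println(new Location().getTotalRows()); // 400
    }
}
```

Keeping the intermediate array local also means there is one less member that needs to be covered by the initialization guard.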
public final BasicDataIndex getDataIndex(@NotNull final String... columns) {
initialize();
return super.getDataIndex(columns);
}
This seems like a poorly scoped initialize; it is not clear what member variables this is guarding. Assuming the member variables are guarded correctly, anything that super.getDataIndex calls should already be properly guarded.
@@ -222,7 +222,7 @@ private BasicDataIndex getDataIndex() {

@Override
@Nullable
public final BasicDataIndex getDataIndex(@NotNull final String... columns) {
public BasicDataIndex getDataIndex(@NotNull final String... columns) {
As mentioned on the implementation, it should not need to override this in order to initialize itself properly. We should leave this final.
https://deephaven.atlassian.net/browse/DH-18174