Introduce kudo reader. #2578

liurenjie1024 · 2024-11-07T06:16:42Z

This PR is part of #2532 and introduces the reader part of the kudo format.

Signed-off-by: liurenjie1024 <[email protected]>

liurenjie1024 · 2024-11-07T09:28:47Z

Blocked by rapidsai/cudf#17265

Signed-off-by: liurenjie1024 <[email protected]>

src/main/java/com/nvidia/spark/rapids/jni/kudo/ColumnOffsetInfo.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/ColumnViewInfo.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoHostMergeResult.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/ColumnOffsetInfo.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoSerializer.java

jlowe · 2024-11-07T22:46:39Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

+
+        // Update destination byte with the bits from source byte
+        destByte = (byte) ((destByte | (srcByte << curDestBitIdx)) & 0xFF);
+        dest.setByte(curDestByteIdx, destByte);


Same comment as #2532 (comment). I'm OK if this is done as a followup issue for performance, but I would expect it to be approximately 4x faster to walk this word-by-word with a single-instruction bitcount rather than a byte-by-byte loop and a table lookup.

I prefer to do this optimization in a follow-up issue, #2579, so that we have full test coverage and some benchmarks for measuring the benefits.
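For reference, a minimal sketch of the word-by-word idea discussed above. This is not the code in this PR or in #2579; the class and method names are hypothetical, and it assumes a packed validity buffer where bit `i` lives in byte `i / 8` at position `i % 8`. It counts set bits in a bit range by consuming 64 bits per iteration with `Long.bitCount`, falling back to single bits only at the unaligned head and tail.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: counts set bits in a bit range of a packed validity buffer. */
final class ValidityBitCountSketch {

  /** Baseline for comparison: tests one bit at a time. */
  static int countSetBitsNaive(byte[] buf, int startBit, int numBits) {
    int count = 0;
    for (int bit = startBit; bit < startBit + numBits; bit++) {
      count += (buf[bit >> 3] >> (bit & 7)) & 1;
    }
    return count;
  }

  /** Word-by-word variant: 64 bits per iteration via Long.bitCount. */
  static int countSetBits(byte[] buf, int startBit, int numBits) {
    int bit = startBit;
    int end = startBit + numBits;
    int count = 0;

    // Head: single bits until we reach a byte boundary.
    while (bit < end && (bit & 7) != 0) {
      count += (buf[bit >> 3] >> (bit & 7)) & 1;
      bit++;
    }

    // Body: whole 64-bit words. Byte order does not affect the bit count.
    ByteBuffer words = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
    while (end - bit >= 64) {
      count += Long.bitCount(words.getLong(bit >> 3));
      bit += 64;
    }

    // Tail: remaining bits (fewer than 64).
    while (bit < end) {
      count += (buf[bit >> 3] >> (bit & 7)) & 1;
      bit++;
    }
    return count;
  }
}
```

Whether this reaches the estimated speedup depends on the JIT and the buffer type actually used, so the benchmarks mentioned above would still be needed.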

jlowe · 2024-11-07T22:47:24Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

+    int curDestByteIdx = startBit / 8;
+    int curDestBitIdx = startBit % 8;
+
+    while (curIdx < totalRowCount) {


Same comment as #2532 (comment)

See discussion above.

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

Signed-off-by: liurenjie1024 <[email protected]>

abellina · 2024-11-08T15:07:07Z

I will review this today as well

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoHostMergeResult.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/MergedInfoCalc.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/ColumnViewInfo.java

src/main/java/com/nvidia/spark/rapids/jni/kudo/TableBuilder.java

jlowe · 2024-11-08T17:34:11Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/MultiKudoTableVisitor.java

+    private final long[] currentOffsetOffsets;
+    private final long[] currentDataOffset;
+    private final Deque<SliceInfo>[] sliceInfoStack;
+    private final Deque<Long> totalRowCountStack;


We cannot support a row count larger than an int. I'm fine with using a long while we're accumulating, to capture overflow conditions (e.g., within the constructor when it accumulates and checks for overflow). I see no reason to track computed total rows that are larger than an int, because the overflow check should have already happened in the constructor and in updateOffsets, which is missing an overflow check.
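A minimal sketch of the pattern described in this comment: accumulate in a `long`, check once, and keep an `int` afterwards. The class and method names are hypothetical and this is not the PR's `MultiKudoTableVisitor` code; it only illustrates the overflow check.

```java
/** Illustrative only: accumulate row counts in a long, then narrow to int once. */
final class RowCountSketch {

  /**
   * Sums per-table row counts. The long accumulator cannot overflow when
   * adding int values from an array, so a single check at the end suffices.
   *
   * @throws ArithmeticException if the total does not fit in an int
   */
  static int totalRowCount(int[] rowCountsPerTable) {
    long total = 0;
    for (int rows : rowCountsPerTable) {
      if (rows < 0) {
        throw new IllegalArgumentException("negative row count: " + rows);
      }
      total += rows;
    }
    // Math.toIntExact throws ArithmeticException on overflow.
    return Math.toIntExact(total);
  }
}
```

Once the total has been validated this way, anything derived from it that is bounded by the row count (such as a null count) can stay an `int` without a second check.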

Signed-off-by: liurenjie1024 <[email protected]>

…ader

liurenjie1024 · 2024-11-11T09:29:15Z

build

liurenjie1024 · 2024-11-12T01:50:15Z

The blossom CI failure is a known issue, caused by a failed cudf sync.

…ader

liurenjie1024 · 2024-11-12T08:31:34Z

build

pxLi · 2024-11-12T09:12:58Z

build

pxLi · 2024-11-12T09:14:52Z

Re-kicked. The previous trigger failed because the GitHub server side reset the connection:

Received RST_STREAM: Stream not processed

jlowe

Minor nit but otherwise lgtm. Would be good to hear from @abellina before merging since he mentioned he intended to review.

jlowe · 2024-11-12T16:08:49Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

+
+          startRow += sliceInfo.getRowCount();
+        }
+        return toIntExact(nullCountTotal);


nullCountTotal is already an int, so this is wasteful. There should already be a row count overflow check elsewhere (and we cannot overflow null counts without also overflowing row counts), so it seems like this should simply return nullCountTotal directly.

abellina

I'd still like to do more passes, but don't block for me.

abellina · 2024-11-13T22:19:33Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoHostMergeResult.java

+
+  KudoHostMergeResult(Schema schema, HostMemoryBuffer hostBuf, List<ColumnViewInfo> columnInfoList) {
+    requireNonNull(schema, "schema is null");
+    requireNonNull(columnInfoList, "columnOffsets is null");


Suggested change

-    requireNonNull(columnInfoList, "columnOffsets is null");
+    requireNonNull(columnInfoList, "columnInfoList is null");

abellina · 2024-11-13T22:20:52Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoHostMergeResult.java

+    if (hostBuf != null) {
+      hostBuf.close();
+    }


This close() checks for null, but it doesn't null out hostBuf.

Suggested change

-    if (hostBuf != null) {
-      hostBuf.close();
-    }
+    if (hostBuf != null) {
+      hostBuf.close();
+      hostBuf = null;
+    }

I just realized that we already check for non-null in the constructor.

abellina · 2024-11-13T22:22:56Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoHostMergeResult.java

+        Cuda.DEFAULT_STREAM.sync();
+        return t;
+      } catch (Exception e) {
+        throw new RuntimeException(e);


why do we need to wrap this in a RuntimeException?

So that we could use a concise syntax here:

spark-rapids-jni/src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoSerializer.java

Line 286 in 73b9ac9

builder::convertToTableTime);
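To illustrate the trade-off in this reply: a method that declares a checked exception cannot be passed as a `java.util.function.Supplier`, so wrapping it in a `RuntimeException` keeps the call site a plain method reference. The sketch below uses hypothetical names (`timed`, `Builder`, `buildChecked`); the actual signatures in `KudoSerializer` are not shown in this thread and are not assumed here.

```java
import java.util.function.Supplier;

/** Illustrative only: why an unchecked wrapper keeps method references concise. */
final class CheckedWrapSketch {

  /** Hypothetical timing helper that accepts a Supplier, which cannot throw checked exceptions. */
  static <T> T timed(Supplier<T> task, long[] elapsedNanos) {
    long start = System.nanoTime();
    T result = task.get();
    elapsedNanos[0] += System.nanoTime() - start;
    return result;
  }

  /** Hypothetical builder whose real work may throw a checked exception. */
  static final class Builder {
    private final boolean fail;

    Builder(boolean fail) {
      this.fail = fail;
    }

    String buildChecked() throws Exception {
      if (fail) {
        throw new Exception("build failed");
      }
      return "table";
    }

    /** Unchecked wrapper, so that builder::build fits Supplier<String>. */
    String build() {
      try {
        return buildChecked();
      } catch (Exception e) {
        // Wrapped only to satisfy Supplier; the original cause is preserved.
        throw new RuntimeException(e);
      }
    }
  }
}
```

With this shape, a call such as `timed(builder::build, elapsed)` compiles, whereas `timed(builder::buildChecked, elapsed)` would not. The cost is that callers no longer see the exception type in the signature, which is the concern raised in the review.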

abellina · 2024-11-13T22:45:16Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoSerializer.java

+
+      return Pair.of(table, builder.build());
+    } catch (Exception e) {
+      throw new RuntimeException(e);


same here. I suppose there's an exception we are throwing here, and the prior case, that we would need to declare in the interface (and it makes the api a little odd?)

It might be nice to add a comment here and the other case about why.

I'm not a big fan of Java's checked exceptions, since they make modern syntax such as lambdas more difficult to use. But I agree that we should add it here: this is the entry point of a public API, so it makes sense to remind users of the exceptions.

abellina · 2024-11-13T22:58:44Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/MultiKudoTableVisitor.java

+      this.sliceInfoStack[i] = new ArrayDeque<>(16);
+      this.sliceInfoStack[i].add(new SliceInfo(header.getOffset(), header.getNumRows()));
+    }
+    long totalRowCount = tables.stream().mapToLong(t -> t.getHeader().getNumRows()).sum();


nit: you could keep track of totalRowCount in the loop above, rather than re-streaming here.

abellina · 2024-11-13T23:04:21Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/MultiKudoTableVisitor.java

+
+    T t = doVisit(primitiveType);
+    if (primitiveType.getType().hasOffsets()) {
+      updateOffsets(true, true, false, -1);


Could we add comments for each argument here?

updateOffsets(/*updateOffset*/ true, /*updateData*/ true, /*updateSliceInfo*/ false, /*sizeInBytes*/ -1)

liurenjie1024 · 2024-11-14T05:43:49Z

cc @jlowe @abellina PTAL. If there are no major blocking issues, could we merge this first so that I can start integrating with spark-rapids? Minor issues can be resolved in parallel.

liurenjie1024 · 2024-11-14T06:53:40Z

build

Introduce kudo reader

3164564

Signed-off-by: liurenjie1024 <[email protected]>

liurenjie1024 requested review from jlowe and revans2 November 7, 2024 06:18

liurenjie1024 added 2 commits November 7, 2024 16:23

Add tests

344ee35

Signed-off-by: liurenjie1024 <[email protected]>

Fix column view

c48f813

Signed-off-by: liurenjie1024 <[email protected]>

Fix build break

73ce8a1

Signed-off-by: liurenjie1024 <[email protected]>

jlowe reviewed Nov 7, 2024

liurenjie1024 added 2 commits November 8, 2024 14:39

Fix build comments

73b9ac9

Signed-off-by: liurenjie1024 <[email protected]>

Remove unused case

f39e1a0

Signed-off-by: liurenjie1024 <[email protected]>

abellina self-requested a review November 8, 2024 15:06

jlowe reviewed Nov 8, 2024

liurenjie1024 added 5 commits November 11, 2024 11:16

Fix comments

3aa7f56

Signed-off-by: liurenjie1024 <[email protected]>

Fix SchemaVisitor

a2794f2

Signed-off-by: liurenjie1024 <[email protected]>

Merge remote-tracking branch 'upstream/branch-24.12' into ray/kudo-re…

b932788

…ader

Fix tests

a396077

Update cudf

efb4a0c

Merge remote-tracking branch 'upstream/branch-24.12' into ray/kudo-re…

27d15ee

…ader

jlowe previously approved these changes Nov 12, 2024

abellina reviewed Nov 13, 2024

Fix comments

a33eb89

liurenjie1024 dismissed jlowe’s stale review via a33eb89 November 14, 2024 05:42

abellina approved these changes Nov 14, 2024

jlowe approved these changes Nov 14, 2024

jlowe merged commit e5e1603 into NVIDIA:branch-24.12 Nov 14, 2024
3 checks passed

liurenjie1024 deleted the ray/kudo-reader branch November 15, 2024 01:45

liurenjie1024 restored the ray/kudo-reader branch November 15, 2024 07:53

sameerz added the performance label Nov 25, 2024
