make inputs of AsymmetricJoinSizer spillable for non kudo cases #12418

Open
wants to merge 1 commit into
base: branch-25.06
from

Conversation

binmahone
Collaborator

@binmahone binmahone commented Mar 31, 2025

Quoting the issue description from #12417:

In https://github.com/NVIDIA/spark-rapids/pull/12354/files#r2015510439 @abellina asked that we solve the cannot-spill issue for the non-kudo input case of AsymmetricJoinSizer once #12354 is checked in. To address this comment, we need to make non-kudo input for AsymmetricJoinSizer spillable as well; otherwise merging #12354 (even though it fixes a bug itself) will increase the risk of CPU OOM.

This PR closes #12417 by changing SerializedTableColumn's HostMemoryBuffer field into a SpillableHostBuffer field.

@binmahone
Collaborator Author

build

@firestarman
Collaborator

firestarman commented Mar 31, 2025

LGTM but better have more reviews from others.

@binmahone
Collaborator Author

LGTM but better have more reviews from others.

thx @firestarman , @abellina can you also pls take a look?

@binmahone binmahone requested review from firestarman and abellina and removed request for firestarman March 31, 2025 09:00
@sameerz sameerz added the task Work required that improves the product but is not user facing label Mar 31, 2025
firestarman
firestarman previously approved these changes Apr 1, 2025
Collaborator

@firestarman firestarman left a comment

Looks good to me.

@GaryShen2008 GaryShen2008 added the bug Something isn't working label Apr 1, 2025
@GaryShen2008
Collaborator

This PR goes together with #12354, which is trying to fix bug #12353, so I am labeling it as a bug.

Collaborator

@abellina abellina left a comment

As both @binmahone and @revans2 pointed out separately, the host buffers added here are registered with the SpillFramework via trackNoSpill. That means they will not spill under the default config we have in 25.04 until we hit a GPU OOM and need to force some host memory out to disk.

Not only that, but making every shuffle spillable is a bigger change than AsymmetricJoin strictly needs, and it is a change that hasn't had enough soak time, so we should not make it here. We should look into making the change in 25.06, together with turning on host memory limits by default.

JCudfSerialization.concatToHostBuffer(headers, buffers)
withResource(new ArrayBuffer[HostMemoryBuffer]) { buffers =>
val headers = new ArrayBuffer[SerializedTableHeader]
tables.foreach(t => {
Collaborator

We already know the number of headers and buffers, so we could create Arrays instead of ArrayBuffers here.
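A minimal sketch of that suggestion, assuming the element count is known up front. `Header`, `Buffer`, `Table`, and `PreSizedArrays` are hypothetical stand-ins introduced only for illustration; they are not the real `SerializedTableHeader` or `HostMemoryBuffer` types, and resource handling such as `withResource` is left out.

```scala
// Hypothetical stand-ins for SerializedTableHeader and HostMemoryBuffer.
final case class Header(numRows: Int)
final case class Buffer(sizeInBytes: Long)
final case class Table(header: Header, buffer: Buffer)

object PreSizedArrays {
  // Allocates both arrays once at their final size and fills them in a
  // single pass, instead of growing two ArrayBuffers element by element.
  def split(tables: Seq[Table]): (Array[Header], Array[Buffer]) = {
    val headers = new Array[Header](tables.length)
    val buffers = new Array[Buffer](tables.length)
    var i = 0
    tables.foreach { t =>
      headers(i) = t.header
      buffers(i) = t.buffer
      i += 1
    }
    (headers, buffers)
  }
}
```

The fixed-size arrays avoid the resize-and-copy steps an ArrayBuffer may perform while it grows, which is the only point the sketch tries to make.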

@@ -291,21 +291,33 @@ private class GpuColumnarBatchSerializerInstance(metrics: Map[String, GpuMetric]
   */
 class SerializedTableColumn(
     val header: SerializedTableHeader,
-    val hostBuffer: HostMemoryBuffer) extends GpuColumnVectorBase(NullType) with SizeProvider {
+    val shb: SpillableHostBuffer) extends GpuColumnVectorBase(NullType) with SizeProvider {
Collaborator

nit, shb -> spillableHostBuffer
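For readers unfamiliar with the idea behind the field change in the diff above, here is a toy model of a column that holds a spillable handle rather than the raw host bytes. `ToySpillableBuffer` and `ToySerializedColumn` are hypothetical names; this is not the spark-rapids `SpillableHostBuffer` API, and it assumes plain byte arrays and a temp file stand in for host memory and the spill store.

```scala
import java.nio.file.{Files, Path}

// Toy model only: NOT the spark-rapids SpillableHostBuffer API.
// Callers are assumed not to mutate the returned byte array.
final class ToySpillableBuffer(initial: Array[Byte]) {
  private var inMemory: Option[Array[Byte]] = Some(initial)
  private var onDisk: Option[Path] = None
  val length: Long = initial.length.toLong

  def isSpilled: Boolean = synchronized { inMemory.isEmpty }

  // Writes the bytes to a temp file (once) and drops the in-memory copy.
  def spill(): Unit = synchronized {
    inMemory.foreach { bytes =>
      if (onDisk.isEmpty) {
        val path = Files.createTempFile("toy-spill", ".bin")
        Files.write(path, bytes)
        onDisk = Some(path)
      }
      inMemory = None
    }
  }

  // Returns the bytes, reading them back from disk if they were spilled.
  def getBytes(): Array[Byte] = synchronized {
    inMemory.getOrElse {
      val bytes = Files.readAllBytes(onDisk.get)
      inMemory = Some(bytes)
      bytes
    }
  }

  // Deletes the temp file, if any, and releases the in-memory copy.
  def close(): Unit = synchronized {
    onDisk.foreach(p => Files.deleteIfExists(p))
    onDisk = None
    inMemory = None
  }
}

// The column keeps the spillable handle and materializes bytes on demand.
final class ToySerializedColumn(
    val numRows: Int,
    val spillableHostBuffer: ToySpillableBuffer)
```

In this toy, a consumer calls `getBytes()` only when it actually needs the data, so a memory manager is free to call `spill()` in between.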

@binmahone
Collaborator Author

binmahone commented Apr 2, 2025

Synced offline with @abellina. He's still worried that the current fix will put too much pressure on the case where we have a really huge build side and the user is running with default configs. (Even though, in my opinion, we don't even have confidence that such corner cases work with the latest code.)

But we do agree that we can put it off to 25.06 to avoid making rash decisions. In 25.06, we will face three options:

option1: keep subpartitionhashjoin and sizedhashjoin as they are, enable offheaplimit by default, and make everything spillable
option2: get rid of sizedhashjoin and focus on subpartitionhashjoin
option3: enhance sizedhashjoin so that when we find the build side cannot be drained in a limited number of batches, we fall back to subpartitionhashjoin

we'll need to decide which option to use. @revans2 @abellina
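To make option3 concrete, here is a rough sketch of the fallback decision only, under the assumption that "drained in limited batches" means the build-side iterator is exhausted after pulling at most a fixed number of batches. `JoinStrategy`, `JoinChooser`, and `maxBatches` are hypothetical names, not existing spark-rapids code.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical labels for the two join paths discussed above.
sealed trait JoinStrategy
case object SizedHashJoin extends JoinStrategy
case object SubPartitionHashJoin extends JoinStrategy

object JoinChooser {
  // Pulls at most maxBatches from the build side. If the iterator is
  // exhausted by then, the sized join is chosen; otherwise fall back.
  // Returns the strategy, the batches already pulled, and the remainder.
  def choose[T](
      buildSide: Iterator[T],
      maxBatches: Int): (JoinStrategy, Seq[T], Iterator[T]) = {
    val pulled = ArrayBuffer.empty[T]
    while (pulled.length < maxBatches && buildSide.hasNext) {
      pulled += buildSide.next()
    }
    val strategy =
      if (buildSide.hasNext) SubPartitionHashJoin else SizedHashJoin
    (strategy, pulled.toSeq, buildSide)
  }
}
```

The batches already pulled are returned along with the remaining iterator so that a fallback path would not lose data it has already consumed.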

Labels
bug: Something isn't working
task: Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] make inputs of AsymmetricJoinSizer spillable for non kudo cases
5 participants