[Bug] [spark] MemorySegmentUtils#getIntMultiSegments throws ArrayIndexOutOfBoundsException when deserializing CommitMessage #5035

Open
xumanbu opened this issue Feb 7, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

xumanbu commented Feb 7, 2025

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

paimon-spark-3.5-1.1-20250206.002614-45.jar

Compute Engine

Spark 3.4.4
Spark 3.5.4

Minimal reproduce step

I have a DataFrame with 900 fields and attempted to write it to a Paimon table. A small amount of data writes without errors, but writing over 100,000 records results in an error.

In addition, the Spark stages complete successfully and only the driver reports an exception. However, when I check the data files in the HDFS path, they still exist and have not been cleaned up, which may not be the expected behavior.
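
For context, below is a rough sketch of a write of this shape. The column names, table name, data values, and row count are illustrative assumptions, not the exact job from the report; the column types follow the mix reported later in this thread, and a configured Paimon catalog plus an existing target table are assumed.

```scala
// Hypothetical reproduce sketch: a ~900-column DataFrame of BIGINT / FLOAT /
// ARRAY<BIGINT> fields, written as >100,000 rows into an existing Paimon table.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("paimon-wide-row-repro").getOrCreate()

// 900 columns, cycling through the three column types reported later in the thread.
val fields = (0 until 900).map { i =>
  i % 3 match {
    case 0 => StructField(s"c$i", LongType)             // BIGINT
    case 1 => StructField(s"c$i", FloatType)            // FLOAT
    case _ => StructField(s"c$i", ArrayType(LongType))  // ARRAY<BIGINT>
  }
}
val schema = StructType(fields)

// Generate >100,000 rows; the report says small writes succeed while larger ones fail.
val rows = spark.sparkContext.parallelize(0L until 200000L).map { id =>
  Row.fromSeq(fields.map { f =>
    f.dataType match {
      case LongType     => id
      case FloatType    => id.toFloat
      case _: ArrayType => Seq(id, id + 1)
    }
  })
}

spark.createDataFrame(rows, schema).createOrReplaceTempView("wide_source")

// The stack trace shows the job going through spark.sql(...); the target table name
// here is hypothetical and assumed to already exist as a Paimon table with this schema.
spark.sql("INSERT INTO my_db.wide_paimon_table SELECT * FROM wide_source")
```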

25/02/07 16:24:52 INFO [main] DAGScheduler: Job 4 finished: collect at PaimonSparkWriter.scala:235, took 19.418619 s
25/02/07 16:24:52 INFO [main] CodeGenerator: Code generated in 23.710223 ms
java.lang.ArrayIndexOutOfBoundsException: 39182
  at org.apache.paimon.memory.MemorySegmentUtils.getIntMultiSegments(MemorySegmentUtils.java:685)
  at org.apache.paimon.memory.MemorySegmentUtils.getInt(MemorySegmentUtils.java:675)
  at org.apache.paimon.data.BinaryArray.pointTo(BinaryArray.java:129)
  at org.apache.paimon.memory.MemorySegmentUtils.readArrayData(MemorySegmentUtils.java:1152)
  at org.apache.paimon.data.BinaryRow.getArray(BinaryRow.java:350)
  at org.apache.paimon.io.DataFileMetaSerializer.fromRow(DataFileMetaSerializer.java:84)
  at org.apache.paimon.io.DataFileMetaSerializer.fromRow(DataFileMetaSerializer.java:34)
  at org.apache.paimon.utils.ObjectSerializer.deserialize(ObjectSerializer.java:81)
  at org.apache.paimon.utils.ObjectSerializer.deserializeList(ObjectSerializer.java:104)
  at org.apache.paimon.table.sink.CommitMessageSerializer.lambda$fileDeserializer$0(CommitMessageSerializer.java:135)
  at org.apache.paimon.table.sink.CommitMessageSerializer.deserialize(CommitMessageSerializer.java:124)
  at org.apache.paimon.table.sink.CommitMessageSerializer.deserialize(CommitMessageSerializer.java:103)
  at org.apache.paimon.spark.commands.PaimonSparkWriter.deserializeCommitMessage(PaimonSparkWriter.scala:408)
  at org.apache.paimon.spark.commands.PaimonSparkWriter.$anonfun$write$17(PaimonSparkWriter.scala:237)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
  at org.apache.paimon.spark.commands.PaimonSparkWriter.write(PaimonSparkWriter.scala:237)
  at org.apache.paimon.spark.commands.WriteIntoPaimonTable.run(WriteIntoPaimonTable.scala:67)
  at org.apache.paimon.spark.SparkWrite.$anonfun$toInsertableRelation$1(SparkWrite.scala:35)
  at org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1(V1FallbackWriters.scala:79)
  at org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1$(V1FallbackWriters.scala:78)
  at org.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.writeWithV1(V1FallbackWriters.scala:34)
  at org.apache.spark.sql.execution.datasources.v2.V1FallbackWriters.run(V1FallbackWriters.scala:66)
  at org.apache.spark.sql.execution.datasources.v2.V1FallbackWriters.run$(V1FallbackWriters.scala:65)
  at org.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.run(V1FallbackWriters.scala:34)
  at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
  at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
  at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744)
  at insertToUserLake(<console>:31)
  ... 47 elided

What doesn't meet your expectations?

  1. The write should succeed.
  2. If the job fails while deserializing the CommitMessage, all newly written data files should be cleaned up.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
xumanbu added the bug (Something isn't working) label on Feb 7, 2025
yangjf2019 (Contributor) commented

Hi @xumanbu, thanks for your issue! Could you please provide more information about this problem? I would like to know the column types of the 900 fields.

xumanbu (Author) commented Feb 7, 2025

@yangjf2019 Thank you for your reply. The fields primarily include the following types: BIGINT, FLOAT, and ARRAY<BIGINT>, in roughly equal proportions.
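
For illustration only, a table of the kind described might be declared as below. The column names, the exact column count, the table name, and the USING clause are assumptions; only the three types come from the comment above, and a SparkSession named spark with a Paimon catalog configured is assumed.

```scala
// Hypothetical Paimon DDL mixing the three reported column types
// (BIGINT, FLOAT, ARRAY<BIGINT>); names and counts are made up.
spark.sql(
  """CREATE TABLE my_db.wide_paimon_table (
    |  c0 BIGINT,
    |  c1 FLOAT,
    |  c2 ARRAY<BIGINT>
    |  -- ... repeated to roughly 900 columns, with roughly even counts of each type
    |) USING paimon""".stripMargin)
```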
