rebase #1

Merged
merged 10,000 commits into from
Aug 24, 2023

Conversation

Benyuel
Owner

@Benyuel Benyuel commented Aug 24, 2023

No description provided.

HyukjinKwon and others added 30 commits August 5, 2023 00:58
…est didn't reach server in Python client

### What changes were proposed in this pull request?

This is the symmetric counterpart of the fix in #42282.

### Why are the changes needed?

See also #42282

### Does this PR introduce _any_ user-facing change?

See also #42282

### How was this patch tested?

See also #42282

Closes #42338 from HyukjinKwon/SPARK-44671.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ckling errors

### What changes were proposed in this pull request?

This PR improves the error messages when a Python UDTF fails to pickle.

### Why are the changes needed?

To make the error message more user-friendly

### Does this PR introduce _any_ user-facing change?

Yes, before this PR, when a UDTF fails to pickle, it throws this confusing exception:
```
_pickle.PicklingError: Cannot pickle files that are not opened for reading: w
```
After this PR, the error is clearer:
`[UDTF_SERIALIZATION_ERROR] Cannot serialize the UDTF 'TestUDTF': Please check the stack trace and make sure that the function is serializable.`

And for spark session access inside a UDTF:
`[UDTF_SERIALIZATION_ERROR] it appears that you are attempting to reference SparkSession inside a UDTF. SparkSession can only be used on the driver, not in code that runs on workers. Please remove the reference and try again.`
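
For illustration, here is a minimal, hypothetical UDTF that would hit the second case above by capturing the driver-side SparkSession (assumes an active session named `spark`; not code from this PR):

```python
from pyspark.sql.functions import udtf

@udtf(returnType="x: int")
class TestUDTF:
    def eval(self):
        # Referencing `spark` here means the UDTF has to be pickled together
        # with the driver-side SparkSession, which cannot be shipped to workers.
        spark.range(1)
        yield (1,)

# Registering or invoking the UDTF would then surface UDTF_SERIALIZATION_ERROR:
# spark.udtf.register("test_udtf", TestUDTF)
```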

### How was this patch tested?

New UTs.

Closes #42309 from allisonwang-db/spark-44644-pickling.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
…n UDTFs

### What changes were proposed in this pull request?

This PR disables arrow optimization by default for Python UDTFs.

### Why are the changes needed?

To make Python UDTFs consistent with Python UDFs (arrow optimization is by default disabled).
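
For users who still want the Arrow path for UDTFs, it can be re-enabled via the SQL config gating this behavior; a minimal sketch, assuming an active session named `spark` (the config key shown is an assumption and may differ by version):

```python
# Explicitly re-enable Arrow optimization for Python UDTFs (assumed config key).
spark.conf.set("spark.sql.execution.pythonUDTF.arrow.enabled", "true")
```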

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests

Closes #42329 from allisonwang-db/spark-44663-disable-arrow.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
### What changes were proposed in this pull request?
Uninstall large ML libraries for non-ML jobs

### Why are the changes needed?
ML is integrating external frameworks: torch, deepspeed (and maybe xgboost in the future). Those libraries are huge and not needed in other jobs.

This PR uninstalls torch, which saves ~1.3 GB.

![image](https://github.com/apache/spark/assets/7322292/e8181924-ca30-4e1e-8808-659f6a75c1d1)

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
updated CI

Closes #42334 from zhengruifeng/infra_uninstall_torch.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…r creating streaming python processes

### What changes were proposed in this pull request?

Follow-up of this comment: #42283 (comment)
Change back the Spark conf after creating the streaming Python process.

### Why are the changes needed?

Bug fix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Config only change

Closes #42341 from WweiL/SPARK-44433-followup-USEDAEMON.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
…merged

### What changes were proposed in this pull request?
This PR aims to add a new `ProblemFilters` exclusion to `MimaExcludes.scala` to fix the mima check for Scala 2.13 after SPARK-44198 was merged.

### Why are the changes needed?
Scala 2.13's daily tests have been failing the mima check for several days:
- https://github.com/apache/spark/actions/runs/5765663964

<img width="1194" alt="image" src="https://github.com/apache/spark/assets/1475305/7b73aa0d-3e19-4119-bdbe-627f1c715d2b">

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual verification:

1.  The mima check was passing before SPARK-44198.
```
// [SPARK-44425][CONNECT] Validate that user provided sessionId is an UUID
git reset --hard a3bd477
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```

```
[success] Total time: 129 s (02:09), completed 2023-8-5 14:21:06
```

2. The mima check failed after SPARK-44198 was merged

```
// [SPARK-44198][CORE] Support propagation of the log level to the executors
git reset --hard 5fc90fb
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```
```
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error]  * the type hierarchy of object org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#SparkAppConfig is different in current version. Missing types {scala.runtime.AbstractFunction4}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$SparkAppConfig$")
[error] java.lang.RuntimeException: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] 	at scala.sys.package$.error(package.scala:30)
[error] 	at com.typesafe.tools.mima.plugin.SbtMima$.reportModuleErrors(SbtMima.scala:89)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2(MimaPlugin.scala:36)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2$adapted(MimaPlugin.scala:26)
[error] 	at scala.collection.Iterator.foreach(Iterator.scala:943)
[error] 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
[error] 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1(MimaPlugin.scala:26)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1$adapted(MimaPlugin.scala:25)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] 	at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] 	at sbt.Execute.work(Execute.scala:292)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:750)
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] Total time: 82 s (01:22), completed 2023-8-5 14:23:49
```
3. With this PR, the mima check passes.

```
gh pr checkout 42358
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```

```
[success] Total time: 157 s (02:37), completed 2023-8-5 14:31:05
```

Closes #42358 from LuciferYang/SPARK-44687.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…e_disk_space` and `free_disk_space_container`

### What changes were proposed in this pull request?
This PR adds a file existence check before executing `dev/free_disk_space` and `dev/free_disk_space_container`.

### Why are the changes needed?
We added `free_disk_space` and `free_disk_space_container` to clean up the disk, but because the daily tests of other branches and the master branch share the yml file, we should check whether the file exists before execution; otherwise, it will affect the daily tests of other branches.

- branch-3.5: https://github.com/apache/spark/actions/runs/5761479443
- branch-3.4: https://github.com/apache/spark/actions/runs/5760423900
- branch-3.3: https://github.com/apache/spark/actions/runs/5759384052

<img width="1073" alt="image" src="https://github.com/apache/spark/assets/1475305/6e46b34b-645a-4da5-b9c3-8a89bfacabcb">

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Action

Closes #42359 from LuciferYang/test-free_disk_space-exist.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
… java options

### What changes were proposed in this pull request?

### Why are the changes needed?
Command
```bash
 ./bin/spark-shell --conf spark.executor.extraJavaOptions='-Dspark.foo=bar'
```
Error
```
spark.executor.extraJavaOptions is not allowed to set Spark options (was '-Dspark.foo=bar'). Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit.
```

Command
```bash
./bin/spark-shell --conf spark.executor.defaultJavaOptions='-Dspark.foo=bar'
```
It starts up normally.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test and added a UT.

```
./bin/spark-shell --conf spark.executor.defaultJavaOptions='-Dspark.foo=bar'
```

```
spark.executor.defaultJavaOptions is not allowed to set Spark options (was '-Dspark.foo=bar'). Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit.
```

Closes #42313 from cxzl25/SPARK-44650.

Authored-by: sychen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

This PR enhances ConstantPropagation to support more cases:

- Propagate constants through other binary comparisons.
- Propagate constants across equality comparisons. This can be further optimized to false.

### Why are the changes needed?

Improve query performance. [Denodo](https://community.denodo.com/docs/html/browse/latest/en/vdp/administration/optimizing_queries/automatic_simplification_of_queries/removing_redundant_branches_of_queries_partitioned_unions) also has a similar optimization. For example:
```
CREATE TABLE t1(a int, b int) using parquet;
CREATE TABLE t2(x int, y int) using parquet;

CREATE TEMP VIEW v1 AS
SELECT * FROM t1 JOIN t2 WHERE a = x AND a = 0
UNION ALL
SELECT * FROM t1 JOIN t2 WHERE a = x AND (a IS NULL OR a <> 0);

SELECT * FROM v1 WHERE x > 1;
```
Before this PR:
```
== Optimized Logical Plan ==
Union false, false
:- Project [a#0 AS a#12, b#1 AS b#13, x#2 AS x#14, y#3 AS y#15]
:  +- Join Inner
:     :- Filter (isnotnull(a#0) AND (a#0 = 0))
:     :  +- Relation spark_catalog.default.t1[a#0,b#1] parquet
:     +- Filter (isnotnull(x#2) AND ((0 = x#2) AND (x#2 > 1)))
:        +- Relation spark_catalog.default.t2[x#2,y#3] parquet
+- Join Inner, (a#16 = x#18)
   :- Filter ((isnull(a#16) OR NOT (a#16 = 0)) AND ((a#16 > 1) AND isnotnull(a#16)))
   :  +- Relation spark_catalog.default.t1[a#16,b#17] parquet
   +- Filter ((isnotnull(x#18) AND (x#18 > 1)) AND (isnull(x#18) OR NOT (x#18 = 0)))
      +- Relation spark_catalog.default.t2[x#18,y#19] parquet
```
After this PR:
```
== Optimized Logical Plan ==
Join Inner, (a#16 = x#18)
:- Filter ((isnull(a#16) OR NOT (a#16 = 0)) AND ((a#16 > 1) AND isnotnull(a#16)))
:  +- Relation spark_catalog.default.t1[a#16,b#17] parquet
+- Filter ((isnotnull(x#18) AND (x#18 > 1)) AND (isnull(x#18) OR NOT (x#18 = 0)))
   +- Relation spark_catalog.default.t2[x#18,y#19] parquet
```
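
To observe the effect locally, one can inspect the optimized plan; a small sketch, assuming an active session named `spark` and the tables/view created by the SQL above:

```python
# Print parsed/analyzed/optimized/physical plans; with this change the first
# union branch should be pruned from the optimized logical plan.
spark.sql("SELECT * FROM v1 WHERE x > 1").explain(mode="extended")
```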
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #42038 from TongWei1105/SPARK-42500.

Authored-by: TongWei1105 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
…with type arguments

### What changes were proposed in this pull request?
This PR fixes a regression introduced in Spark 3.4.x  where  Encoders.bean is no longer able to process nested beans having type arguments. For example:

```
class A<T> {
   T value;
   // value getter and setter
}

class B {
   A<String> stringHolder;
   // stringHolder getter and setter
}

Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: [ENCODER_NOT_FOUND]..."
```

### Why are the changes needed?
The main match in JavaTypeInference.encoderFor does not handle the ParameterizedType and TypeVariable cases. I think this is a regression introduced after getting rid of the Guava TypeToken usage: [SPARK-42093 SQL Move JavaTypeInference to AgnosticEncoders](1867200#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b) hvanhovell cloud-fan

In this PR I'm leveraging commons-lang3 TypeUtils functionality to resolve ParameterizedType type arguments for classes.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests have been extended to check correct encoding of a nested bean having type arguments.

Closes #42327 from gbloisi-openaire/spark-44634.

Authored-by: Giambattista Bloisi <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
### What changes were proposed in this pull request?
This PR adds a webpage to the Spark docs website, https://spark.apache.org/docs, to outline PySpark testing best practices.

### Why are the changes needed?
The changes are needed to provide PySpark end users with a guideline for how to use PySpark utils (introduced in SPARK-44629) to test PySpark code.

### Does this PR introduce _any_ user-facing change?
Yes, the PR publishes a webpage on the Spark website.

### How was this patch tested?
Existing tests

Closes #42284 from asl3/testing-guidelines.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…terministic order

### What changes were proposed in this pull request?
Method `DataSourceStrategy#selectFilters`, which is used to determine "pushdown-able" filters, does not preserve the order of the input Seq[Expression] nor does it return the same order across the same plans. This is resulting in CodeGenerator cache misses even when the exact same LogicalPlan is executed.

This PR makes sure `selectFilters` returns predicates in a deterministic order.

### Why are the changes needed?
Make sure `selectFilters` returns predicates in deterministic order, to reduce the probability of codegen cache misses.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

Closes #42265 from Hisoka-X/SPARK-41636_selectfilters_order.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
The aim of this PR is to downgrade the Scala 2.13 dependency to 2.13.8 to ensure that Spark can be built with `-target:jvm-1.8` and tested with Java 11/17.

### Why are the changes needed?
As reported in SPARK-44376, there are issues when building and testing with Maven using Java 11/17 with `-target:jvm-1.8`:

- run `build/mvn clean install -Pscala-2.13` with Java 17

```
[INFO] --- scala-maven-plugin:4.8.0:compile (scala-compile-first)  spark-core_2.13 ---
[INFO] Compiler bridge file: /Users/yangjie01/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.8.0-bin_2.13.11__61.0-1.8.0_20221110T195421.jar
[INFO] compiling 602 Scala sources and 77 Java sources to /Users/yangjie01/SourceCode/git/spark-mine-13/core/target/scala-2.13/classes ...
[WARNING] [Warn] : [deprecation   | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71: not found: value sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: not found: object sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: not found: object sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210: not found: type Unsafe
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212: not found: type Unsafe
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: Unused import
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: Unused import
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452: not found: value sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: not found: object sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type SignalHandler
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83: not found: type Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: type SignalHandler
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: value Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114: not found: type Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116: not found: value Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128: not found: value Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
[WARNING] one warning found
[ERROR] 23 errors found

```

- run `build/mvn clean install  -Pscala-2.13 -Djava.version=17` with Java 17

```
[INFO] --- scala-maven-plugin:4.8.0:compile (scala-compile-first)  spark-tags_2.13 ---
[INFO] Compiler bridge file: /Users/yangjie01/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.8.0-bin_2.13.11__61.0-1.8.0_20221110T195421.jar
[INFO] compiling 2 Scala sources and 8 Java sources to /Users/yangjie01/SourceCode/git/spark-mine-13/common/tags/target/scala-2.13/classes ...
[WARNING] [Warn] : [deprecation   | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
[ERROR] [Error] : target platform version 8 is older than the release version 17
[WARNING] one warning found
[ERROR] one error found
```
- run `build/mvn clean package -Pscala-2.13 -DskipTests` or `build/mvn clean install -Pscala-2.13 -DskipTests` with Java 8 first, then run `build/mvn test -Pscala-2.13` with Java 17

```
[INFO] --- scala-maven-plugin:4.8.0:compile (scala-compile-first)  spark-core_2.13 ---
[INFO] Compiler bridge file: /Users/yangjie01/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.8.0-bin_2.13.11__61.0-1.8.0_20221110T195421.jar
[INFO] compiling 602 Scala sources and 77 Java sources to /Users/yangjie01/SourceCode/git/spark-mine-13/core/target/scala-2.13/classes ...
[WARNING] [Warn] : [deprecation   | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71: not found: value sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: not found: object sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: not found: object sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210: not found: type Unsafe
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212: not found: type Unsafe
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236: not found: type DirectBuffer
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: Unused import
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: Unused import
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452: not found: value sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: not found: object sun
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type SignalHandler
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83: not found: type Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: type SignalHandler
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: value Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114: not found: type Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116: not found: value Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128: not found: value Signal
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
[WARNING] one warning found
[ERROR] 23 errors found
```

This is inconsistent with the behavior of the released `Apache Spark` version, so we need to use the previous Scala 2.13 version to support this behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GitHub Actions
- Manually checked; the above commands run normally after this PR

Closes #42364 from LuciferYang/SPARK-44690.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…bject SqlApi`

### What changes were proposed in this pull request?
This PR renames the Setting object used by the `SqlApi` module in `SparkBuild/scala` from `object Catalyst` to `object SqlApi`.

### Why are the changes needed?
The `SqlApi` module should use a more appropriate Setting object name.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions

Closes #42361 from LuciferYang/rename-catalyst-2-sqlapi.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…st Dockerfile

### What changes were proposed in this pull request?
Added tests to the Dockerfile for tests in OSS Spark CI.

### Why are the changes needed?
They'll skip the deepspeed tests otherwise.

### Does this PR introduce _any_ user-facing change?
Nope, testing infra.

### How was this patch tested?
Ran the tests on a machine.

Closes #42347 from mathewjacob1002/testing_infra.

Authored-by: Mathew Jacob <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
This PR converts the `message_parameters` values for pandas error classes to strings, to ensure that error class messages can be compared in tests.

### Why are the changes needed?
The change ensures the ability to compare error class messages in tests.
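
A sketch of the idea with illustrative values (not the actual code in the PR): casting every `message_parameters` value to `str` makes the rendered messages stable and directly comparable in tests.

```python
# Hypothetical parameters for a pandas-on-Spark error class; converting all
# values to strings avoids type-dependent formatting differences in tests.
message_parameters = {"arg_name": "index_type", "arg_value": 3}
message_parameters = {k: str(v) for k, v in message_parameters.items()}
```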

### Does this PR introduce _any_ user-facing change?
No, the PR does not affect the user-facing view of the error messages.

### How was this patch tested?
Updated `python/pyspark/pandas/tests/test_utils.py` and existing tests

Closes #42348 from asl3/string-pandas-error-types.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… some common logic

### What changes were proposed in this pull request?
This PR aims to clear some unused code in "***Errors" and extract some common logic.

### Why are the changes needed?
Make code clear.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #42238 from panbingkun/clear_error.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
This PR proposes to assign the name "CANNOT_PARSE_STRING_AS_DATATYPE" to _LEGACY_ERROR_TEMP_2133.

### Why are the changes needed?
To assign a proper name to _LEGACY_ERROR_TEMP_2133.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test

Closes #42018 from Hisoka-X/SPARK-42321_LEGACY_ERROR_TEMP_2133.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
… `torch` is not installed

### What changes were proposed in this pull request?
Skip `ClassificationTestsOnConnect` when `torch` is not installed

### Why are the changes needed?
We moved the torch-on-connect tests to the `pyspark_ml_connect` module, so the `pyspark_connect` module won't have `torch` installed.

This fixes https://github.com/apache/spark/actions/runs/5776211318/job/15655104006 in the 3.5 daily GA:

```
Starting test(python3.9): pyspark.ml.tests.connect.test_connect_classification (temp output: /__w/spark/spark/python/target/fbb6a495-df65-4334-8c04-4befc9ee81df/python3.9__pyspark.ml.tests.connect.test_connect_classification__jp1htw6f.log)
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/__w/spark/spark/python/pyspark/ml/tests/connect/test_connect_classification.py", line 21, in <module>
    from pyspark.ml.tests.connect.test_legacy_mode_classification import ClassificationTestsMixin
  File "/__w/spark/spark/python/pyspark/ml/tests/connect/test_legacy_mode_classification.py", line 22, in <module>
    from pyspark.ml.connect.classification import (
  File "/__w/spark/spark/python/pyspark/ml/connect/classification.py", line 46, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```
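
The guard follows the usual optional-dependency skip pattern in the Python test suite; a minimal sketch (the exact decorator, class body, and message used in the PR may differ):

```python
import unittest

# Detect whether torch is importable in the current test environment.
have_torch = True
try:
    import torch  # noqa: F401
except ImportError:
    have_torch = False

@unittest.skipIf(not have_torch, "torch is required for this test")
class ClassificationTestsOnConnect(unittest.TestCase):
    def test_placeholder(self) -> None:
        # The real assertions live in ClassificationTestsMixin; this is illustrative.
        self.assertTrue(have_torch)
```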

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
CI

Closes #42375 from zhengruifeng/torch_skip.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… Encoders.scala

### What changes were proposed in this pull request?
### Why are the changes needed?
It is currently not possible to create a `RowEncoder` using the public API. The internal APIs for this will change in Spark 3.5, which means that library maintainers have to update their code if they use a RowEncoder. To avoid this happening again, we add this method to the public API.

### Does this PR introduce _any_ user-facing change?
Yes. It adds the `row` method to `Encoders`.

### How was this patch tested?
Added tests to connect and sql.

Closes #42366 from hvanhovell/SPARK-44686.

Lead-authored-by: Herman van Hovell <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
…g3.RandomUtils`

### What changes were proposed in this pull request?
In `commons-lang3` 3.13.0, `RandomUtils` has been marked as `Deprecated`; its Javadoc suggests using the `commons-rng` API instead.

https://github.com/apache/commons-lang/blob/bcc10b359318397a4d12dbaef22b101725bc6323/src/main/java/org/apache/commons/lang3/RandomUtils.java#L33

```
 * deprecated Use Apache Commons RNG's optimized <a href="https://commons.apache.org/proper/commons-rng/commons-rng-client-api/apidocs/org/apache/commons/rng/UniformRandomProvider.html">UniformRandomProvider</a>

```

However, since Spark only uses `RandomUtils` in test code, this PR replaces `RandomUtils` with `ThreadLocalRandom` to avoid introducing additional third-party dependencies.

### Why are the changes needed?
Clean up the use of a deprecated API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #42370 from LuciferYang/RandomUtils-2-ThreadLocalRandom.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request?
This PR aims to change exceptions created in package org.apache.spark.serializer to use error class.

### Why are the changes needed?
This is to move exceptions created in package org.apache.spark.serializer to error class.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #42243 from bozhang2820/spark-38475.

Lead-authored-by: Bo Zhang <[email protected]>
Co-authored-by: Bo Zhang <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…to a complex type

### What changes were proposed in this pull request?

Fixes an AssertionError when converting UDTF output to a complex type, by ignoring assertions in `_create_converter_from_pandas` so that Arrow raises an error instead.

### Why are the changes needed?

There is an assertion in `_create_converter_from_pandas`, but it should not be applied in the Python UDTF case.
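
For context, a hypothetical UDTF whose output column has a complex (array) type, the kind of output conversion affected here (assumes an active session named `spark`; not code from this PR):

```python
from pyspark.sql.functions import udtf

@udtf(returnType="v: array<int>")
class RangeListUDTF:
    def eval(self, n: int):
        # Yield one row whose single column is a Python list -> array<int>.
        yield (list(range(n)),)

# Example invocation:
# spark.udtf.register("range_list", RangeListUDTF)
# spark.sql("SELECT * FROM range_list(3)").show()
```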

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added/modified the related tests.

Closes #42310 from ueshin/issues/SPARK-44561/udtf_complex_types.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
…SparkContext` is stopped

### What changes were proposed in this pull request?

This PR is a minor logging change which aims to use an `INFO`-level log instead of `WARN`-level in `ExecutorPodsWatcher.onClose` if the `SparkContext` is stopped. Since Spark can distinguish this expected behavior from error cases, it is better to avoid a WARNING.

### Why are the changes needed?

Previously, we have `WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed` message.
```
23/08/07 18:10:14 INFO SparkContext: SparkContext is stopping with exitCode 0.
23/08/07 18:10:14 WARN TaskSetManager: Lost task 2594.0 in stage 0.0 (TID 2594) ([2620:149:100d:1813::3f86] executor 1615): TaskKilled (another attempt succeeded)
23/08/07 18:10:14 INFO TaskSetManager: task 2594.0 in stage 0.0 (TID 2594) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
23/08/07 18:10:14 INFO SparkUI: Stopped Spark web UI at http://xxx:4040
23/08/07 18:10:14 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
23/08/07 18:10:14 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
23/08/07 18:10:14 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #42381 from dongjoon-hyun/SPARK-44707.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to remove `Int64Index` & `Float64Index` from pandas API on Spark.

### Why are the changes needed?

To match the behavior with pandas 2 and above.

### Does this PR introduce _any_ user-facing change?

Yes, the `Int64Index` & `Float64Index` will be removed.

### How was this patch tested?

Enabling the existing doctests & UTs.

Closes #42267 from itholic/SPARK-43245.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

- Implement basic error translation for spark connect scala client.

### Why are the changes needed?

- Better compatibility with the existing control flow

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`build/sbt "connect-client-jvm/testOnly *Suite"`

Closes #42266 from heyihong/SPARK-44575.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
This PR moves `Triggers.scala` and `Trigger.scala` from `sql/core` to `sql/api`, and it removes the duplicates from the connect scala client.

### Why are the changes needed?
Not really needed, just some deduplication.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #42368 from hvanhovell/SPARK-44692.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
… codegen failure

### What changes were proposed in this pull request?

Materialize passed join columns as an `IndexedSeq` before passing it to the lower layers.

### Why are the changes needed?

When nesting multiple full outer joins using column names which are a `Stream`, the code generator will generate faulty code, resulting in an NPE or a bad `UnsafeRow` access at runtime; see the 2 added test cases, which show an NPE and a bad `UnsafeRow` access.

### Does this PR introduce _any_ user-facing change?
No, only bug fix.

### How was this patch tested?
A reproduction scenario was created and added to the code base.

Closes #41712 from steven-aerts/SPARK-44132-fix.

Authored-by: Steven Aerts <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…Cached of RDD

### What changes were proposed in this pull request?

This PR is a small improvement that increases the number of significant digits for the Fraction Cached of an RDD shown on the Storage tab.

### Why are the changes needed?

improves accuracy and precision

![image](https://github.com/apache/spark/assets/8326978/7106352c-b806-4953-8938-c4cba8ea1191)

### Does this PR introduce _any_ user-facing change?

Yes, the Fraction Cached on Storage Page increases the fractional length from 0 to 2

### How was this patch tested?

locally verified

Closes #42373 from yaooqinn/uiminor.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…em as an API

### What changes were proposed in this pull request?

This PR proposes to (mostly) refactor all the internal workarounds to get the active session correctly.

There are few things to note:

- _PySpark without Spark Connect does not already support the hierarchy of active sessions_. With pinned thread mode (enabled by default), PySpark does map each Python thread to a JVM thread, but the thread creation happens within the gateway server, which does not respect the thread hierarchy. Therefore, this PR follows exactly the same behaviour.
  - A new thread will not have an active session by default.
  - Other behaviours are the same as in PySpark without Connect; see also #42367
- Since I am here, I piggyback a few documentation changes. We missed documenting `SparkSession.readStream`, `SparkSession.streams`, `SparkSession.udtf`, `SparkSession.conf` and `SparkSession.version` in Spark Connect.
- The changes here are mostly refactoring that reuses existing unit tests, while I expose two methods:
  - `SparkSession.getActiveSession` (only for Spark Connect)
  - `SparkSession.active` (for both in PySpark)

### Why are the changes needed?

For Spark Connect users to be able to play with active and default sessions in Python.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new APIs (see the usage sketch after this list):
  - `SparkSession.getActiveSession` (only for Spark Connect)
  - `SparkSession.active` (for both in PySpark)
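
A minimal usage sketch of the two methods above, assuming a reachable Spark Connect endpoint (the URL is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

active = SparkSession.getActiveSession()  # active session of this thread, or None
current = SparkSession.active()           # active session, falling back to the default
```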

### How was this patch tested?

Existing unittests should cover all.

Closes #42371 from HyukjinKwon/SPARK-44694.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
JoshRosen and others added 29 commits August 23, 2023 13:54
…TransportClientFactory.createClient()

### What changes were proposed in this pull request?

#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TransportClientFactory.createClient()` (executed when the connection creation timeout is non-positive). This PR replaces that call with an interruptible `await()` call.

Note that the other pre-existing branches in this method were already using `await()`.

### Why are the changes needed?

Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42619 from JoshRosen/remove-awaitUninterruptibly.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

8ff6b7a#diff-f4df4ce19570230091c3b2432e3c84cd2db7059c7b2a03213d272094bd940454 refactored the antlr4 files to `sql/api` but also checked in `SqlBaseLexer.tokens`.

This file is generated so we do not need to check it in.

### Why are the changes needed?

Remove a file that does not need to be checked in.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing Test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42620 from amaliujia/remove_checked_in_token_file.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This PR refines the docstring of `DataFrame.collect()` function.

### Why are the changes needed?

To improve the documentation of PySpark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doc test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42592 from allisonwang-db/spark-44899-collect-docs.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…r when it is not available

### What changes were proposed in this pull request?

Skip starting the torch distributor log streaming server when it is not available.

In some cases, e.g. in a Databricks Connect cluster, a network limitation causes the log streaming server to fail to start, but this does not need to break the torch distributor training routine.

In this PR, the exception raised from the log server's `start` method is captured, and the server port is set to -1 if `start` failed.
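
An illustrative sketch of the fallback described above; the function and attribute names are made up and are not the actual TorchDistributor internals:

```python
def start_log_server_or_disable(log_server) -> int:
    """Return the log server port, or -1 if the server fails to start."""
    try:
        log_server.start()
        return log_server.port
    except Exception:
        # e.g. network restrictions in a Spark Connect environment: don't fail
        # the whole training job, just disable log streaming.
        return -1
```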

### Why are the changes needed?

In some cases, e.g. in a Databricks Connect cluster, a network limitation causes the log streaming server to fail to start, but this does not need to break the torch distributor training routine.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42606 from WeichenXu123/fix-torch-log-server-in-connect-mode.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
…regenerating files

### What changes were proposed in this pull request?
This PR aims to fix a bug in regenerating the PySpark docs in certain scenarios.

### Why are the changes needed?
- The following error occurred while I was regenerating the pyspark document.
   <img width="1001" alt="image" src="https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e">

- We can simply reproduce this problem as follows:
  1. git reset --hard 3f380b9
     <img width="1416" alt="image" src="https://github.com/apache/spark/assets/15246973/5ab9c8fc-5835-4ced-8d92-9d5e020b262a">
  2. make clean html (at this point, it is successful.)
     <img width="1000" alt="image" src="https://github.com/apache/spark/assets/15246973/5c3ce07f-cbe8-4177-ae22-b16c3fc62e01">
  3. git pull (at this point, the function `chr` has been deleted, but the previously generated file (`pyspark.sql.functions.chr.rst`) will not be deleted.)
  4. make clean html (at this point, it fails.)
     <img width="1001" alt="image" src="https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1. Pass GA.
2. Manually tested.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42622 from panbingkun/SPARK-44923.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…ality

### What changes were proposed in this pull request?

Fix cross validator foldCol param functionality.
In the main branch the code calls `df.rdd` APIs, which are not supported in Spark Connect.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42605 from WeichenXu123/fix-tuning-connect-foldCol.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
### What changes were proposed in this pull request?
Upgrade Apache ivy from 2.5.1 to 2.5.2

[Release notes](https://lists.apache.org/thread/9gcz4xrsn8c7o9gb377xfzvkb8jltffr)

### Why are the changes needed?
[CVE-2022-46751](https://www.cve.org/CVERecord?id=CVE-2022-46751)

The fix apache/ant-ivy@2be17bc
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42613 from bjornjorgensen/ivy-2.5.2.

Authored-by: Bjørn Jørgensen <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…ueries

### What changes were proposed in this pull request?

Support window functions in correlated scalar subqueries.
Support for IN/EXISTS subqueries will come after we migrate them into the DecorrelateInnerQuery framework. In addition, correlation is not yet supported inside the window function itself.

### Why are the changes needed?

Supports more subqueries.

### Does this PR introduce _any_ user-facing change?

Yes, users can run more subqueries now.

### How was this patch tested?

Unit and query tests. The results of query tests were checked against Postgresql.

Closes #42383 from agubichev/SPARK-44549-corr-window.

Authored-by: Andrey Gubichev <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
remove resolved todo items

### Why are the changes needed?
Code cleanup

### Does this PR introduce _any_ user-facing change?
no, dev-only

### How was this patch tested?
manually check

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #42626 from zhengruifeng/py_code_clean.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…d.planRequest

### What changes were proposed in this pull request?
Add `JsonIgnore` to `SparkListenerConnectOperationStarted.planRequest`

### Why are the changes needed?
`SparkListenerConnectOperationStarted` was added as part of [SPARK-43923](https://issues.apache.org/jira/browse/SPARK-43923).
`SparkListenerConnectOperationStarted.planRequest` cannot be serialized to and deserialized from JSON, as it contains recursive objects, which causes failures when attempting these operations.
```
com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Direct self-reference leading to cycle (through reference chain: org.apache.spark.sql.connect.service.SparkListenerConnectOperationStarted["planRequest"]->org.apache.spark.connect.proto.ExecutePlanRequest["unknownFields"]->grpc_shaded.com.google.protobuf.UnknownFieldSet["defaultInstanceForType"])
	at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77)
	at com.fasterxml.jackson.databind.SerializerProvider.reportBadDefinition(SerializerProvider.java:1308)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit

Closes #42550 from jdesjean/SPARK-44861.

Authored-by: jdesjean <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
### What changes were proposed in this pull request?

Improve the error messaging on the connect client when using
a UDF whose corresponding class has not been sync'ed with the
spark connect service.

Prior to this change, the client receives a cryptic error:

```
Exception in thread "main" org.apache.spark.SparkException: Main$
```

With this change, the message is improved to be:

```
Exception in thread "main" org.apache.spark.SparkException: Failed to load class: Main$. Make sure the artifact where the class is defined is installed by calling session.addArtifact.
```

### Why are the changes needed?

This change makes it clear to the user on what the error is.

### Does this PR introduce _any_ user-facing change?

Yes. The error message is improved. See details above.

### How was this patch tested?

Manually by running a connect server and client.

Closes #42500 from nija-at/improve-error.

Authored-by: Niranjan Jayakar <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
… substitute annotations instead of feature steps

### What changes were proposed in this pull request?

Move the `Utils.substituteAppNExecIds` logic into `KubernetesConf.annotations` as the default logic.

### Why are the changes needed?

This makes the logic easy for users to reuse, rather than rewriting it again.

Before this PR, when users wrote a custom feature step using annotations, they had to call `Utils.substituteAppNExecIds` themselves.

### Does this PR introduce _any_ user-facing change?

Yes, but there is no change in how users use annotations.

### How was this patch tested?

Add unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42600 from zwangsheng/SPARK-44906.

Lead-authored-by: zwangsheng <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… in Docker images if exists

### What changes were proposed in this pull request?

This PR aims to fix `RELEASE` file to have the correct information in Docker images if `RELEASE` file exists.

Please note that the `RELEASE` file doesn't exist in the SPARK_HOME directory when we run the K8s integration test from the Spark Git repository. So, we keep the following empty `RELEASE` file generation and use `COPY` conditionally via glob syntax.

https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37

### Why are the changes needed?

Currently, it's an empty file in the official Apache Spark Docker images.

```
$ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
-rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE

$ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1
-rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually build image and check it with `docker run -it --rm NEW_IMAGE ls -al /opt/spark/RELEASE`

I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution and tested in the following way.
```
$ cd spark-3.5.0-rc2-bin-hadoop3

$ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile

$ bin/docker-image-tool.sh -t SPARK-44935 build

$ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al /opt/spark/RELEASE | tail -n1
-rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE

$ docker run -it --rm docker.io/library/spark:SPARK-44935 cat /opt/spark/RELEASE | tail -n2
Spark 3.5.0 (git revision 010c4a6) built for Hadoop 3.3.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-3 -Phive -Phive-thriftserver
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42636 from dongjoon-hyun/SPARK-44935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ring creation

### What changes were proposed in this pull request?

`SparkSession.Builder` now applies configuration options to the created `SparkSession`.

### Why are the changes needed?

It is reasonable to expect the PySpark Connect `SparkSession.Builder` to behave in the same way as other `SparkSession.Builder`s in Spark Connect. The `SparkSession.Builder` should apply the provided configuration options to the created `SparkSession`.
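
A minimal sketch of the expected behavior, with an illustrative endpoint URL and config value:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://localhost")
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)
# Options given to the builder are now visible on the created session.
assert spark.conf.get("spark.sql.shuffle.partitions") == "4"
```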

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests were added to verify that configuration options were applied to the `SparkSession`.

Closes #42548 from michaelzhan-db/SPARK-44750.

Lead-authored-by: Michael Zhang <[email protected]>
Co-authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…lumn name

### What changes were proposed in this pull request?
make `df['col_name']` validate the column name

### Why are the changes needed?
for parity

### Does this PR introduce _any_ user-facing change?
yes

before

```
In [1]: df = spark.range(0, 10)

In [2]: df["bad_key"]
Out[2]: Column<'bad_key'>

```

after

```
In [1]: df = spark.range(0, 10)

In [2]: df["bad_key"]
23/08/23 17:23:35 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: 59de3f10-14b6-4239-be85-4156da43d495.
org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `bad_key` cannot be resolved. Did you mean one of the following? [`id`].;
'Project ['bad_key]
+- Range (0, 10, step=1, splits=Some(12))

...

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `bad_key` cannot be resolved. Did you mean one of the following? [`id`].;
'Project ['bad_key]
+- Range (0, 10, step=1, splits=Some(12))
```

### How was this patch tested?
enabled UT

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #42608 from zhengruifeng/tests_enable_test_access_column.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
This PR refines the docstring of `approx_count_distinct` and adds some new examples.
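
A small usage example of the function in question (illustrative only, not the new docstring text; assumes an active session named `spark`):

```python
from pyspark.sql import functions as F

df = spark.range(100).withColumn("mod", F.col("id") % 10)
df.agg(F.approx_count_distinct("mod").alias("distinct_mod")).show()
```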

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42596 from LuciferYang/approx-pydoc.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
Fix typos in pyspark_upgrade.rst

### Why are the changes needed?
Correct English makes it easier to read and understand.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42631 from bjornjorgensen/typo-in-pyspark_upgrade.

Authored-by: Bjørn Jørgensen <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…memory limit

### What changes were proposed in this pull request?

This PR aims to simplify the log when Spark HybridStore hits the memory limit.

### Why are the changes needed?

`HistoryServerMemoryManager.lease` throws `RuntimeException`s frequently when the current usage is high.

https://github.com/apache/spark/blob/d382c6b3aef28bde6adcdf62b7be565ff1152942/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerMemoryManager.scala#L52-L55

In this case, although Apache Spark shows the `RuntimeException` as an `INFO`-level log, HybridStore works fine by falling back to the disk store. So, there is no need to surprise users with a `RuntimeException` in the log. After this PR, we provide a simpler message that includes all the information but omits the stack trace and the `RuntimeException`.

**BEFORE**
```
23/08/23 22:40:34 INFO FsHistoryProvider: Failed to create HybridStore for spark-xxx/None. Using ROCKSDB.
java.lang.RuntimeException: Not enough memory to create hybrid store for app spark-xxx / None.
	at org.apache.spark.deploy.history.HistoryServerMemoryManager.lease(HistoryServerMemoryManager.scala:54)
	at org.apache.spark.deploy.history.FsHistoryProvider.createHybridStore(FsHistoryProvider.scala:1256)
	at org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1231)
	at org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:342)
	at org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199)
	at org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
	at org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134)
	at org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
	at org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55)
	at org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51)
	at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88)
	at org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100)
	at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:256)
	at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:104)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
	at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
	at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
	at org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
	at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
	at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
	at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
	at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.sparkproject.jetty.server.Server.handle(Server.java:516)
	at org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:833)
```

**AFTER**
```
23/08/23 15:49:45 INFO FsHistoryProvider: Failed to create HybridStore for spark-xxx/None. Using ROCKSDB. Not enough memory to create hybrid store for app spark-xxx / None.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

```
spark.history.fs.logDirectory YOUR_HISTORY_DIR
spark.history.store.path /tmp/rocksdb
spark.history.store.hybridStore.enabled true
spark.history.store.hybridStore.maxMemoryUsage 0g
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42638 from dongjoon-hyun/SPARK-44936.

Lead-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…'F' in pyspark.sql import functions

### What changes were proposed in this pull request?

This PR proposes using `sf` instead of `F` as the alias for `pyspark.sql.functions` in public documentation:

```python
from pyspark.sql import functions as sf
```
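
For illustration, a hedged example of how the recommended alias reads in a typical query (the DataFrame and column names here are hypothetical, not taken from the documentation):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()

# The lowercase alias keeps expressions readable and PEP 8-friendly.
df = spark.range(10).withColumn(
    "parity", sf.when(sf.col("id") % 2 == 0, "even").otherwise("odd")
)
df.groupBy("parity").agg(sf.count("id").alias("n")).show()
```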

This PR does not change internal or test code, as doing so would be too invasive and might easily cause conflicts.

### Why are the changes needed?

```python
from pyspark.sql import functions as F
```

isn’t very Pythonic - it does not follow PEP 8, see [Package and Module Names](https://peps.python.org/pep-0008/#package-and-module-names).

> Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves
> readability. Python packages should also have short, all-lowercase names, although the use of underscores
> is discouraged.

Therefore, the module's alias should follow this convention. In practice, uppercase names are only used for module/package-level constants in my experience; see also [Constants](https://peps.python.org/pep-0008/#constants).

See also [this stackoverflow comment](https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).

### Does this PR introduce _any_ user-facing change?

Yes, it changes the documentation, so users will see `from pyspark.sql import functions as sf` instead of `functions as F` in examples.

### How was this patch tested?

Manually checked.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42628 from HyukjinKwon/SPARK-44928.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…ests

### What changes were proposed in this pull request?

This PR sets a character-length limit on error messages and a stack-depth limit on error stack traces for the console appender in tests.

The original patterns are

- `%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex`
- `%t: %m%n%ex`

And they're adjusted to the new consistent pattern

- `%d{HH:mm:ss.SSS} %p %c: %maxLen{%m}{512}%n%ex{8}%n`

### Why are the changes needed?

In testing, intentional and unintentional failures are created to generate extensive log volumes. For instance, a single FileNotFound error may be logged multiple times in the writer, task runner, task set manager, and other areas, resulting in thousands of lines per failure.

For example, tests in `ParquetRebaseDatetimeSuite` are run with the V1 and V2 datasources, two or more specific values, and multiple configuration pairs, and I have seen the `SparkUpgradeException` all over the CI logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

```
build/sbt "sql/testOnly *ParquetRebaseDatetimeV1Suite"
```

```
15:59:55.446 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Job job_202308230059551630377040190578321_1301 aborted.
15:59:55.446 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1301.0 (TID 1595)
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.446 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
	at org.apache.spark.sql.execution.datasources....
15:59:55.446 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 1301.0 failed 1 times; aborting job
15:59:55.447 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 0ead031e-c9dd-446b-b20b-c76ec54978b1.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1301.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.579 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1303.0 (TID 1597)
```

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #42627 from yaooqinn/SPARK-44929.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This PR aims to update `SystemRequirements` to support Java 21 in SparkR.

### Why are the changes needed?

To support Java 21 officially in SparkR. We've been running SparkR CI on master branch.
- https://github.com/apache/spark/actions/runs/5946839640/job/16128043220

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

On Java 21, do the following.
```
$ build/sbt test:package -Psparkr -Phive

$ R/install-dev.sh; R/run-tests.sh
...
Status: 2 NOTEs
See
  ‘/Users/dongjoon/APACHE/spark-merge/R/SparkR.Rcheck/00check.log’
for details.

+ popd
Tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42645 from dongjoon-hyun/SPARK-44939.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Correct a function alias

### Why are the changes needed?
It should be `sign`.

### Does this PR introduce _any_ user-facing change?
Actually no, since `pyspark.sql.connect.functions` shares the same namespace as `pyspark.sql.functions`.

Also manually checked (before this PR):
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0.dev0
      /_/

Using Python version 3.10.11 (main, May 17 2023 14:30:36)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]: from pyspark.sql import functions as sf

In [2]: sf.sign
Out[2]: <function pyspark.sql.functions.signum(col: 'ColumnOrName') -> pyspark.sql.column.Column>

In [3]: sf.sigh
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 sf.sigh

AttributeError: module 'pyspark.sql.functions' has no attribute 'sigh'
```
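
As a quick sanity check (a hypothetical snippet, not part of this PR), both names should produce the same results since `sign` is an alias of `signum`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(-5.0,), (0.0,), (3.2,)], ["x"])

# `sign` and `signum` should be interchangeable.
df.select(sf.sign("x").alias("sign"), sf.signum("x").alias("signum")).show()
```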

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #42642 from zhengruifeng/spark_43943_followup.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade Arrow from 12.0.1 to 13.0.0.

### Why are the changes needed?
1. Arrow 12.0.1 vs. 13.0.0
apache/arrow@apache-arrow-12.0.1...apache-arrow-13.0.0

2. Arrow 13.0.0 is the first version to support Java 21.
    - apache/arrow#36370

When Arrow 12.0.1 runs on Java 21, the following error occurs:
```
java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
       org.apache.arrow.memory.util.MemoryUtil.directBuffer(MemoryUtil.java:167)
       org.apache.arrow.memory.ArrowBuf.getDirectBuffer(ArrowBuf.java:228)
       org.apache.arrow.memory.ArrowBuf.nioBuffer(ArrowBuf.java:223)
       org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:87)
       org.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:727)
       org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:67)
       org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:145)
```
After this PR, we can try to enable Netty-related testing on Java 21.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.

Closes #42181 from panbingkun/SPARK-44563.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to re-enable `test_sparkSQL_arrow.R` in Java 21.
This depends on #42181 .

### Why are the changes needed?

To have Java 21 test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ java -version
openjdk version "21" 2023-09-19
OpenJDK Runtime Environment (build 21+35-2513)
OpenJDK 64-Bit Server VM (build 21+35-2513, mixed mode, sharing)

$ build/sbt test:package -Psparkr -Phive

$ R/install-dev.sh; R/run-tests.sh
...
sparkSQL:
SparkSQL functions: ..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
...........................................................................................................................................................................................................................................................................................................................................................................................................
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42644 from dongjoon-hyun/SPARK-44127.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ql.execution.arrow tests in Java 21

### What changes were proposed in this pull request?

This PR aims to re-enable `PandasUDF` and `o.a.s.sql.execution.arrow` tests in Java 21.
This depends on #42181 .

### Why are the changes needed?

To have Java 21 test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run the following on Java 21.
```
$ java -version
openjdk version "21" 2023-09-19
OpenJDK Runtime Environment (build 21+35-2513)
OpenJDK 64-Bit Server VM (build 21+35-2513, mixed mode, sharing)

$ build/sbt "sql/testOnly *.ArrowConvertersSuite"
...
[info] Run completed in 5 seconds, 316 milliseconds.
[info] Total number of tests run: 30
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

$ build/sbt "sql/testOnly *.SQLQueryTestSuite"
...
[info] Run completed in 12 minutes, 4 seconds.
[info] Total number of tests run: 629
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 629, failed 0, canceled 0, ignored 2, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42641 from dongjoon-hyun/SPARK-44097.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…21 after the new arrow version release

### What changes were proposed in this pull request?
This PR aims to re-enable PySpark in Java 21.
This depends on #42181 .

### Why are the changes needed?
To have Java 21 test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42646 from panbingkun/SPARK-44302.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…va 21

### What changes were proposed in this pull request?

This PR aims to re-enable Arrow-based connect tests in Java 21.
This depends on #42181.

### Why are the changes needed?

To have Java 21 test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ java -version
openjdk version "21-ea" 2023-09-19
OpenJDK Runtime Environment (build 21-ea+32-2482)
OpenJDK 64-Bit Server VM (build 21-ea+32-2482, mixed mode, sharing)

$ build/sbt "connect/test" -Phive
...
[info] Run completed in 14 seconds, 136 milliseconds.
[info] Total number of tests run: 858
[info] Suites: completed 20, aborted 0
[info] Tests: succeeded 858, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 44 s, completed Aug 23, 2023, 9:42:53 PM

$ build/sbt "connect-client-jvm/test" -Phive
...
[info] Run completed in 1 minute, 24 seconds.
[info] Total number of tests run: 1220
[info] Suites: completed 24, aborted 0
[info] Tests: succeeded 1220, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 1222, Failed 0, Errors 0, Passed 1222
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42643 from dongjoon-hyun/SPARK-44121.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ark UI

### What changes were proposed in this pull request?

This PR adds a button to download the thread dump as a .txt file that follows jstack formatting.

### Why are the changes needed?

The raw jstack format is relatively easy to read and can be analyzed with various thread-dump tools.

### Does this PR introduce _any_ user-facing change?

![image](https://github.com/apache/spark/assets/8326978/86c8e87d-970d-4ddb-967e-20c1d3534d42)

### How was this patch tested?

#### Raw Dump File
[driver.txt](https://github.com/apache/spark/files/12388302/driver.txt)

#### Reporting by external tools

https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjMvMDgvMjAvZHJpdmVyLnR4dC0tMTMtNTMtMjI=&

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #42575 from yaooqinn/SPARK-44863.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

Add several new test cases for streaming foreachBatch and streaming query listener events to test various scenarios.
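
For context, a minimal sketch of the two APIs these tests exercise, with a hypothetical listener and sink (not the actual test code added by this PR):

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

# foreachBatch: run arbitrary batch logic against every micro-batch.
def process_batch(batch_df, batch_id):
    batch_df.write.format("noop").mode("append").save()

# Streaming query listener: observe lifecycle and progress events.
class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("started:", event.id)

    def onQueryProgress(self, event):
        print("input rows:", event.progress.numInputRows)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print("terminated:", event.id)

spark.streams.addListener(MyListener())

query = (
    spark.readStream.format("rate").load()
    .writeStream.foreachBatch(process_batch)
    .start()
)
query.awaitTermination(10)  # let a few micro-batches run
query.stop()
```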

### Why are the changes needed?

More tests are better

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test-only change

Closes #42521 from WweiL/SPARK-44435-tests-foreachBatch-listener.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@Benyuel Benyuel merged commit c4ee1bb into Benyuel:master Aug 24, 2023