Conversation

@tdas (Contributor) commented Nov 19, 2025

🥞 Stacked PR



Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Establishes foundational Unity Catalog integration for Delta Lake with embedded UC server lifecycle management and comprehensive test framework.

Adds a new sparkUnityCatalog SBT module with Unity Catalog 0.3.0 dependencies, a UnityCatalogSupport trait that manages the UC server lifecycle in tests, and a UnityCatalogSupportSuite with 4 integration tests validating UC-Delta connectivity and table operations. Uses the shaded UC server JAR to avoid dependency conflicts. Compatible with Spark 4.0. Also includes repository setup (.gitignore updates).

How was this patch tested?

  • Added UnityCatalogSupportSuite with 4 comprehensive integration tests
  • Tests validate UC server startup, catalog creation, and basic table operations
  • All tests pass with Unity Catalog 0.3.0 and Spark 4.0
  • Validates end-to-end UC server connectivity and Delta table registration

Does this PR introduce any user-facing changes?

No. This PR only adds test infrastructure for Unity Catalog integration. No user-facing functionality is changed.

@tdas changed the title from "spark-uc module" to "Unity Catalog integration foundation and repository setup for Delta Lake" on Nov 19, 2025
@tdas changed the title from "Unity Catalog integration foundation and repository setup for Delta Lake" to "[Spark] Unity Catalog integration foundation and repository setup for Delta Lake" on Nov 19, 2025
tdas added 3 commits November 18, 2025 20:47
- Moved UnityCatalogSupport.scala and UnityCatalogSupportSuite.scala to com.sparkuctest package
- Updated package declarations from org.apache.spark.sql.delta to com.sparkuctest
- All 4 tests pass successfully with Spark 4.0
@tdas force-pushed the stack/spark-uc-pr1-foundation branch from 8d685a6 to 1b7fe04 on November 19, 2025 01:48
Comment on lines 164 to 167
// scalastyle:off println
println(s"Unity Catalog server started and ready at $unityCatalogUri")
println(s"Created catalog '$unityCatalogName' with schema 'default'")
// scalastyle:on println
@tdas (Contributor, Author) Nov 19, 2025

Make all of these prints into log lines.
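
A minimal sketch of what that could look like, assuming SLF4J (which Spark already ships) is on the test classpath:

import org.slf4j.LoggerFactory

// Route the startup messages through a logger instead of println.
private val logger = LoggerFactory.getLogger(getClass)

logger.info(s"Unity Catalog server started and ready at $unityCatalogUri")
logger.info(s"Created catalog '$unityCatalogName' with schema 'default'")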

Comment on lines 138 to 139
Thread.sleep(5000)

@tdas (Contributor, Author)

Instead of sleeping for a fixed amount of time, can we poll to check when the server is ready?
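
A sketch of such a readiness poll, assuming the server answers plain HTTP on its root URI once it is up:

import java.net.{HttpURLConnection, URL}

// Poll until the UC server accepts connections, instead of a fixed sleep.
private def waitForServerReady(uri: String, timeoutMs: Long = 30000, intervalMs: Long = 500): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    try {
      val conn = new URL(uri).openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(1000)
      conn.setReadTimeout(1000)
      // Any HTTP response code means the server is listening.
      conn.getResponseCode
      conn.disconnect()
      return
    } catch {
      case _: java.io.IOException => Thread.sleep(intervalMs)
    }
  }
  throw new IllegalStateException(s"UC server at $uri not ready after ${timeoutMs}ms")
}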

Comment on lines 58 to 66
private def createUCClient(): ApiClient = {
val client = new ApiClient()
// Extract port from unityCatalogUri
val port = unityCatalogUri.split(":")(2).toInt
client.setScheme("http")
client.setHost("localhost")
client.setPort(port)
client
}
@tdas (Contributor, Author) Nov 19, 2025

Can the UnityCatalogSupport trait provide the client?
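
For example (a sketch mirroring the createUnityCatalogClient helper the trait grows later in this review, using the trait's own ucPort field instead of re-parsing the URI):

// In UnityCatalogSupport: one shared, correctly configured client factory.
def createUnityCatalogClient(): ApiClient = {
  val client = new ApiClient()
  client.setScheme("http")
  client.setHost("localhost")
  client.setPort(ucPort)  // no URI string parsing needed
  client
}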

// Standard test dependencies
"org.scalatest" %% "scalatest" % scalaTestVersion % "test",

// Unity Catalog dependencies - exclude Jackson to use Spark's Jackson 2.15.x
@tdas (Contributor, Author)

Remove the Jackson version override.

Contributor

Maybe we need to change unitycatalog's Jackson version, so that it aligns with Spark's.
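
One possible shape for this in build.sbt (a sketch only; the io.unitycatalog artifact coordinates below are assumptions, not verified names):

libraryDependencies ++= Seq(
  // Drop UC's bundled Jackson so Spark's Jackson 2.15.x wins on the test classpath.
  ("io.unitycatalog" % "unitycatalog-server" % unityCatalogVersion % "test")
    .excludeAll(ExclusionRule(organization = "com.fasterxml.jackson.core"))
)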

@huan233usc (Collaborator) left a comment

Looking good! Just some questions on server setup and formatting.

@@ -0,0 +1,258 @@
/*
* Copyright (2021) The Delta Lake Project Authors.
Collaborator

nit: 2025

* limitations under the License.
*/

package com.sparkuctest
Collaborator

Is org.apache.spark.sql.delta.test.unity a better name?

@tdas (Contributor, Author)

I am intentionally keeping it out of the Delta package so that we don't accidentally depend on anything internal to Delta or UC. This is closer to how any user/app would run.

Contributor

But it's not a com package, right? Maybe we can use io.unitycatalog.ittest?

Contributor

BTW, should we write the tests in Scala or Java? Would Java make these tests easier to maintain?

ucTempDir.deleteOnExit()

// Find an available port
ucPort = findAvailablePort()
Collaborator

nit: There is a small potential for a port race if the port is taken between here and line 147. Should we include this logic in the retry? (I think it might be OK to keep things as is, since this test is not parallelizable.)
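
A sketch of folding the bind into the retry; startServer(port) is a hypothetical helper assumed to throw BindException if the port was taken in the meantime:

// Re-pick the port on every attempt, so a lost race just triggers a retry.
private def startWithRetry(maxAttempts: Int = 3): Int = {
  var lastError: Throwable = null
  var attempt = 0
  while (attempt < maxAttempts) {
    val port = findAvailablePort()
    try {
      startServer(port)  // hypothetical: throws java.net.BindException on a taken port
      return port
    } catch {
      case e: java.net.BindException =>
        lastError = e
        attempt += 1
    }
  }
  throw new IllegalStateException(s"Could not bind UC server after $maxAttempts attempts", lastError)
}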

val response: ListTablesResponse = tablesApi.listTables(catalogName, schemaName, null, null)

import scala.jdk.CollectionConverters._
if (response.getTables != null) {
Collaborator

The null check here seems redundant. If there are no tables, won't we just get an empty list?

@tdas (Contributor, Author)

This handles the case where a server returns the corresponding field in the response JSON as null (i.e., the field is missing) instead of an empty [].
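
An equivalent, slightly more idiomatic way to keep that defensive behavior:

import scala.jdk.CollectionConverters._

// Treat a null/missing "tables" field the same as an empty list.
val tables = Option(response.getTables).map(_.asScala.toSeq).getOrElse(Seq.empty)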

* Creates a Unity Catalog API client configured for this server.
*/
def createUnityCatalogClient(): ApiClient = {
val port = unityCatalogUri.split(":")(2).toInt
Collaborator

Should we directly use ucPort?

@tdas (Contributor, Author)

haaah. stupid claude.

// 3. Verify we can query UC server directly via SDK
val ucTables = listTables(unityCatalogName, "default")
// Should succeed even if empty - this confirms UC server is responding
assert(ucTables != null, "Should be able to query UC server via SDK")
Collaborator

This assertion should always pass, since listTables never returns null; should we remove it?

""")

// If we got here, the catalog is working
val tables = spark.sql(s"SHOW TABLES IN $unityCatalogName.default")
Collaborator

Does checkAnswer work? https://github.com/apache/spark/blob/61668ad02671af0a80ec6d91ddaf02e6b3f042e6/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala#L137

checkAnswer(
    spark.sql(s"SHOW TABLES IN $unityCatalogName.default")
        .select("tableName")
        .filter($"tableName" === "test_verify_catalog"),
    Row("test_verify_catalog") :: Nil
)

@tdas (Contributor, Author)

Seems like more code than this version :)

Comment on lines +108 to +109
val result = spark.sql(s"SELECT * FROM $testTable ORDER BY id").collect()
assert(result.length == 3, s"Should have 3 rows, got ${result.length}")
@tdas (Contributor, Author)

This should use checkAnswer.
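
A possible rewrite of the assertion above (a sketch; the expected Row values are placeholders, since the inserted data isn't shown in this hunk):

// checkAnswer compares full contents, not just the row count.
checkAnswer(
  spark.sql(s"SELECT * FROM $testTable ORDER BY id"),
  Row(1, "a") :: Row(2, "b") :: Row(3, "c") :: Nil
)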

@openinx (Contributor) left a comment

Thanks for the work; I just left several comments. This is great work to get started on the unitycatalog integration testing.

kernel/kernel-benchmarks/benchmark_report.json

# Unity Catalog test artifacts
spark/unitycatalog/etc/
Contributor

Which tests will generate those artifacts?



val unityCatalogVersion = "0.3.0"
val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
Contributor

Unrelated: I think we may need to relocate (shade) Jackson in the oss-unitycatalog Spark connector, since it is such a general client-side JAR that will run under different runtime environments. Common JARs like this can easily conflict with JARs introduced by other projects.

I filed an issue against oss-unitycatalog about this before: unitycatalog/unitycatalog#1141

@tdas (Contributor, Author)

I agree. Jackson was a massive pain; we need to fix it.


lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
.dependsOn(spark % "compile->compile;test->test;provided->provided")
.disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
Contributor

Why disable the Java and Scala formatter plugins? Seems weird.

// This is a test-only module - no production sources
Compile / sources := Seq.empty,

Test / javaOptions ++= Seq("-ea"),
Contributor

It seems uncommon to use -ea (or -enableassertions) for testing. If we all use org.assertj, then maybe we can remove this option; do I understand correctly?


),
assembly / assemblyMergeStrategy := {
// Discard `module-info.class` to fix the `different file contents found` error.
// TODO Upgrade SBT to 1.5 which will do this automatically
Contributor

nit: Is this related to this PR?

Comment on lines +158 to +161
val testClient = new ApiClient()
testClient.setScheme("http")
testClient.setHost("localhost")
testClient.setPort(ucPort)
Contributor

Why not use createUnityCatalogClient directly?


test("UnityCatalogSupport trait starts UC server and configures Spark correctly") {
// 1. Verify UC server is accessible via URI
assert(unityCatalogUri.startsWith("http://localhost:"),
Contributor

How will it work if we want to run these same tests against a different UC server setup?

@tdas (Contributor, Author)

We can make this an env variable. If that env variable is set, we use its URL instead of starting a new UC OSS server.
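
A sketch of that override; UC_SERVER_URI is a hypothetical variable name, and startEmbeddedUcServer() stands in for the trait's existing embedded-server startup:

// If UC_SERVER_URI is set, point the tests at that server;
// otherwise fall back to starting the embedded UC server.
val unityCatalogUri: String = sys.env.getOrElse("UC_SERVER_URI", startEmbeddedUcServer())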
