
Commit d24c718

wangyum and dongjoon-hyun authored and committed
[SPARK-52574][SQL][TESTS] Ensure compression codec is correctly applied in Hive tables and dirs
### What changes were proposed in this pull request?

This PR adds a test to verify that the compression codec specified in `spark.sql.parquet.compression.codec` is correctly applied to Hive tables and directories during table creation and data insertion.

### Why are the changes needed?

We add the compression codec through `setupHadoopConfForCompression` and keep it in the existing `hadoopConf`: https://github.com/apache/spark/blob/3e0808c33f185c13808ce2d547ce9ba0057d31a6/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/V1WritesHiveUtils.scala#L114-L141

However, we might accidentally use `sparkSession.sessionState.newHadoopConf()` instead of that `hadoopConf`, which causes the compression codec to be lost. This test makes sure that does not happen.

| Keep compression codec | Lost compression codec |
| -- | -- |
| ![screenshot](https://github.com/user-attachments/assets/67c5c3c6-a265-48c8-a77b-5d2db71af1a0) | ![screenshot](https://github.com/user-attachments/assets/71dcab6f-4237-4515-b942-e3531fb08899) |

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51279 from wangyum/SPARK-52574.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
1 parent e1d1302 · commit d24c718
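The pitfall described in the commit message can be sketched in a few lines. This is a minimal illustration, not Spark's actual code: `CompressionConfPitfall` and `setupCompression` are hypothetical stand-ins for `setupHadoopConfForCompression`; only `Configuration` and the `parquet.compression` key (the Hadoop property Parquet's output format reads) are real.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical sketch of the bug class SPARK-52574's test guards against.
object CompressionConfPitfall {
  // Stand-in for setupHadoopConfForCompression: the codec is recorded by
  // mutating the conf that was passed in.
  def setupCompression(hadoopConf: Configuration): Unit =
    hadoopConf.set("parquet.compression", "snappy")

  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()
    setupCompression(hadoopConf)

    // Correct: the write path keeps threading the SAME conf, so the codec survives.
    assert(hadoopConf.get("parquet.compression") == "snappy")

    // Buggy: building a fresh conf (the moral equivalent of calling
    // sparkSession.sessionState.newHadoopConf() again) knows nothing about
    // the codec, so the writer silently falls back to its default.
    val freshConf = new Configuration()
    assert(freshConf.get("parquet.compression") == null)
  }
}
```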

File tree

1 file changed: +42 −0 lines changed

sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala

Lines changed: 42 additions & 0 deletions
```diff
@@ -17,10 +17,15 @@
 
 package org.apache.spark.sql.hive
 
+import java.io.File
 import java.time.{Duration, Period}
 import java.time.temporal.ChronoUnit
 
+import org.apache.hadoop.fs.Path
+import org.apache.parquet.hadoop.ParquetFileReader
+
 import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
+import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.execution.datasources.parquet.{ParquetCompressionCodec, ParquetTest}
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.internal.SQLConf
@@ -179,4 +184,41 @@ class HiveParquetSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-52574: Ensure compression codec is correctly applied in Hive tables and dirs") {
+    withSQLConf(
+      HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false",
+      HiveUtils.CONVERT_METASTORE_INSERT_DIR.key -> "false",
+      SQLConf.PARQUET_COMPRESSION.key -> ParquetCompressionCodec.SNAPPY.lowerCaseName()) {
+      withTable("tbl") {
+        sql("CREATE TABLE tbl(id int) STORED AS PARQUET")
+        sql("INSERT INTO tbl SELECT id AS part FROM range(10)")
+        val tblMata = spark.sessionState.catalog.getTableMetadata(TableIdentifier("tbl"))
+        checkCompressionCodec(new File(tblMata.storage.locationUri.get))
+      }
+
+      withTempPath { dir =>
+        sql(
+          s"""
+             |INSERT OVERWRITE LOCAL DIRECTORY '${dir.getCanonicalPath}'
+             |STORED AS parquet
+             |SELECT id FROM range(10)
+             |""".stripMargin)
+        checkCompressionCodec(dir)
+      }
+    }
+
+    def checkCompressionCodec(dir: File): Unit = {
+      val parquetFiles = dir.listFiles().filter(_.getName.startsWith("part-"))
+      assert(parquetFiles.nonEmpty, "No Parquet files found")
+
+      val conf = spark.sessionState.newHadoopConf()
+      val file = parquetFiles.head
+      val footer = ParquetFileReader.readFooter(conf, new Path(file.getAbsolutePath))
+
+      val codec = footer.getBlocks.get(0).getColumns.get(0).getCodec.name()
+      assert(codec.equalsIgnoreCase(ParquetCompressionCodec.SNAPPY.lowerCaseName()),
+        s"Expected ${ParquetCompressionCodec.SNAPPY.lowerCaseName()} compression but found $codec")
+    }
+  }
 }
```
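A side note on the footer check in `checkCompressionCodec` above: `ParquetFileReader.readFooter(conf, path)` has been deprecated in recent Parquet releases. Below is a sketch of the same check through the `InputFile`-based API, assuming the parquet-hadoop and hadoop-common dependencies the suite already pulls in; `readCodecName` is an illustrative helper, not part of the patch.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Reads the codec of the first column chunk of the first row group, like the
// test above, but via the non-deprecated ParquetFileReader.open API.
def readCodecName(file: String, conf: Configuration): String = {
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(file), conf))
  try {
    reader.getFooter.getBlocks.get(0).getColumns.get(0).getCodec.name()
  } finally {
    reader.close()
  }
}
```

Either way, inspecting only the first column chunk suffices here, since the writer applies a single codec, taken from the job's `hadoopConf`, to every column chunk it writes.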
