[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV #29516
```diff
@@ -220,7 +220,9 @@ class CSVOptions(
     format.setQuote(quote)
     format.setQuoteEscape(escape)
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
-    format.setComment(comment)
+    if (isCommentSet) {
+      format.setComment(comment)
+    }
     lineSeparatorInWrite.foreach(format.setLineSeparator)

     writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
```

Review comments on this change:

> Arguably we should rework the handling of 'optional' configs to not use this default of `\u0000`.

> If we change it that way, it might impact existing users for whom `\u0000` is a comment character by default, so I would say a separate optional config is a better solution. What I am saying here is that we need to wait for univocity 3.0.0, where the new changes will be available; then we can add the Spark changes in a proper manner.

> You are correct, but this has never been a valid comment character, and the flip side is the bug you describe: it's always a comment character. I think it's reasonable to fix this as a bug. I don't think we need yet another config, as it would be quite obscure to use this non-printing control code for comments in a CSV file.

> I agree, but once the changes are done, `\u0000` won't be treated as a comment character, which will resolve this bug. But then the default comment character will be `#`, since that is univocity's default. So if a data row starts with `#`, will that row still be processed? If not, this will break most existing jobs.

> I agree; I'll fix that in the next commit. We need to set the comment char to whatever Spark is using no matter what. However, it looks like we are going to need your univocity fix to really fix this; that appears to have just been released in 2.9.0 (uniVocity/univocity-parsers@f392311), so let me try that. @dongjoon-hyun it is a correctness issue, but I wouldn't hold up a release for it. We should address it, but it doesn't absolutely have to happen in 2.4.7 or 3.0.1. It's not a regression.

> Thanks. If you are fine with it, I can also raise a PR for this.

> I would think this is rather a bug fix. If comment is not set, it shouldn't assume anything else is a comment.

> That's also what we documented, see also …

> Right, so we have to use the new method in univocity 2.9.0 to turn off its comment handling when it's unset in Spark (= `\u0000`).

> Oh right, this stanza is for the writer settings, and there is no way to disable comment processing there.
```diff
@@ -242,7 +244,11 @@ class CSVOptions(
     format.setQuoteEscape(escape)
     lineSeparator.foreach(format.setLineSeparator)
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
-    format.setComment(comment)
+    if (isCommentSet) {
+      format.setComment(comment)
+    } else {
+      settings.setCommentProcessingEnabled(false)
+    }

     settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
     settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
```
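The reader-side change above can be sketched against univocity-parsers directly. This is a minimal sketch, not Spark's actual code: it assumes univocity-parsers 2.9.0+ on the classpath, `settingsFor` is a hypothetical helper, and `isCommentSet` here mirrors Spark's check against the `'\u0000'` sentinel default.

```scala
import com.univocity.parsers.csv.CsvParserSettings

// The '\u0000' sentinel means the user never supplied the "comment" option.
def isCommentSet(comment: Char): Boolean = comment != '\u0000'

// Hypothetical helper: map an optional comment character onto parser settings.
def settingsFor(comment: Char): CsvParserSettings = {
  val settings = new CsvParserSettings
  if (isCommentSet(comment)) {
    settings.getFormat.setComment(comment)      // honor the user's comment char
  } else {
    settings.setCommentProcessingEnabled(false) // treat no line as a comment
  }
  settings
}
```

With comment processing disabled, a line beginning with `\u0000` (or any other character) reaches the parser as data instead of being silently dropped.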
```diff
@@ -1902,25 +1902,26 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa

   test("SPARK-25387: bad input should not cause NPE") {
     val schema = StructType(StructField("a", IntegerType) :: Nil)
-    val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+    val input = spark.createDataset(Seq("\u0001\u0000\u0001234"))

     checkAnswer(spark.read.schema(schema).csv(input), Row(null))
     checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null))
-    assert(spark.read.csv(input).collect().toSet == Set(Row()))
+    assert(spark.read.schema(schema).csv(input).collect().toSet == Set(Row(null)))
   }

   test("SPARK-31261: bad csv input with `columnNameCorruptRecord` should not cause NPE") {
     val schema = StructType(
       StructField("a", IntegerType) :: StructField("_corrupt_record", StringType) :: Nil)
-    val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+    val input = spark.createDataset(Seq("\u0001\u0000\u0001234"))

     checkAnswer(
       spark.read
         .option("columnNameOfCorruptRecord", "_corrupt_record")
         .schema(schema)
         .csv(input),
-      Row(null, null))
-    assert(spark.read.csv(input).collect().toSet == Set(Row()))
+      Row(null, "\u0001\u0000\u0001234"))
+    assert(spark.read.schema(schema).csv(input).collect().toSet ==
+      Set(Row(null, "\u0001\u0000\u0001234")))
   }

   test("field names of inferred schema shouldn't compare to the first row") {
```

Review comments on these tests:

> I think this test was wrong in two ways. First, it relied on, actually, ignoring lines starting with `\u0000` …

> The other problem, I think, is that this was asserting there is no corrupt record (no result at all), when clearly the test should result in a single row with a corrupt record.
```diff
@@ -2366,6 +2367,17 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
     }
   }

+  test("SPARK-32614: don't treat rows starting with null char as comment") {
+    withTempPath { path =>
+      Seq("\u0000foo", "bar", "baz").toDS.write.text(path.getCanonicalPath)
+      val df = spark.read.format("csv")
+        .option("header", "false")
+        .option("inferSchema", "true")
+        .load(path.getCanonicalPath)
+      assert(df.count() == 3)
+    }
+  }
+
   test("case sensitivity of filters references") {
     Seq(true, false).foreach { filterPushdown =>
       withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> filterPushdown.toString) {
```
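The new test exercises the default path, where no comment character is configured. As a usage sketch (hypothetical file name; assumes an active `SparkSession` named `spark`):

```scala
// With no "comment" option set, no line is treated as a comment, so a row
// beginning with '\u0000' is parsed as ordinary data.
val all = spark.read.csv("data.csv")

// Opting in is unchanged: lines starting with '#' are skipped.
val withoutComments = spark.read.option("comment", "#").csv("data.csv")
```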
A further review comment:

> I think it's correct to not trim the string that's checked to see if it starts with a comment, which is a slightly separate issue. `\u0000` can't be used as a comment char, but other non-printable chars could.
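The untrimmed check that comment refers to can be sketched as follows (`isCommentLine` is a hypothetical helper, not Spark's code; the point is that the raw line is inspected as-is, so only the literal first character matters):

```scala
// Decide whether a raw line is a comment line. The line is checked as-is;
// leading whitespace is deliberately NOT trimmed before the comparison.
def isCommentLine(line: String, comment: Char): Boolean =
  line.nonEmpty && line.charAt(0) == comment
```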