-
Notifications
You must be signed in to change notification settings - Fork 513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable fs.gs.inputstream.fast.fail.on.not.found.enable by default #5510
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5510 +/- ##
==========================================
- Coverage 61.42% 61.40% -0.02%
==========================================
Files 312 312
Lines 11106 11106
Branches 771 771
==========================================
- Hits 6822 6820 -2
- Misses 4284 4286 +2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still require this file ? IIRC to use parquet with scio-smb
we need scio-parquet
where we define the same settings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great point actually - I think it can be deleted
Hmm, on running some tests w/ debugger enabled, I'm still seeing |
Small update: I discovered that gcs-connector options were not getting propagated to any Parquet reads that use SplittableDoFn API. I've contributed a workaround that will be included in the upcoming Beam 2.62.0 release (we'll need to set default options in Scio as well). |
This should speed up ParquetReader initialization by skipping a GCS StorageObject lookup that validates that the file exists before trying to open a ReadChannel. Per the doc:
This is redundant, since ParquetRead executes
FileIO.readMatches
before opening any ParquetReaders, which will throw an error on nonexistent files anyway; I tested this with Dataflow and confirmed:Therefore, we know we'll always invoke ParquetReader with an existent file path. Since this metadata lookup can be quite expensive threadpool-wise, so let's skip it since we don't need it.