Summary
When reading a Lance dataset by path with storage credentials supplied via the
DataFrame reader .option(...) API, the credentials are silently dropped before the
dataset is opened. The native object store then falls back to its default credential
chain — for Azure, the IMDS / Managed Identity endpoint — and the read fails on any
environment without a managed identity.
Repro
spark.read.format("lance")
  .option("azure_storage_sas_token", "<sas>")
  .option("azure_storage_account_name", "<account>")
  .load("abfss://<fs>@<account>.dfs.core.windows.net/path/to/ds.lance")
  .show()
No spark.sql.catalog.* is configured, and the SAS token is not set as an OS environment variable.
Observed
LanceError(IO): Generic MicrosoftAzure error: Error performing token request:
Error performing GET http://169.254.169.254/metadata/identity/oauth2/token
?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com ... after 3 retries
at org.lance.Dataset.openNative(Native Method)
at org.lance.spark.utils.Utils$OpenDatasetBuilder.build(Utils.java:...)
at org.lance.spark.BaseLanceNamespaceSparkCatalog.loadTableFromPath(...)
Root cause
- LanceDataSource implements SupportsCatalogOptions. On a path-based read, Spark calls
  extractIdentifier(options) and then catalog.loadTable(identifier); only the
  Identifier is forwarded, and the per-read option map is not.
- extractIdentifier builds new LanceIdentifier(readOptions.getDatasetUri()), keeping
  only the location and discarding the options.
- loadTableFromPath therefore rebuilds storage options from catalogConfig alone,
  which is empty when credentials were supplied per-read.
- The resulting credential-less Dataset.open falls through to object_store's default
  Azure credential chain → IMDS/MSI → the error above.
This is not Azure-specific — the same drop affects any per-.option() storage credential
(S3 keys, GCS, etc.) on the path-based read. It is masked if the credential also happens
to be present as an OS environment variable, since lance core reads those as a fallback.
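The drop described above can be modelled in a few lines. This is only a toy sketch: OptionDropSketch, PathIdentifier and storageOptionsForOpen are hypothetical stand-ins for illustration, not the actual lance-spark classes, and the "path" key is an assumption about where the dataset location lives in the option map.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the option drop; none of these types are the real lance-spark classes.
class OptionDropSketch {

    // Hypothetical stand-in for LanceIdentifier: it carries the dataset location only.
    static final class PathIdentifier {
        final String location;

        PathIdentifier(String location) {
            this.location = location;
        }
    }

    // Models the extractIdentifier step: only the location survives.
    static PathIdentifier extractIdentifier(Map<String, String> readOptions) {
        return new PathIdentifier(readOptions.get("path"));
    }

    // Models the loadTableFromPath step: storage options are rebuilt from
    // catalog-level config alone, because the identifier carries nothing else.
    static Map<String, String> storageOptionsForOpen(PathIdentifier id,
                                                     Map<String, String> catalogConfig) {
        return new HashMap<>(catalogConfig);
    }

    public static void main(String[] args) {
        Map<String, String> readOptions = new HashMap<>();
        readOptions.put("path", "abfss://<fs>@<account>.dfs.core.windows.net/path/to/ds.lance");
        readOptions.put("azure_storage_sas_token", "<sas>");

        // Empty, as in the repro: no spark.sql.catalog.* is configured.
        Map<String, String> catalogConfig = new HashMap<>();

        PathIdentifier id = extractIdentifier(readOptions);
        Map<String, String> effective = storageOptionsForOpen(id, catalogConfig);

        System.out.println("location kept: " + id.location);
        System.out.println("SAS token reaches open: "
                + effective.containsKey("azure_storage_sas_token"));
    }
}
```

In this model the location survives but the SAS token never reaches the open call, which matches the behaviour reported here.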
Proposed fix
Carry the per-read options on LanceIdentifier and thread them through
loadTableFromPath so they reach the native open, with per-read options overriding
catalog-level storage options. PR to follow.
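One possible shape for the override rule, as a hedged sketch rather than the actual patch: PathIdentifierWithOptions and mergeStorageOptions are hypothetical names, and the real change may structure this differently. The only point illustrated is the precedence: catalog-level options are applied first, and per-read options win on key conflicts.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed direction; names are illustrative, not the lance-spark API.
class OptionMergeSketch {

    // Hypothetical identifier that keeps the per-read options next to the location.
    static final class PathIdentifierWithOptions {
        final String location;
        final Map<String, String> readOptions;

        PathIdentifierWithOptions(String location, Map<String, String> readOptions) {
            this.location = location;
            this.readOptions = new HashMap<>(readOptions); // defensive copy
        }
    }

    // Catalog-level options go in first; per-read options are applied last,
    // so they override catalog-level values for the same key.
    static Map<String, String> mergeStorageOptions(Map<String, String> catalogOptions,
                                                   Map<String, String> perReadOptions) {
        Map<String, String> merged = new HashMap<>(catalogOptions);
        merged.putAll(perReadOptions);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> catalogOptions = new HashMap<>();
        catalogOptions.put("azure_storage_account_name", "<catalog-account>");

        Map<String, String> perRead = new HashMap<>();
        perRead.put("azure_storage_account_name", "<account>");
        perRead.put("azure_storage_sas_token", "<sas>");

        PathIdentifierWithOptions id = new PathIdentifierWithOptions(
                "abfss://<fs>@<account>.dfs.core.windows.net/path/to/ds.lance", perRead);

        Map<String, String> effective = mergeStorageOptions(catalogOptions, id.readOptions);
        System.out.println("account used: " + effective.get("azure_storage_account_name"));
        System.out.println("SAS token reaches open: "
                + effective.containsKey("azure_storage_sas_token"));
    }
}
```

Copying the catalog map before calling putAll keeps the catalog-level configuration unmodified, so one read's credentials cannot leak into later reads.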
Environment
- lance-spark: lance-spark-base (reproduced on current main)
- Spark 3.5, Scala 2.12
Scope / not covered by existing PRs
- PR #522 (fix(catalog): propagate namespace storage options into read options)
  propagates namespace storage options ([...] describeTable) into read options and
  renames the path-based calls to createPathBasedReadOptions(...), but those calls
  still pass no per-read options, so this path-based .option() case remains broken
  after #522.