DefaultCacheStore: misleading error + no escape hatch when launch dir is on NFS/Lustre/GPFS #6996
Description
Bug report
When Nextflow is launched from a directory on a network filesystem that does not support POSIX file locking (NFS, Lustre, GPFS, BeeGFS, …), opening the LevelDB cache database fails with:
```
ERROR ~ Can't open cache DB: /path/to/launchDir/.nextflow/cache/<uuid>/db
Nextflow needs to be executed in a shared file system that supports file locks.
Alternatively, you can run it in a local directory and specify the shared work
directory by using the "-w" command line option.
```
This error manifests in `DefaultCacheStore.openDb()` (`DefaultCacheStore.groovy`) when `Iq80DBFactory.open()` throws an exception that is not the known `"Unable to acquire lock"` variant.
Root Cause
`DefaultCacheStore` always places the LevelDB cache directory under `.nextflow/cache/` relative to the pipeline launch directory (`Const.appCacheDir`). On HPC clusters, the launch directory is typically on a shared filesystem (NFS, Lustre, etc.) that does not support the file locking mechanisms LevelDB requires.
Three concrete problems exist today:
1. The underlying exception is swallowed silently
The caught `Exception e` is attached as the cause of the thrown `IOException` but is never logged. Developers and users inspecting `.nextflow.log` cannot see what LevelDB actually reported, which makes diagnosis unnecessarily difficult.
```groovy
// current code – e is never logged
throw new IOException(msg, e)
```
2. No escape hatch when the launch directory cannot be moved
The only documented workaround is to run Nextflow from a local directory and use `-w` to redirect the work directory. However, `-w` only controls where task work directories are created — it does not move the `.nextflow/cache/` directory. Users who must launch from a shared path (e.g. a project directory enforced by HPC policies) have no supported way to redirect just the cache.
3. The error message is misleading
The message tells users to use `-w`, which does not actually solve the underlying problem (the cache DB is still on the shared filesystem). Users following this advice will hit the same error.
Proposed Fixes
Fix 1 – Log the root cause (diagnostic)
Log the caught exception at `DEBUG` level in the `else` branch so it always appears in `.nextflow.log`:
```groovy
log.debug "Failed to open LevelDB cache at path: $file -- cause: ${e.message}", e
```
Fix 2 – Add `NXF_CACHE_DIR` env-var support
Introduce a `resolveCacheBaseDir()` helper that checks the `NXF_CACHE_DIR` environment variable. When set, the variable overrides the default `.nextflow/` path, allowing users to redirect only the cache DB to a lock-capable local filesystem without moving the launch directory:
```bash
export NXF_CACHE_DIR=/tmp/nxf-cache-$USER
nextflow run pipeline.nf -w /projects/exome/work
```
Fix 3 – Rewrite the error message
Replace the current misleading message with one that:
- accurately names the problem (cache DB on a lock-incapable filesystem)
- lists both remedies: run from a local dir with `-w`, or set `NXF_CACHE_DIR`
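The lookup order proposed in Fix 2 can be sketched as follows. This is written in Java for illustration only (the actual change would live in the Groovy source); the helper name `resolveCacheBaseDir` comes from the proposal above, while the method signature and the `CacheDirResolver` class are hypothetical:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheDirResolver {

    /**
     * Resolve the base directory that holds the cache DB.
     * When NXF_CACHE_DIR is set, it overrides the default
     * ".nextflow" directory under the launch directory.
     */
    static Path resolveCacheBaseDir(String nxfCacheDirEnv, Path launchDir) {
        if (nxfCacheDirEnv != null && !nxfCacheDirEnv.isEmpty())
            return Paths.get(nxfCacheDirEnv);
        return launchDir.resolve(".nextflow");
    }

    public static void main(String[] args) {
        Path launch = Paths.get("/projects/exome");
        // default behaviour: cache stays under the launch directory
        System.out.println(resolveCacheBaseDir(null, launch));
        // with the env var set: cache is redirected to a local path
        System.out.println(resolveCacheBaseDir("/tmp/nxf-cache", launch));
    }
}
```

With this shape, only the cache DB moves to a lock-capable filesystem; the launch directory and the `-w` work directory are unaffected.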
Affected file
`modules/nextflow/src/main/groovy/nextflow/cache/DefaultCacheStore.groovy`
Environment
Reproducible whenever the pipeline is launched from a directory on NFS, Lustre, or GPFS. Common in HPC environments, where project directories typically live on a shared parallel filesystem.
Proposed implementation
A reference implementation of all three fixes is available at:
https://github.com/matthdsm/nextflow/tree/fix/cache-db-nfs-error