This Nextflow pipeline uploads files and directories to iRODS storage with comprehensive metadata management. The pipeline supports three main operations: file upload with automatic checksum verification, metadata attachment to existing iRODS collections, and metadata retrieval from iRODS collections.
- `main.nf`: the main Nextflow pipeline that orchestrates file uploads and metadata management with iRODS
- `nextflow.config`: configuration for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
- `configs/`: configuration files for individual pipeline modules
- `modules/local/irods/storefile/`: module for uploading files to iRODS with checksum verification
- `modules/local/irods/attachmetadata/`: module for attaching metadata to iRODS collections
- `modules/local/irods/getmetadata/`: module for retrieving metadata from iRODS collections
- `modules/local/irods/aggregatemetadata/`: module for aggregating retrieved metadata
- `modules/local/csv/concat/`: module for concatenating CSV files
- `examples/`: example input files for different pipeline operations
- `tests/`: test data and configurations for pipeline validation
- File Discovery: Reads file/directory information from CSV input file
- Path Classification: Distinguishes between individual files and directories for processing
- File Collection: For directories, recursively gathers all files within the directory structure
- File Filtering: Applies ignore patterns to exclude specified file types from upload
- iRODS Upload: Transfers files to iRODS with MD5 checksum verification
- Metadata Attachment: Attaches custom metadata to iRODS collections (separate operation)
- Metadata Retrieval: Retrieves metadata from existing iRODS collections (separate operation)
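The discovery, classification, and filtering steps above can be sketched as follows. This is an illustrative Python sketch, not the pipeline's actual implementation; the function name `collect_upload_tasks` and its signature are assumptions.

```python
from pathlib import Path

def collect_upload_tasks(local_path, irods_path, ignore_ext=None):
    """Expand one input row into (local file, iRODS target) pairs.

    A plain file maps 1:1 to its iRODS path; a directory is walked
    recursively and each file keeps its relative path under the target
    collection. Files matching any ignore extension are dropped.
    Illustrative sketch only -- not the pipeline's actual code.
    """
    ignore = [e.strip() for e in (ignore_ext or "").split(",") if e.strip()]
    src = Path(local_path)
    files = [src] if src.is_file() else sorted(p for p in src.rglob("*") if p.is_file())
    tasks = []
    for f in files:
        if any(ext in f.name for ext in ignore):
            continue  # ignore-pattern filtering (--ignore_ext)
        if src.is_file():
            target = irods_path
        else:
            # preserve the directory structure under the target collection
            target = f"{irods_path}/{f.relative_to(src).as_posix()}"
        tasks.append((str(f), target))
    return tasks
```

The same relative-path rule is what makes `/path/to/directory/subdir/file.txt` land at `/archive/cellgeni/target/directory/subdir/file.txt`.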
- `--upload`: Path to a CSV file containing upload information with columns `path` (local filesystem path) and `irodspath` (target iRODS path), OR
- `--attach_metadata`: Path to a CSV or JSON file containing metadata information with a column `irodspath` (target iRODS path) and additional metadata key-value pairs, OR
- `--get_metadata`: Path to a CSV file containing iRODS paths with the column `irodspath` (iRODS path to retrieve metadata from)
- `--output_dir`: Output directory for pipeline results (default: `"results"`)
- `--publish_mode`: File publishing mode (default: `"copy"`)
- `--ignore_ext`: Comma-separated list of file extensions to ignore during upload (default: `null`)
- `--remove_existing_metadata`: Remove existing metadata before adding new metadata (default: `false`)
- `--dup_meta_separator`: Separator for splitting multiple values in metadata fields (default: `";"`)
- `--metadata_index_name`: Column name for the iRODS path in metadata files (default: `"irodspath"`)
- `--join`: Join method for metadata operations (default: `"outer"`)
- `--verbose`: Enable verbose output for detailed logging (default: `false`)
The pipeline supports three distinct operation modes:
CSV file with the following structure:
```
path,irodspath
/path/to/local/file.txt,/archive/cellgeni/target/file.txt
/path/to/local/directory,/archive/cellgeni/target/directory
/path/to/another/file.csv,/archive/cellgeni/target/data.csv
```
Where:

- `path`: Absolute path to the local file or directory to upload
- `irodspath`: Target path in iRODS where the file/directory should be stored
CSV file with the following structure:
```
irodspath,meta1,meta2,meta3
/archive/cellgeni/target/collection1,value1,value2,value3
/archive/cellgeni/target/collection2,value4,value5,value6
/archive/cellgeni/target/collection3,value7,value8,value9
```
JSON file with the following structure:
```
[
  {
    "irodspath": "/archive/cellgeni/target/collection1",
    "meta1": "value1",
    "meta2": "value2",
    "meta3": "value3"
  },
  {
    "irodspath": "/archive/cellgeni/target/collection2",
    "meta1": "value4",
    "meta2": "value5",
    "meta3": "value6"
  }
]
```

Where:
- `irodspath`: Target iRODS collection path for metadata attachment
- Additional fields: Custom metadata key-value pairs to attach to the collection
- Values can contain multiple entries separated by the `--dup_meta_separator` (default: `";"`), which creates a separate metadata entry for each value
CSV file with the following structure:
```
irodspath
/archive/cellgeni/collection1
/archive/cellgeni/collection2
/archive/cellgeni/collection3
```
Where:

- `irodspath`: iRODS collection path to retrieve metadata from
Files are uploaded directly to the specified iRODS path:
- Local: `/path/to/file.txt`
- iRODS: `/archive/cellgeni/target/file.txt`
All files within the directory are uploaded while preserving the directory structure:
- Local: `/path/to/directory/subdir/file.txt`
- iRODS: `/archive/cellgeni/target/directory/subdir/file.txt`
When `--ignore_ext` is specified, files whose names contain any of the specified extensions are excluded from upload:

```
--ignore_ext ".bam,.fastq.gz,.tmp"
```

When attaching metadata, values can contain multiple entries separated by a delimiter (default: `";"`). Each entry creates a separate metadata attribute with the same key:
Input CSV:

```
irodspath,authors,keywords
/archive/collection1,"John Doe;Jane Smith","genomics;analysis"
```
Result: Creates four separate metadata entries:

```
authors = John Doe
authors = Jane Smith
keywords = genomics
keywords = analysis
```
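The splitting behaviour above can be sketched in a few lines of Python. The helper name `expand_metadata` is illustrative, not the pipeline's actual code; it assumes one row keyed by the `--metadata_index_name` column.

```python
def expand_metadata(row, index_name="irodspath", sep=";"):
    """Split multi-valued metadata fields into one (key, value) pair
    per entry, mirroring the --dup_meta_separator behaviour described
    above. Illustrative sketch, not the pipeline's implementation."""
    path = row[index_name]
    pairs = []
    for key, raw in row.items():
        if key == index_name:
            continue  # the index column is the target path, not metadata
        for value in str(raw).split(sep):
            value = value.strip()
            if value:
                pairs.append((key, value))
    return path, pairs
```

For the example row above, this yields four pairs: two `authors` entries and two `keywords` entries.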
Upload files and directories to iRODS:

```
nextflow run main.nf --upload upload.csv
```

Upload files while excluding specific file types:

```
nextflow run main.nf \
    --upload upload.csv \
    --ignore_ext ".bam,.fastq.gz,.tmp"
```

Attach metadata to existing iRODS collections:

```
nextflow run main.nf --attach_metadata metadata.csv
```

Attach metadata using JSON format:

```
nextflow run main.nf --attach_metadata metadata.json
```

Remove existing metadata before adding new metadata:

```
nextflow run main.nf \
    --attach_metadata metadata.csv \
    --remove_existing_metadata
```

Retrieve metadata from existing iRODS collections:

```
nextflow run main.nf --get_metadata get_metadata.csv
```

Get detailed logging information during upload:

```
nextflow run main.nf \
    --upload upload.csv \
    --verbose
```

Specify a different output directory for results:

```
nextflow run main.nf \
    --upload upload.csv \
    --output_dir "my_results"
```

Use semicolons to split multiple values in metadata fields:

```
nextflow run main.nf \
    --attach_metadata metadata.csv \
    --dup_meta_separator ";"
```

Use a different column name for iRODS paths:

```
nextflow run main.nf \
    --attach_metadata metadata.csv \
    --metadata_index_name "irods_collection_path"
```

The pipeline accepts any file or directory structure. Common use cases include:
Individual files:

```
/path/to/data.csv
/path/to/analysis.txt
/path/to/results.json
```
Directory structures:

```
/path/to/experiment/
├── sample1/
│   ├── data.txt
│   ├── results.csv
│   └── analysis/
│       └── output.json
├── sample2/
│   └── data.txt
└── metadata.tsv
```
Metadata is attached to existing iRODS collections. The collections should already exist in iRODS before running the metadata attachment operation.
Metadata is retrieved from existing iRODS collections. The collections should already exist in iRODS before running the metadata retrieval operation.
- MD5 checksums file: `{output_dir}/md5sums.csv`
- Contains MD5 checksums for all uploaded files
- Includes both local and iRODS checksums for verification
- Format: `collection_id,filepath,irodspath,md5,irodsmd5`
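Given that layout, the checksum verification described above can be reproduced downstream with a short script. This is a sketch assuming the column names listed; `verify_checksums` is a hypothetical helper, not part of the pipeline.

```python
import csv

def verify_checksums(md5sums_csv):
    """Compare each row's local MD5 against the iRODS-side MD5 in the
    md5sums.csv layout described above. Returns the iRODS paths whose
    checksums disagree. Illustrative helper, not pipeline code."""
    mismatches = []
    with open(md5sums_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["md5"] != row["irodsmd5"]:
                mismatches.append(row["irodspath"])
    return mismatches
```

An empty return value means every uploaded file's checksum matched its iRODS copy.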
- Metadata is directly attached to iRODS collections
- No local output files are generated
- Metadata file: `{output_dir}/metadata.csv`
- Contains retrieved metadata from the specified iRODS collections
- Aggregated metadata from all queried collections
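The aggregation step can be sketched as an outer join over per-collection metadata records, matching the default `--join "outer"` behaviour. The function below is illustrative (the pipeline uses pandas for this; here plain dicts keep the sketch self-contained):

```python
def aggregate_metadata(records, index_name="irodspath"):
    """Combine per-collection metadata dicts into one rectangular table.

    Each record looks like {"irodspath": ..., attr: value, ...}.
    Attributes missing from a record are filled with "" so every row
    shares the same columns (outer-join semantics). Illustrative sketch."""
    columns = [index_name]
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)  # union of all attribute names
    return [{col: rec.get(col, "") for col in columns} for rec in records]
```

The resulting rows can then be written out as the consolidated `metadata.csv`.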
- Files are transferred to iRODS using the `iput` command
- MD5 checksums are calculated for both local and iRODS copies
- Checksums are compared to ensure data integrity
- Upload results are logged and saved to CSV format
- Metadata key-value pairs are extracted from the input CSV or JSON file
- If `--remove_existing_metadata` is enabled, existing metadata is removed first
- Each metadata attribute is attached to the specified iRODS collection
- Existing metadata can be updated or new metadata can be added
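The attachment step can be sketched as building one `imeta` invocation per attribute/value pair. `imeta add -C` is the standard command for adding an AVU to a collection; the wildcard `imeta rmw` call shown for the remove-existing case is an assumption about how clearing might be done, not necessarily what the pipeline runs:

```python
def build_imeta_commands(irods_path, pairs, remove_existing=False):
    """Build argument lists for imeta calls against one collection.

    One `imeta add -C` per (attribute, value) pair; the wildcard `rmw`
    for clearing existing metadata is an assumption, hedged above.
    Illustrative sketch, not the pipeline's implementation."""
    commands = []
    if remove_existing:
        # assumed wildcard removal of all existing AVUs
        commands.append(["imeta", "rmw", "-C", irods_path, "%", "%"])
    for attr, value in pairs:
        commands.append(["imeta", "add", "-C", irods_path, attr, value])
    return commands
```

Each argument list could then be run with `subprocess.run(cmd, check=True)`.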
- iRODS collection paths are read from the input CSV file
- Metadata is retrieved from each specified iRODS collection using `imeta` commands
- Retrieved metadata is aggregated and formatted into a consolidated CSV file
- The final metadata file is saved to the output directory
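Parsing the retrieved metadata can be sketched as below, assuming the conventional `attribute:` / `value:` / `units:` line layout that `imeta ls -C <collection>` prints; adjust if your iRODS version formats its output differently. The parser collects repeated attributes into lists, matching the multi-value entries created at attachment time:

```python
def parse_imeta_ls(output):
    """Parse `imeta ls -C <collection>` text into {attribute: [values]}.

    Assumes the usual attribute:/value:/units: layout; illustrative
    sketch, not the pipeline's implementation."""
    meta, attr = {}, None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("attribute:"):
            attr = line.split(":", 1)[1].strip()
        elif line.startswith("value:") and attr is not None:
            meta.setdefault(attr, []).append(line.split(":", 1)[1].strip())
    return meta
```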
- Nextflow: Version 25.04.4 or higher
- Singularity: For containerized execution
- iRODS client: Access to iRODS commands (`iput`, `imeta`, etc.)
- LSF: For job submission on HPC clusters (configured for Sanger's environment)
- Python: Python 3.x with pandas for metadata aggregation operations
The pipeline includes comprehensive testing infrastructure:
- nf-test: Testing framework for Nextflow modules and workflows
- Test data: Example files located in the `tests/` directory
- Module tests: Individual module testing in `modules/*/tests/` directories
- Example files: Sample input files in the `examples/` directory for each operation mode
To run tests:

```
nf-test test
```

- File not found: Pipeline will fail if specified local files/directories don't exist
- iRODS connection: Pipeline will retry failed iRODS operations up to 5 times, then ignore on final failure
- Checksum mismatch: Upload failures are reported in the output logs
- Invalid CSV format: Pipeline validates CSV headers and structure
- Empty metadata: Modules handle empty metadata gracefully with appropriate warnings
- Path resolution: Automatic detection of iRODS collections vs data objects, including symbolic links
The pipeline generates comprehensive reports in the `reports/` directory:
- Timeline report: Visual timeline of task execution
- Execution report: Detailed resource usage and performance metrics
- Trace file: Complete execution trace for debugging
All temporary work files are stored in the `nf-work/` directory and can be cleaned up after successful execution.
- Only one operation mode can be used per pipeline run (`--upload` OR `--attach_metadata` OR `--get_metadata`)
- File paths must be absolute to avoid ambiguity
- iRODS collections for metadata attachment and retrieval must exist before running the pipeline
- Metadata files can be in either CSV or JSON format
- When using `--remove_existing_metadata`, all existing metadata will be removed before adding new metadata
- Large file uploads may take considerable time depending on network bandwidth
- The pipeline is optimized for batch operations rather than single file transfers
- Configuration files in the `configs/` directory allow fine-tuning of individual modules
- The pipeline uses Singularity containers with specific images for Python-based operations
- All modules include comprehensive metadata documentation in `meta.yml` files