OpenCGA Catalog

Overview

A genomic data analysis platform needs to keep track of different resources, such as file metadata, sample annotations and jobs. OpenCGA Catalog aims to collect and integrate all the information needed to execute genomic analyses. This information is organized in five main entities: users, studies, files, jobs and samples.

The main tasks of the Catalog are to provide:

Authentication and authorization to resources such as files or samples
Collaborative environment
File audit to keep track of files and metadata
Analysis and Jobs
Sample annotation
Security

All this information can be stored and retrieved using a Java API or a RESTful web services API.

Data Models (version 0.5)

In this section you can find a brief description of the most relevant entities. For more detailed information about the data models, such as the Java source code, examples or the JSON Schemas, you can visit the OpenCGA Catalog Data Models page.

The most relevant entities in OpenCGA Catalog are:

User: Contains all the data related to a user account.
Project: Group of related studies.
Study: Main workspace. Contains files, jobs and samples.
File: Any submitted or generated file.
Job: Analysis job executed with the files from the study.
Sample: Related to the files.
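
The containment relationships listed above can be pictured with a minimal sketch. This is illustrative Python, not the Catalog data model itself: every class and field name below is hypothetical, and the real definitions are the ones on the OpenCGA Catalog Data Models page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    name: str  # hypothetical field; any submitted or generated file


@dataclass
class Sample:
    name: str  # hypothetical field
    files: List[File] = field(default_factory=list)  # files related to the sample


@dataclass
class Job:
    name: str  # hypothetical field
    input_files: List[File] = field(default_factory=list)  # files from the study


@dataclass
class Study:
    name: str
    files: List[File] = field(default_factory=list)
    jobs: List[Job] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Project:
    name: str
    studies: List[Study] = field(default_factory=list)  # group of related studies


@dataclass
class User:
    user_id: str  # hypothetical field; stands in for the account data
```

A study then acts as the container that ties a job to the files it reads, for example `Study("s1", files=[vcf], jobs=[Job("stats", [vcf])])`.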

Authentication

Catalog uses a pluggable mechanism for authentication. Depending on the requirements of the project and the configuration of the organization, different plugins can be used. The plugins currently available are:

Catalog's own database

Authorization

Catalog defines a set of Access Control Lists (ACLs) on different elements, providing a fully customizable authorization mechanism. The elements that contain ACLs are: Project, Study, File and Sample. These ACLs are read hierarchically in order to determine the permissions of a user over some data. Authorizing a specific action on an element requires that all the elements in the hierarchy accept the action. Each entry in the ACL of an element relates a specific user to the actions that user can perform. There is a special ACL entry (default) that is applied when no entry exists for a user.
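
The hierarchical check described above can be sketched as follows. This is a minimal illustration in Python, not Catalog's implementation: the `Element` class, the `"default"` key and the action names are hypothetical, and the sketch assumes that an element with neither a user entry nor a default entry rejects the action.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

DEFAULT = "default"  # hypothetical key for the special default ACL entry


@dataclass
class Element:
    """A Project, Study, File or Sample carrying its own ACL."""
    name: str
    acl: Dict[str, Set[str]] = field(default_factory=dict)  # user -> allowed actions
    parent: Optional["Element"] = None  # next element up the hierarchy


def allowed_here(element: Element, user: str, action: str) -> bool:
    """Check one element: use the user's entry, else fall back to the default entry."""
    if user in element.acl:
        entry = element.acl[user]
    else:
        # Assumption: with no user entry and no default entry, nothing is allowed.
        entry = element.acl.get(DEFAULT, set())
    return action in entry


def is_authorized(element: Element, user: str, action: str) -> bool:
    """An action is authorized only if every element up the hierarchy accepts it."""
    node: Optional[Element] = element
    while node is not None:
        if not allowed_here(node, user, action):
            return False
        node = node.parent
    return True


project = Element("project", {"alice": {"read", "write"}, DEFAULT: {"read"}})
study = Element("study", {"alice": {"read", "write"}, DEFAULT: set()}, parent=project)
vcf = Element("variants.vcf", {"alice": {"read"}, "bob": {"read"}}, parent=study)
```

In this example `is_authorized(vcf, "alice", "read")` holds because the file, the study and the project all accept it, while `is_authorized(vcf, "bob", "read")` fails: the file grants it, but the study falls back to an empty default entry.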

File management

In order to preserve the consistency of the metadata over the set of files related to a project or a study, Catalog must be able to access the files. Files can be tracked in different ways:

Copying or moving data into the “Catalog main root”. From this moment, the data is managed by Catalog, and all file operations (moving, renaming, etc.) are performed through Catalog.
Synchronizing a study directory. When creating a Study, a study location URI can be defined. During creation, Catalog scans the folder and records the information needed to start tracking the files. Any modification of this directory (or its subdirectories) made without notifying Catalog leaves the metadata in an inconsistent state.
Synchronizing a single file. 🚧 The most flexible (and expensive) way of synchronizing data is single-file tracking. A file or directory can be included in Catalog by providing its URI. The only restrictions on this way of synchronizing are that the file has to be accessible to Catalog and that Catalog has to be notified if the file is renamed, moved or modified. 🚧
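
To make the study directory case concrete, the sketch below scans a folder, records one piece of information per file, and later reports where the directory no longer matches the record. It is a hedged illustration only: the function names are hypothetical, and recording the file size is an assumption chosen for brevity (a real tracker could just as well record checksums or modification times).

```python
import os
from typing import Dict, List


def scan_study_directory(root: str) -> Dict[str, int]:
    """Walk the directory tree and record the size of each file by relative path."""
    records: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            records[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return records


def find_inconsistencies(recorded: Dict[str, int], root: str) -> Dict[str, List[str]]:
    """Compare the recorded state with the directory as it is now."""
    current = scan_study_directory(root)
    return {
        # Tracked files that disappeared (deleted, moved or renamed).
        "missing": sorted(set(recorded) - set(current)),
        # Files that appeared without being registered.
        "untracked": sorted(set(current) - set(recorded)),
        # Files whose recorded size no longer matches.
        "modified": sorted(
            path for path in set(recorded) & set(current)
            if recorded[path] != current[path]
        ),
    }
```

A rename shows up as one `missing` and one `untracked` path, which is why changes made behind the tracker's back are hard to interpret after the fact.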

To specify the physical location of a file, Catalog uses URIs. With the different data managers (IOManager), Catalog can manage resources under different file protocols (specified by the URI scheme): file:// (PosixIOManager), 🚧 hdfs:// (HdfsIOManager), 🚧 i:// (IRodsIOManager), ...
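
Selecting a data manager from the URI scheme can be sketched as a simple lookup. The snippet is illustrative Python rather than Catalog code: the manager names are taken from the text above and used only as labels, while the `IO_MANAGERS` table and the `io_manager_for` function are hypothetical.

```python
from urllib.parse import urlparse

# Scheme -> manager name, as listed in the text. The table itself is hypothetical.
IO_MANAGERS = {
    "file": "PosixIOManager",
    "hdfs": "HdfsIOManager",
    "i": "IRodsIOManager",
}


def io_manager_for(uri: str) -> str:
    """Return the name of the manager responsible for the URI's scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in IO_MANAGERS:
        raise ValueError(f"No IOManager registered for scheme {scheme!r} in {uri!r}")
    return IO_MANAGERS[scheme]
```

For instance, `io_manager_for("file:///data/study1/sample.vcf")` selects the POSIX manager, and a bare path without a scheme is rejected instead of being silently treated as a local file.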

Read files

All files can be used as input for analysis tools. Also, depending on some configurable thresholds, some basic read operations can be done using:

grep
head
tail
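
As a rough sketch of how such threshold-guarded read operations might look, the Python below implements head, tail and grep behind a size check. The text does not say what the thresholds measure, so treating them as a maximum file size is an assumption; `MAX_READABLE_BYTES` and the function signatures are hypothetical.

```python
import os
import re
from collections import deque
from itertools import islice
from typing import List

MAX_READABLE_BYTES = 10 * 1024 * 1024  # hypothetical configurable threshold


def _check_size(path: str, max_bytes: int) -> None:
    """Refuse to read files larger than the configured threshold."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes} byte limit")


def head(path: str, lines: int = 10, max_bytes: int = MAX_READABLE_BYTES) -> List[str]:
    """Return the first `lines` lines of the file."""
    _check_size(path, max_bytes)
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, lines)]


def tail(path: str, lines: int = 10, max_bytes: int = MAX_READABLE_BYTES) -> List[str]:
    """Return the last `lines` lines of the file."""
    _check_size(path, max_bytes)
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent `lines` lines.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


def grep(path: str, pattern: str, max_bytes: int = MAX_READABLE_BYTES) -> List[str]:
    """Return every line that matches the regular expression."""
    _check_size(path, max_bytes)
    regex = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if regex.search(line)]
```

Checking the size before opening the file keeps these convenience operations cheap, while larger files remain available as input for analysis tools.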

File life cycle

Sample annotation

Implementation

Java and RESTful APIs (microservice in the future)
MongoDB implementation (plugin: PostgreSQL) and collection schema

OpenCGA is an open source project and it is freely available.
