OpenCGA Catalog
A genomic data analysis platform needs to keep track of different resources, such as file metadata, sample annotations and jobs. OpenCGA Catalog aims to collect and integrate all the information needed to execute genomic analyses. This information is organized in five main entities: users, studies, files, jobs and samples.
The main tasks of the Catalog are to provide:
- Authentication and authorization to resources such as files or samples
- Collaborative environment
- File audit to keep track of files and metadata
- Analysis and Jobs
- Sample annotation
- Security
All this information can be stored and retrieved using a Java API or a RESTful web services API.
This section gives a brief description of the most relevant entities. For more detailed information about the data models, such as Java source code, examples or the JSON Schemas, visit the OpenCGA Catalog Data Models page.
The most relevant entities in OpenCGA Catalog are:
- User: Contains all the data related to a user account.
- Project: Group of related studies.
- Study: Main workspace. Contains files, jobs and samples.
- File: Any submitted or generated file.
- Job: Analysis job executed with the files from the study.
- Sample: Related to the files.
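The containment relations in the list above can be sketched as plain data classes. This is only an illustration: the field names (`name`, `tool`, `input_files`, `file_names`, `user_id`) are assumptions made for this sketch and are not taken from the actual Catalog data models, which are documented on the Data Models page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    name: str  # any submitted or generated file


@dataclass
class Sample:
    name: str
    file_names: List[str] = field(default_factory=list)  # files the sample relates to


@dataclass
class Job:
    tool: str  # hypothetical: name of the analysis that was run
    input_files: List[str] = field(default_factory=list)  # files from the study


@dataclass
class Study:
    name: str
    files: List[File] = field(default_factory=list)
    jobs: List[Job] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Project:
    name: str
    studies: List[Study] = field(default_factory=list)  # group of related studies


@dataclass
class User:
    user_id: str  # stands in for all the data related to a user account


# Build a tiny project with one study that holds a file, a sample and a job.
study = Study(
    name="study1",
    files=[File("sample1.vcf")],
    samples=[Sample("sample1", file_names=["sample1.vcf"])],
    jobs=[Job("stats", input_files=["sample1.vcf"])],
)
project = Project(name="project1", studies=[study])
```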
Catalog uses a pluggable mechanism for authentication. Depending on the requirements of the project and the configuration of the organization, different plugins can be used. The plugins currently available are:
- Catalog's own database
Catalog defines a set of Access Control Lists (ACLs) on different elements, providing a fully customizable authorization mechanism. The elements that contain ACLs are: Project, Study, File and Sample. These ACLs are read hierarchically to determine the permissions a user has over some data. Authorizing a specific action on an element requires every element in the hierarchy to accept the action. Each entry in an element's ACL relates a specific user to the actions that user can perform. A special ACL entry (default) applies when no entry exists for a user.
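The hierarchical check described above can be sketched as follows. This is a minimal illustration and not Catalog's implementation: the action names (`read`, `write`), the dictionary representation of an ACL, and the choice to deny when an element has neither a user entry nor a default entry are all assumptions.

```python
from typing import Dict, List, Set

DEFAULT = "default"  # special entry used when a user has no entry of their own

Acl = Dict[str, Set[str]]  # user id -> set of actions that user may perform


def allowed_by(acl: Acl, user: str, action: str) -> bool:
    """Check one element's ACL, falling back to the default entry."""
    # Assumption: with no user entry and no default entry, the action is denied.
    entry = acl.get(user, acl.get(DEFAULT, set()))
    return action in entry


def is_authorized(hierarchy: List[Acl], user: str, action: str) -> bool:
    """Grant the action only if every element in the hierarchy accepts it."""
    return all(allowed_by(acl, user, action) for acl in hierarchy)


# ACLs listed from the top of the hierarchy down: project, study, file.
project_acl = {"alice": {"read", "write"}, DEFAULT: {"read"}}
study_acl = {DEFAULT: {"read"}}
file_acl = {"alice": {"read", "write"}}
hierarchy = [project_acl, study_acl, file_acl]

can_read = is_authorized(hierarchy, "alice", "read")
can_write = is_authorized(hierarchy, "alice", "write")
```

Here `alice` can read the file because all three levels accept the action, but she cannot write it: the study only offers the default entry, which does not include `write`.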
To preserve the consistency of the metadata over the set of files related to a project or a study, Catalog has to be able to access the files. Files can be tracked in different ways:
- Copy or move data into the “Catalog main root”. From this moment, the data is managed by Catalog, and all file operations (moving, renaming, etc.) are done through Catalog.
- Synchronizing a study directory. When creating a Study, a study location URI can be defined. During creation, Catalog scans the folder and records the information needed to start tracking files. Any modification of this directory (or its subdirectories) without notifying Catalog leaves the metadata in an inconsistent state.
- Synchronizing a single file. 🚧 The most flexible (and expensive) way of synchronizing data is single file tracking. A file or directory can be included into Catalog by providing its URI. The only restrictions are that the file has to be accessible to Catalog and that Catalog has to be notified if the file is renamed, moved or modified. 🚧
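To illustrate what synchronizing a study directory involves, the sketch below scans a folder, records size and modification time for each file, and later reports what changed without Catalog being told. The functions `scan` and `diff`, and the choice of size and mtime as the recorded information, are assumptions made for this example; the source does not say what Catalog actually records.

```python
import os
from typing import Dict, List, Tuple

Snapshot = Dict[str, Tuple[int, float]]  # relative path -> (size in bytes, mtime)


def scan(root: str) -> Snapshot:
    """Walk the directory tree and record size and mtime of every file."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            snapshot[os.path.relpath(path, root)] = (stat.st_size, stat.st_mtime)
    return snapshot


def diff(recorded: Snapshot, current: Snapshot) -> Dict[str, List[str]]:
    """Compare the recorded state with a fresh scan of the same directory."""
    common = recorded.keys() & current.keys()
    return {
        "added": sorted(current.keys() - recorded.keys()),
        "removed": sorted(recorded.keys() - current.keys()),
        # A change that keeps both size and mtime would go unnoticed here;
        # comparing checksums would be stricter but more expensive.
        "modified": sorted(p for p in common if recorded[p] != current[p]),
    }
```

Any non-empty list returned by `diff` corresponds to the inconsistent state described above: the directory no longer matches what was recorded.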
To specify the physical location of a file, Catalog uses URIs. With the different data managers (IOManager), Catalog can manage resources under different file protocols (specified by the URI scheme): file:// (PosixIOManager), 🚧 hdfs:// (HdfsIOManager), 🚧 i:// (IRodsIOManager), ...
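The idea of choosing a data manager from the URI scheme can be sketched as a simple lookup. The manager names are the ones listed above, used here only as labels; the `io_manager_for` function, the example URIs, and the error raised for unknown schemes are assumptions for illustration, not Catalog's API.

```python
from urllib.parse import urlparse

# Scheme -> manager name, following the list in the text.
IO_MANAGERS = {
    "file": "PosixIOManager",
    "hdfs": "HdfsIOManager",
    "i": "IRodsIOManager",
}


def io_manager_for(uri: str) -> str:
    """Return the name of the manager registered for the URI's scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in IO_MANAGERS:
        raise ValueError(f"No IO manager registered for scheme {scheme!r}")
    return IO_MANAGERS[scheme]


local_manager = io_manager_for("file:///data/study1/sample1.vcf")
```

A registry keyed by scheme keeps the rest of the code independent of where a file physically lives: supporting a new protocol only means registering one more entry.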
All files can be used as input for analysis tools. Also, depending on some configurable thresholds, some basic read operations can be done using:
- grep
- head
- tail
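A minimal sketch of such read operations guarded by a threshold is shown below. The text does not say what the configurable thresholds apply to, so this example assumes a maximum file size; `MAX_BYTES`, the function signatures and the error raised are all hypothetical.

```python
import os
import re
from collections import deque
from itertools import islice
from typing import List

MAX_BYTES = 10 * 1024 * 1024  # hypothetical threshold: refuse files above 10 MiB


def _check_size(path: str, max_bytes: int) -> None:
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes} byte limit")


def head(path: str, n: int = 10, max_bytes: int = MAX_BYTES) -> List[str]:
    """Return the first n lines of the file."""
    _check_size(path, max_bytes)
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def tail(path: str, n: int = 10, max_bytes: int = MAX_BYTES) -> List[str]:
    """Return the last n lines of the file."""
    _check_size(path, max_bytes)
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the last n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def grep(path: str, pattern: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Return every line that matches the regular expression."""
    _check_size(path, max_bytes)
    regex = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if regex.search(line)]
```

All three functions stream the file line by line, so memory use stays small even when the threshold is generous.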
- Java and RESTful APIs (microservice in the future)
- MongoDB implementation (plugin PostgreSQL) and collection schema