Metadata

How to organize the dataset and the related metadata?

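  1. The dataset consists of metadata table files that conform to the specified format. Level 1-3 data files are stored in a data repository (a collective term for GSA/NODE/SRA/ENA) and are recorded in the metadata tables as links.

  2. The design of entities, attributes, and relationships must balance the difficulty of collecting and cleaning the data against representing reality as faithfully as possible.

  3. Data is usually uploaded as TSV/CSV/JSON/XML files; the first two are the most common and the most user-friendly.

  4. Most relationships between entities are one-to-one or one-to-many, so the metadata files can be managed as follows; entities with many-to-many relationships are handled separately.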

    # Directory structure: one directory per entity, each holding N batch files of about 100 records each
    |- project
    |    |- project-2021-03-18.csv
    |- donor (Each file contains a project_id that associates it with the project file. 100 records/batch)
    |    |- donor-2021-03-18.csv (Batch 1)
    |    |- donor-2021-03-19.csv (Batch 2)
    

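    ❗ Which entities may have many-to-many relationships? (To be discussed) ❗

    ❗ How can many-to-many relationships be recorded in TSV/CSV files? (To be discussed) ❗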

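  5. When Metabase manages the metadata, entity tables and merged tables coexist, balancing ease of use against ease of data entry.

    Strategy: keep one table per entity, and have the data import program automatically generate merged tables for common needs, such as a master table (named full_metadata) that merges all entity sub-tables.

    # After processing by the import program
     |- [entity sub-table] project.csv
     |- [entity sub-table] donor.csv
     |- [master table] full_metadata.csv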
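
The merge step described in item 5 could look roughly like the following. This is only a minimal sketch, not the actual import program: it assumes donor batch files carry a `project_id` column (as noted in the directory layout), joins just two entities, and uses hypothetical sample values and helper names (`read_rows`, `build_full_metadata`).

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def build_full_metadata(projects, donors):
    """Join each donor row with its project row via project_id (one-to-many)."""
    project_by_id = {p["project_id"]: p for p in projects}
    merged = []
    for donor in donors:
        # Start from the project columns, then add the donor columns.
        row = dict(project_by_id.get(donor["project_id"], {}))
        row.update(donor)
        merged.append(row)
    return merged


# Hypothetical sample content standing in for project/*.csv and donor/*.csv.
project_csv = "project_id,project_name\nP1,Quartet\n"
donor_batches = [
    "donor_id,project_id,gender\nD1,P1,F\nD2,P1,M\n",  # batch 1
    "donor_id,project_id,gender\nD3,P1,F\n",           # batch 2
]

projects = read_rows(project_csv)
# Concatenate all batches of one entity before joining.
donors = [row for batch in donor_batches for row in read_rows(batch)]
full_metadata = build_full_metadata(projects, donors)
```

Keeping the per-entity rows untouched and deriving `full_metadata` from them means the merged table can always be regenerated after a new batch is added.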
    

The Entities

Project

Donor

Biospecimen

Reference Materials

Library

Sequencing

DataFile

Relationship Between the Entities

flowchart LR
      project([Project]) -.-> donor[Donor]
      project --> reference_materials[Reference Materials]
      donor -.-> biospecimen
      biospecimen -.-> reference_materials
      reference_materials --> library[Library]
      library --> sequencing[Sequencing]
      sequencing --> datafile[Data File]
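
One practical use of the relationship graph is deciding the order in which entity tables should be imported so that every parent record exists before its children. The sketch below is an illustration under assumptions: the edges are copied from the flowchart, dotted and solid arrows are treated alike, and `load_order` is a hypothetical helper built on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# (parent, child) pairs copied from the flowchart above.
EDGES = [
    ("project", "donor"),
    ("project", "reference_materials"),
    ("donor", "biospecimen"),
    ("biospecimen", "reference_materials"),
    ("reference_materials", "library"),
    ("library", "sequencing"),
    ("sequencing", "datafile"),
]


def load_order(edges):
    """Return entity names ordered so that parents come before children."""
    predecessors = {}
    for parent, child in edges:
        predecessors.setdefault(parent, set())
        predecessors.setdefault(child, set()).add(parent)
    return list(TopologicalSorter(predecessors).static_order())


order = load_order(EDGES)
print(order)
```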

Legacy Specification

| key | name | short | description | type | collection | from |
| --- | --- | --- | --- | --- | --- | --- |
| project_id | Project Id | Project Id | Identity of the project. | category | quartet | project |
| project_name | Project Name | Project Name | Name of the project. | category | quartet | project |
| project_type | Project Type | Project Type | Type of the project. | category | quartet | project |
| project_description | Project Description | Project Description | Description of the project. | category | quartet | project |
| investigator_name | Investigator Name | Pi Name | Name of the investigator. | category | quartet | project |
| investigator_affiliation | Investigator Affiliation | Pi Aff | Affiliation of the investigator. | category | quartet | project |
| support_id | Support Id | Support Id | ID of the project funding. | category | quartet | project |
| support_source | Support Source | Support Source | Source of the project funding. | category | quartet | project |
| date_collected | Date Collected | Date Collect | Collection date. | number | quartet | project |
| availability_type | Availability Type | Avail Type | Data privacy. | category | quartet | project |
| donor_id | Donor Id | Donor Id | Identity of the donor. | category | quartet | donor |
| family_id | Family Id | Family Id | Identity of the family. | category | quartet | donor |
| pedigree | Pedigree | Pedigree | Pedigree of the family. | category | quartet | donor |
| gender | Gender | Gender | Gender of the donor. | category | quartet | donor |
| birth_date | Birth Date | Birthday | Birthday of the donor. | number | quartet | donor |
| biospecimen_id | Biospecimen Id | Biospecimen Id | Identity of the biospecimen. | category | quartet | biospecimen |
| biospecimen_name | Biospecimen Name | Biospecimen Name | Name of the biospecimen. | category | quartet | biospecimen |
| biospecimen_type | Biospecimen Type | Biospecimen Type | Type of the biospecimen. | category | quartet | biospecimen |
| collection_date | Collection Date | Collect Date | Sample collection date. | number | quartet | biospecimen |
| rm_id | Rm Id | Rm Id | Identity of the RM. | category | quartet | reference_materials |
| extraction_site | Extraction Site | Extract Site | Site of the extraction. | category | quartet | reference_materials |
| lot_no | Lot No | Lot No | Batch (lot) number. | number | quartet | reference_materials |
| cell_line_passage_number | Cell Line Passage Number | Clp Number | Number of cell line passage. | number | quartet | reference_materials |
| rm_type | Rm Type | Rm Type | Type of the RM. | category | quartet | reference_materials |
| source | Source | Source | Source of the sample (e.g., blood, cell). | category | quartet | reference_materials |
| cell_collection_date | Cell Collection Date | Cell Collect Date | Date of cell collection. | number | quartet | reference_materials |
| extraction_date | Extraction Date | Extract Date | Date of materials extraction. | number | quartet | reference_materials |
| kit_cat_no | Kit Cat No | Kit Cat No | Kit catalog number. | number | quartet | reference_materials |
| kit_lot_no | Kit Lot No | Kit Lot No | Kit lot number. | number | quartet | reference_materials |
| extraction_protocol | Extraction | Extract | Protocol of extraction. | number | quartet | reference_materials |
| library_id | Library Id | Library Id | Identity of the library. | number | quartet | library |
| input_ng | Input Ng | Input Ng | Concentration of input. | precision | quartet | library |
| enrich_kit | Enrich Kit | Enrich Kit | Kit of library enrichment. | category | quartet | library |
| preparation_kit | Preparation Kit | Prep Kit | Kit of preparation. | category | quartet | library |
| fragment_method | Fragment Method | Fragment Method | Method of fragmentation. | category | quartet | library |
| fragment_selection | Fragment Selection | Fragment Select | Selection of the fragment. | category | quartet | library |
| fragment_range | Fragment Range | Fragment Range | Range of the fragment. | precision | quartet | library |
| pcr_cycle | Pcr Cycle | Pcr Cycle | Number of PCR cycles. | number | quartet | library |
| preparation_date | Preparation Date | Prep Date | Date of preparation. | number | quartet | library |
| spike_in | Spike In | Spike In | Spike-in added in the library construction. | category | quartet | library |
| qc_concentration | Qc Concentration | Qc Conc | Concentration of quality control. | precision | quartet | library |
| qc_size | Qc Size | Qc Size | Size of quality control. | precision | quartet | library |
| preparation_method | Preparation Method | Prep Method | Method of preparation. | category | quartet | library |
| preparation_site | Preparation Site | Prep Site | Site of preparation. | category | quartet | library |
| batch | Library Batch | Lib Batch | Batch of library. | category | quartet | library |
| stranded | Library Type | Lib Type | Type of library. | category | quartet | library |
| sequencing_id | Sequencing Id | Seq Id | ID of sequencing. | number | quartet | sequencing |
| site | Site | Site | Site of sequencing. | category | quartet | sequencing |
| platform | Platform | Platform | Platform of sequencing. | category | quartet | sequencing |
| method | Method | Method | Method of sequencing. | category | quartet | sequencing |
| index_sequence | Index Sequence | Index Seq | Index of sequencing. | category | quartet | sequencing |
| flowcell_id | Flowcell Id | Flowcell Id | Identity of the flowcell. | category | quartet | sequencing |
| lane_no | Lane No | Lane No | Number of the lane. | number | quartet | sequencing |
| run_date | Run Date | Run Date | Date of process running. | number | quartet | sequencing |
| datafile_id | Datafile Id | Datafile Id | Identity of the data file. | category | quartet | datafile |
| submitter_id | Submitter Id | Submitter Id | Identity of the submitter. | category | quartet | datafile |
| data_type | Data Type | Data Type | Type of the data file. | category | quartet | datafile |
| data_category | Data Category | Data Cat | Category of the data file. | category | quartet | datafile |
| data_format | Data Format | Data Format | Format of the data file. | category | quartet | datafile |
| file_name | File Name | File Name | Name of the data file. | category | quartet | datafile |
| file_size | File Size | File Size | Size of the data file. | precision | quartet | datafile |
| md_5sum | Md5Sum | Md5 | The 128-bit hash value expressed as a 32-digit hexadecimal number (in lower case) used as a file's digital fingerprint. | category | quartet | datafile |
| file_path | File Path | File Path | Path of the data file. | category | quartet | datafile |
| node | Node Id | Node Id | Identity of the Node URL. | category | quartet | datafile |
| analyses_id | Analyses Id | Analyze Id | Identity of the analyses. | category | quartet | analyses |
| analyses_type | Analyses Type | Analyze Type | Type of the analyses. | category | quartet | analyses |
| link | Link | Link | Link of the analyses. | category | quartet | analyses |
| version | Version | Version | Version of the analyses. | category | quartet | analyses |
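
The `type` column can drive a simple check on incoming rows before they are imported. The sketch below is illustrative only: the specification does not define what `category`, `number`, and `precision` mean, so mapping them to string, integer, and floating-point values is an assumption, and `validate_record` and the sample values are hypothetical. Only three keys from the table are included to keep the example short.

```python
# Assumed interpretation of the spec's type names (not defined in the source).
PARSERS = {
    "category": str,
    "number": int,
    "precision": float,
}

# A few keys and their types, copied from the specification table.
FIELD_TYPES = {
    "platform": "category",
    "lane_no": "number",
    "input_ng": "precision",
}


def validate_record(record):
    """Return a list of human-readable problems found in one CSV row."""
    errors = []
    for key, type_name in FIELD_TYPES.items():
        value = record.get(key, "")
        if value == "":
            errors.append(f"{key}: missing value")
            continue
        try:
            PARSERS[type_name](value)
        except ValueError:
            errors.append(f"{key}: expected {type_name}, got {value!r}")
    return errors


# CSV readers yield strings, so the sample rows use string values.
ok = validate_record({"platform": "hypothetical-platform", "lane_no": "3", "input_ng": "200.5"})
bad = validate_record({"platform": "hypothetical-platform", "lane_no": "three", "input_ng": ""})
```

Returning a list of messages rather than raising on the first problem lets a submitter fix a whole batch file in one pass.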