Background - The N3C extractor script https://github.com/National-COVID-Cohort-Collaborative/Phenotype_Data_Acquisition can successfully export smaller CDM tables into CSVs. However, when table sizes grow unusually large (e.g., a measurement table with more than 2.5 billion rows), the BigQuery calls can time out.
This extractor can extract huge tables within a few minutes. It uses the CDM tables generated by the N3C script https://github.com/National-COVID-Cohort-Collaborative/Phenotype_Data_Acquisition/blob/master/Exporters/PythonExporter_bigquery/db_exp.py
To run this extractor -
[Optional] If you plan NOT to include the controls data, create a copy of //Phenotype_Data_Acquisition/PhenotypeScripts/N3C_phenotype_omop_bigquery.sql named //Phenotype_Data_Acquisition/PhenotypeScripts/N3C_phenotype_omop_bigquery_no_controls.sql and update -

insert into @resultsDatabaseSchema.n3c_cohort select distinct case_person_id as person_id from @resultsDatabaseSchema.n3c_control_map
union distinct select distinct control_person_id from @resultsDatabaseSchema.n3c_control_map where control_person_id is not null;

to be -

insert into @resultsDatabaseSchema.n3c_cohort select distinct case_person_id as person_id from @resultsDatabaseSchema.n3c_control_map;
Run https://github.com/National-COVID-Cohort-Collaborative/Phenotype_Data_Acquisition/blob/master/Exporters/PythonExporter_bigquery/db_exp.py with the following command -
- [Optional] If you plan NOT to include the controls data, run -
- nohup python3 db_exp.py --debug --storedata --nocsv --database bigquery --config config.ini --output_dir --phenotype //Phenotype_Data_Acquisition/PhenotypeScripts/N3C_phenotype_omop_bigquery_no_controls.sql --extract //Phenotype_Data_Acquisition/ExtractScripts/N3C_extract_omop_bigquery.sql | tee <output_log>
- Otherwise run -
- nohup python3 db_exp.py --debug --storedata --nocsv --database bigquery --config config.ini --output_dir --phenotype //Phenotype_Data_Acquisition/PhenotypeScripts/N3C_phenotype_omop_bigquery.sql --extract //Phenotype_Data_Acquisition/ExtractScripts/N3C_extract_omop_bigquery.sql | tee <output_log>
- The --storedata option stores the tables in the subset dataset
- The --nocsv option ensures the CDM table contents do not get extracted to CSVs
Update the config.yaml file for your GCP environment, i.e., set the following configuration values (a sample sketch follows this list) -
- project_id: GCP project id where Spark batches can be run
- service_account: service account email (the same IAM account you used for running the N3C scripts)
- region: GCP region where Spark batches can be run
- subnetwork_uri: GCP subnet
- gcs_bucket: GCS bucket
- prefix: bucket prefix (e.g., extractor) where the temporary files, configuration files, etc. are staged and from which the Spark batch reads them
- output_dir: local folder the extractor downloads CSV files to. NOTE: If executing on a Windows host, use Windows path syntax (e.g., C:\output_dir). Do not use Linux syntax (/c/output_dir).
- additional_cdm_tables: OPTIONAL. If additional_cdm_tables is not supplied, only the following tables are extracted - [person,observation_period,visit_occurrence,condition_occurrence,drug_exposure,device_exposure,procedure_occurrence,measurement,observation,death,location,care_site,provider,drug_era,condition_era]. For any additional CDM tables, supply the remaining table list and the SQL for those tables (refer to the config_example.yaml file). Only add the table name (e.g., note_nlp) to the select clause in config.yaml; do not include project_id.dataset in the SQL statements there. No spaces are allowed between table names in the additional_cdm_tables field.
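For illustration, here is a minimal config.yaml sketch. Every value below is a hypothetical placeholder and must be replaced with your environment's settings; the commented keys are the optional settings described in this list, the NOTES, and the version history:

  project_id: my-gcp-project                    # hypothetical GCP project id
  service_account: n3c-runner@my-gcp-project.iam.gserviceaccount.com
  region: us-central1
  subnetwork_uri: projects/my-gcp-project/regions/us-central1/subnetworks/my-subnet
  gcs_bucket: my-n3c-bucket
  prefix: extractor                             # staging prefix inside the bucket
  output_dir: /home/me/n3c_csvs                 # on a Windows host use e.g. C:\n3c_csvs
  # additional_cdm_tables: note,note_nlp        # optional; no spaces between names; also supply
  #                                             # the SQL for those tables - see config_example.yaml
  # timeout_in_min: 60                          # optional; raise if the Spark batch needs longer
  # delete_merged_csvs_from_bucket: false       # optional; controls deletion of merged CSVs from the bucket (see 1.0.2)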
Clone the repository -
Build the module -
- python -m build
Install the package -
- pip install dist/n3c_spark_extractor-1.0.4.tar.gz
Run -
- n3c_spark_extractor --config config.yaml
Zip the CSVs and sftp them as described at https://github.com/National-COVID-Cohort-Collaborative/Phenotype_Data_Acquisition/wiki/Instructions-for-Sites:-OMOP-Data-Model
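As a rough sketch (the archive name, paths, and sftp destination below are placeholders; follow the wiki page above for the required naming and upload destination):

  cd /home/me/n3c_csvs                # the output_dir from config.yaml
  zip -r n3c_export.zip .             # bundle the extracted CSVs into one archive
  sftp <site_user>@<sftp_host>        # upload per the wiki instructions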
NOTES -
- If executing on a Windows host, use Windows path syntax (e.g., C:\output_dir). Do not use Linux syntax (/c/output_dir).
- The timeout_in_min is set to 60 minutes; the pyspark batch generally takes 15-20 minutes to complete. If in the future it takes more than 60 minutes, the timeout_in_min config can be set to a higher value.
- If you happen to kill the n3c_spark_extractor process and the Spark job was already triggered, wait until it finishes before you run n3c_spark_extractor again.
- DO NOT CHANGE the configuration values in batch_config.yml
Version history -
- 1.0.0 - Initial version
- 1.0.1 - Fix for passing the bigquery-jar file location in the yaml file
- 1.0.2 - Added a configuration delete_merged_csvs_from_bucket to disable deletion of merged CSV files from the GCP bucket
- 1.0.3 - Removed timestamps in MANIFEST.csv and updated config_example.yaml
- 1.0.4 - Changed column names in MANIFEST.csv to be uppercase