This project provides a command-line interface for running data processing pipelines. The main entry point is main.py, which allows you to execute various pipelines with configurable options.
- Python 3.12 or higher
- Required dependencies (see `requirements.txt` or `pyproject.toml`)
First, if using pip, create a virtual environment:

```
python -m venv .venv
```

If using uv, the virtual environment will be managed by the tool.
Install the project dependencies:
Using pip:
```
source .venv/bin/activate        # Activate the virtual env
pip install -r requirements.txt  # Install the dependencies
```

or uv:

```
uv sync
```

This script requires an OpenAI API key to use the agentic classifier. First copy the `tpl.dotenv` file to `.env` and set the key as follows:
```
OPENAI_API_KEY=<your_api_key>
```

The script is executed using Python's `-m` flag or by running `main.py` directly:
```
python main.py [PIPELINE_NAME] [OPTIONS]
```

- `PIPELINE_NAME` (optional): The name of the pipeline to run. Defaults to `open_neuro` if not specified. Note: this is the only pipeline that exists at this time.
  - Available pipelines: `open_neuro`
- `--output-dir` (optional): Directory where pipeline output will be stored. Defaults to `output/`.
  - Example: `--output-dir /path/to/output`
- `--clean` (optional): Run the pipeline without using any precomputed or previous output. When this flag is set, the pipeline performs a fresh run from scratch.
  - Example: `--clean`
- `--full-text-limit` (optional): Limit the number of full-text publications fetched per dataset.
  - Example: `--full-text-limit=10`
The pipeline consists of 4 steps:
- ETL: Calls the OpenNeuro API, extracts the dataset metadata, transforms it into DATS, and saves the contents as JSONL.
- Publication extraction: Uses the dataset identifiers (title and dataset ID) to search PMC eSearch for publications that include these terms, then downloads their full text to the `output/full_text` directory. Also stores the list of mentions to `dataset_publication_mentions.csv`.
- Agentic classifier: For each dataset in `dataset_publication_mentions.csv`, extracts the sections mentioning the dataset from the full text and passes them to the agent to classify the usage type. Outputs results to `dataset_usage_classification.csv`.
- S-index calculation: Uses the dataset metadata and agent classifications to compute each author's S-index. See the final S-index output below for the format.
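Conceptually, the four steps chain together as shown in the sketch below. This is illustrative only: the function names and signatures are hypothetical and do not reflect the actual module layout of `main.py`; each step consumes the previous step's output file.

```python
from pathlib import Path

# Illustrative stubs only; the real implementations live in main.py and
# its modules. Each step returns the path of the file it would produce.
def run_etl(output_dir: Path) -> Path:
    """Step 1: fetch OpenNeuro metadata, transform to DATS, save as JSONL."""
    transformed = output_dir / "etl" / "transformed"
    transformed.mkdir(parents=True, exist_ok=True)
    return transformed / "0.transformed.jsonl"

def extract_publications(datasets_jsonl: Path, output_dir: Path) -> Path:
    """Step 2: search PMC eSearch per dataset, download full texts."""
    return output_dir / "dataset_publication_mentions.csv"

def classify_usage(mentions_csv: Path, output_dir: Path) -> Path:
    """Step 3: agent classifies how each publication uses the dataset."""
    return output_dir / "dataset_usage_classification.csv"

def compute_s_index(classifications_csv: Path, output_dir: Path) -> Path:
    """Step 4: combine metadata and classifications into per-author S-index."""
    return output_dir / "s_index_final_output.jsonl"

def run_pipeline(output_dir: Path) -> Path:
    datasets = run_etl(output_dir)
    mentions = extract_publications(datasets, output_dir)
    classifications = classify_usage(mentions, output_dir)
    return compute_s_index(classifications, output_dir)
```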
Run a specific pipeline:
```
python main.py open_neuro
```

Specify a custom output directory:

```
python main.py --output-dir results
```

Run with clean mode (no precomputed data):

```
python main.py --clean
```

Limit the number of full-text publications to fetch per dataset:

```
python main.py --full-text-limit=10
```

Combine options:

```
python main.py open_neuro --output-dir custom_output --clean
```

We understand that some steps of this pipeline require a great deal of time to run. To work around this, the script will prompt you to reuse our precomputed outputs found in the `/data` directory.
These precomputed outputs work as-is if you reuse them for steps 1-3. However, if you want to run the classifier yourself against the precomputed outputs of step 2, you will need the full PMC archive of papers, which totals around 650 GB. This is not recommended, but is offered for completeness. Refer to `dataset_publication_mentions.csv` for information on where to extract the XML documents.
In summary, to retrieve the same S-index calculations uploaded to our application, use the precomputed outputs for steps 1-3.
To view available commands and options:
```
python main.py --help
```

Pipeline results are written to the specified output directory (default: `output/`). The structure is as follows:
```
output
├── open_neuro
│   ├── etl
│   │   ├── raw
│   │   │   └── 0.raw.jsonl
│   │   └── transformed
│   │       └── 0.transformed.jsonl
│   ├── full_text
│   │   ├── PMC_1.xml
│   │   ├── PMC_2.xml
│   │   ├── PMC_3.xml
│   │   ├── ...
│   │   └── PMC_N.xml
│   ├── dataset_publication_mentions.csv
│   ├── dataset_usage_classification.csv
│   └── s_index_final_output.jsonl
```

The output directory contains the following components:
- `open_neuro/`: Pipeline-specific output directory for the OpenNeuro pipeline.
- `etl/`: ETL (Extract, Transform, Load) processing directory.
  - `raw/`: Contains raw extracted data in JSONL format (`0.raw.jsonl`), representing the initial unprocessed data extracted from the source.
  - `transformed/`: Contains transformed/processed data (`0.transformed.jsonl`), representing cleaned and normalized data in DATS format after the transformation steps.
- `full_text/`: Directory containing full-text XML files from PubMed Central (PMC). Each file represents a complete publication in XML format, identified by its PMC ID.
- `dataset_publication_mentions.csv`: A CSV file that maps datasets to their associated publication mentions, linking datasets with the publications that reference them.
- `dataset_usage_classification.csv`: Contains the dataset ID, the usage classification, and the PMC ID of the publication mentioning it.
- `s_index_final_output.jsonl`: The final processed output file in JSONL (JSON Lines) format, containing the complete S-index data ready for use.
The final output `s_index_final_output.jsonl` contains the S-index for each author, along with per-dataset usage and mention counts, dataset IDs, and PMC IDs:
```
{
  "name": "Tom Schonberg",
  "s_index": 2,
  "datasets": [
    {
      "dataset_id": "ds003782",
      "used": 0,
      "mentioned": 0
    },
    {
      "dataset_id": "ds000001",
      "used": 3,
      "mentioned": 1,
      "pmc_ids": {
        "used": [
          "PMC6618324",
          "PMC7927288",
          "PMC8764489"
        ],
        "mentioned": [
          "PMC3366349"
        ]
      }
    },
    {
      "dataset_id": "ds001417",
      "used": 0,
      "mentioned": 0,
      "pmc_ids": {
        "used": [],
        "mentioned": []
      }
    },
    {
      "dataset_id": "ds001734",
      "used": 5,
      "mentioned": 1,
      "pmc_ids": {
        "used": [
          "PMC12319750",
          "PMC12230699",
          "PMC12247560",
          "PMC11627503",
          "PMC7578785"
        ],
        "mentioned": [
          "PMC7771346"
        ]
      }
    }
  ]
}
```

To review our pipeline, run the script and accept the precomputed outputs for steps 1-3. This will create the S-index calculations for all of OpenNeuro.
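A minimal reader for this file might look like the following sketch, assuming one author record per JSONL line as shown above (`load_s_index` and `dataset_totals` are hypothetical helper names, not part of the project):

```python
import json

def load_s_index(path):
    """Yield one author record per non-empty line of the JSONL file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def dataset_totals(author):
    """Sum the per-dataset 'used' and 'mentioned' counts for one author."""
    used = sum(d.get("used", 0) for d in author["datasets"])
    mentioned = sum(d.get("mentioned", 0) for d in author["datasets"])
    return used, mentioned
```

For the example record above, `dataset_totals` would return `(8, 2)`: 3 + 5 uses and 1 + 1 mentions across the four datasets.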
For a rigorous test, run the pipeline without precomputed outputs:

```
python main.py --clean
```

Note: this will take hours to complete. On average, the ETL step takes ~30 minutes, publication extraction ~1-2 days (with no limit set), and classification ~2 hours. To save time, run with the `--full-text-limit` flag to reduce the number of publications fetched per dataset.
During the ETL step, a few server-side issues with the OpenNeuro API may present themselves. These are outside of our control, but we document them here in case you face similar issues.
Rarely, the script may fail with a cryptic gql error. If encountered, cancel and re-run the script. If this happens repeatedly, try running at a different time.
More commonly, you may see an error message saying `Failed on cursor eyJvZ...` along with a large JSON output. This can be ignored as long as you see the message `Found 1100 datasets` afterwards.
Additionally, you may see output saying `Found 100 datasets`. This happens when the cursor for the next page request failed and only the first 100 results were fetched. As before, re-running the script should resolve the issue and show the correct number of datasets (around 1100).
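One quick sanity check after the ETL step is to count the records in the raw JSONL output (path per the layout above). This sketch assumes one dataset record per non-empty line:

```python
from pathlib import Path

def count_datasets(raw_jsonl: Path) -> int:
    """Count dataset records, assuming one JSON object per non-empty line."""
    with open(raw_jsonl) as f:
        return sum(1 for line in f if line.strip())

# Expect roughly 1100 for a full OpenNeuro run; a count near 100 suggests
# the pagination cursor failed and the pipeline should be re-run, e.g.:
# count_datasets(Path("output/open_neuro/etl/raw/0.raw.jsonl"))
```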