BIDS-Xu-Lab/s-index-pipeline


S-Index Pipeline

This project provides a command-line interface for running data processing pipelines. The main entry point is main.py, which allows you to execute various pipelines with configurable options.

Prerequisites

  • Python 3.12 or higher
  • Required dependencies (see requirements.txt or pyproject.toml)

Installation

First, if using pip, create a virtual environment:

python -m venv .venv

If you are using uv, the virtual environment will be managed by the tool.

Install the project dependencies:

Using pip:

source .venv/bin/activate # Activate the virtual env
pip install -r requirements.txt # Install the dependencies

or uv:

uv sync

Setting up environment variables

This script requires an OpenAI API key to use the agentic classifier. First copy the tpl.dotenv file to .env and set the key as follows:

OPENAI_API_KEY=<your_api_key>
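Before launching a long run, it can help to confirm the key is actually set. The helper below is our own sketch, not part of the project:

```python
import os

# Sketch (not part of the pipeline): fail fast if the key is missing,
# rather than erroring midway through the classification step.
def check_openai_key() -> str:
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("OPENAI_API_KEY is not set; copy tpl.dotenv to .env and fill it in")
    return key
```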

Usage

The script is executed by running main.py with Python:

python main.py 

Command Syntax

python main.py [PIPELINE_NAME] [OPTIONS]

Arguments

  • PIPELINE_NAME (optional): The name of the pipeline to run. Defaults to open_neuro if not specified. Note: this is the only pipeline that exists at this time.
    • Available pipelines: open_neuro

Options

  • --output-dir (optional): Directory where pipeline output will be stored. Defaults to output/.

    • Example: --output-dir /path/to/output
  • --clean (optional): Run the pipeline without using any precomputed or previous output. When this flag is set, the pipeline performs a fresh run from scratch.

    • Example: --clean
  • --full-text-limit (optional): Maximum number of full-text publications to fetch per dataset.

    • Example: --full-text-limit=10
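For reference, the documented interface can be sketched with argparse. This is an illustration of the flags described above, not the actual parser in main.py:

```python
import argparse

# Illustrative sketch of the documented CLI; main.py's real parser may differ.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("pipeline", nargs="?", default="open_neuro",
                        choices=["open_neuro"],
                        help="Pipeline to run (only open_neuro exists today)")
    parser.add_argument("--output-dir", default="output/",
                        help="Directory where pipeline output is stored")
    parser.add_argument("--clean", action="store_true",
                        help="Ignore precomputed or previous output")
    parser.add_argument("--full-text-limit", type=int, default=None,
                        help="Max full-text publications to fetch per dataset")
    return parser

args = build_parser().parse_args(["open_neuro", "--clean", "--full-text-limit", "10"])
```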

Steps

The pipeline consists of 4 steps:

  1. ETL: Calls the OpenNeuro API, extracts the dataset metadata, transforms it into DATS format, and saves the contents as JSONL.
  2. Publication extraction: Uses the dataset identifiers (title and dataset ID) to search PMC eSearch for publications that include these terms, then downloads their full text to the output/full_text directory. Also stores the list of mentions in dataset_publication_mentions.csv.
  3. Agentic classification: For each dataset in dataset_publication_mentions.csv, extracts the sections of the full text that mention the dataset and passes them to the agent to classify the usage type. Outputs results to dataset_usage_classification.csv.
  4. S-index calculation: Uses the dataset metadata and agent classifications to compute each author's S-index. See Final S-index output below for the format.
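The tallying in step 4 can be sketched as below. The column names (dataset_id, classification, pmc_id) and the used/mentioned labels are assumptions inferred from the output format, not the pipeline's actual schema:

```python
from collections import defaultdict

# Hedged sketch: tally "used"/"mentioned" classifications per dataset
# from rows shaped like dataset_usage_classification.csv.
# Column names and labels are assumptions, not the pipeline's actual code.
def aggregate_classifications(rows):
    counts = defaultdict(lambda: {"used": 0, "mentioned": 0,
                                  "pmc_ids": {"used": [], "mentioned": []}})
    for row in rows:
        label = row["classification"]
        if label in ("used", "mentioned"):
            entry = counts[row["dataset_id"]]
            entry[label] += 1
            entry["pmc_ids"][label].append(row["pmc_id"])
    return dict(counts)

sample = [
    {"dataset_id": "ds000001", "classification": "used", "pmc_id": "PMC6618324"},
    {"dataset_id": "ds000001", "classification": "mentioned", "pmc_id": "PMC3366349"},
]
tallies = aggregate_classifications(sample)
```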

Examples

Run a specific pipeline:

python main.py open_neuro

Specify a custom output directory:

python main.py --output-dir results

Run with clean mode (no precomputed data):

python main.py --clean

Limit the number of full text publications to fetch per dataset:

python main.py --full-text-limit=10

Combine options:

python main.py open_neuro --output-dir custom_output --clean

Running with precomputed outputs

We understand that some steps of this pipeline take a great deal of time to run. To avoid this, the script will prompt you to reuse our precomputed outputs, found in the /data directory.

These precomputed outputs work as a set for steps 1-3. However, if you want to run the classifier using only the precomputed outputs for step 2, you will need the full PMC archive of papers, which totals around 650 GB. This is not recommended, but is offered for completeness. Please refer to dataset_publication_mentions.csv for information on where to extract the XML documents.

In summary, to retrieve the same S-index calculations uploaded to our application, use the precomputed outputs for steps 1-3.

Getting Help

To view available commands and options:

python main.py --help

Output

Pipeline results are written to the specified output directory (default: output/). The structure will be as follows:

output
├── open_neuro
│   ├── etl
│   │   ├── raw
│   │   │   └── 0.raw.jsonl
│   │   └── transformed
│   │       └── 0.transformed.jsonl
│   ├── full_text
│   │   ├── PMC_1.xml
│   │   ├── PMC_2.xml
│   │   ├── PMC_3.xml
│   │   ├── ...
│   │   └── PMC_N.xml
│   ├── dataset_publication_mentions.csv
│   ├── dataset_usage_classification.csv
│   └── s_index_final_output.jsonl

The output directory contains the following components:

  • open_neuro/: Pipeline-specific output directory for the OpenNeuro pipeline
    • etl/: ETL (Extract, Transform, Load) processing directory
      • raw/: Contains raw extracted data in JSONL format (0.raw.jsonl), representing the initial unprocessed data extracted from the source.
      • transformed/: Contains transformed/processed data (0.transformed.jsonl), representing cleaned and normalized data after transformation steps in DATS format.
    • full_text/: Directory containing full-text XML files from PubMed Central (PMC). Each file represents a complete publication in XML format, identified by its PMC ID.
    • dataset_publication_mentions.csv: A CSV file that maps datasets to their associated publication mentions, linking datasets with the publications that reference them.
    • dataset_usage_classification.csv: Contains the dataset ID, the usage classification, and the PMC ID of the publication mentioning it.
    • s_index_final_output.jsonl: The final processed output file in JSONL (JSON Lines) format, containing the complete S-Index data ready for use.

Final S-index output

The final output s_index_final_output.jsonl contains the S-index for each author, along with per-dataset usage and mention counts, dataset IDs, and PMC IDs:

{
  "name": "Tom Schonberg",
  "s_index": 2,
  "datasets": [
    {
      "dataset_id": "ds003782",
      "used": 0,
      "mentioned": 0
    },
    {
      "dataset_id": "ds000001",
      "used": 3,
      "mentioned": 1,
      "pmc_ids": {
        "used": [
          "PMC6618324",
          "PMC7927288",
          "PMC8764489"
        ],
        "mentioned": [
          "PMC3366349"
        ]
      }
    },
    {
      "dataset_id": "ds001417",
      "used": 0,
      "mentioned": 0,
      "pmc_ids": {
        "used": [],
        "mentioned": []
      }
    },
    {
      "dataset_id": "ds001734",
      "used": 5,
      "mentioned": 1,
      "pmc_ids": {
        "used": [
          "PMC12319750",
          "PMC12230699",
          "PMC12247560",
          "PMC11627503",
          "PMC7578785"
        ],
        "mentioned": [
          "PMC7771346"
        ]
      }
    }
  ]
}
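Since the file is JSON Lines, each line is an independent record like the one above and can be consumed with a few lines of Python. A sketch using the field names from the example (the helper names are ours):

```python
import json

# Sketch: load s_index_final_output.jsonl into a list of author records.
def load_s_index(path):
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Sketch: rank authors by the "s_index" field shown in the example record.
def top_authors(records, n=5):
    return sorted(records, key=lambda r: r["s_index"], reverse=True)[:n]
```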

Evaluation and Review

To review our pipeline, run the script and accept the precomputed outputs for steps 1-3. This will produce the S-index calculations for all of OpenNeuro.

For a rigorous test, run the pipeline without precomputed outputs:

python main.py --clean

Note: this will take hours to complete. On average, the ETL step takes ~30 minutes, publication extraction ~1-2 days (with no limit set), and classification ~2 hours. To save time, run with the --full-text-limit flag to reduce the number of publications fetched per dataset.

Common issues

During the ETL step, a few server side issues with the OpenNeuro API may present themselves. These are outside of our control, but we decided to document them here in case you face similar issues.

Rarely, the script may fail with a cryptic gql error. If this happens, cancel and re-run the script. If it happens repeatedly, try running at a different time.

More commonly, you may see an error message saying Failed on cursor eyJvZ.... along with a large JSON output. This can be ignored as long as you see the message Found 1100 datasets after it.

Additionally, you may see output reporting that only 100 datasets were identified. This happens when the cursor for the next page request failed and only the first 100 results were fetched. As before, re-running the script should resolve the issue and show the correct number of datasets (around 1100).
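If you script around the pipeline, the "cancel and re-run" advice can be automated with a simple retry wrapper. This is our own sketch, not something the pipeline provides:

```python
import time

# Sketch (not part of the pipeline): retry a flaky call a few times,
# mirroring the manual "cancel and re-run" advice for transient API errors.
def with_retries(fn, attempts=3, delay=2.0):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```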

About

Our s-index pipeline for the NIH Data Sharing Index (S-index) Challenge
