Bulkrax imports

Dan Kerchner edited this page Dec 15, 2024 · 7 revisions

Importing ETDs from ProQuest

Prerequisites

.env values are populated for:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION
  • AWS_PROQUEST_ETD_BUCKET_NAME

Download new (and only new) ETDs from S3

  1. Make sure that /opt/scholarspace/scholarspace-ingest/etd_zips is empty. If it is not empty, remove all files from it.

  2. Run the gwss:download_new_pq_zips task, which requires a destination path argument. For example: bundle exec rails gwss:download_new_pq_zips['/opt/scholarspace/scholarspace-ingest/etd_zips']
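The two steps above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: dir_is_empty and download_new_etds are hypothetical helper names, and the sketch assumes it is run from the application directory, where bundle exec rails works.

```shell
# Hypothetical helpers; not part of the ScholarSpace codebase.

# Succeeds only if the directory exists and has no entries (dotfiles included).
dir_is_empty() {
  [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Refuses to download unless the destination is empty,
# then runs the rake task named in the steps above.
download_new_etds() {
  local dest="$1"
  if ! dir_is_empty "$dest"; then
    echo "Refusing to download: $dest is missing or not empty" >&2
    return 1
  fi
  # Quoting the task keeps the shell from treating [ ] as a glob pattern.
  bundle exec rails "gwss:download_new_pq_zips[$dest]"
}
```

For example: download_new_etds /opt/scholarspace/scholarspace-ingest/etd_zips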

Create Bulkrax manifest

Run the ingest_pq_etds rake task, either inside the container or from outside it using docker exec. The task requires one argument: the path to the directory containing the ProQuest zips you wish to include in the ingest. For example, if the ETDs are in /opt/scholarspace/scholarspace-ingest/etd_zips: bundle exec rails gwss:ingest_pq_etds['/opt/scholarspace/scholarspace-ingest/etd_zips']
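The two ways of running the task (inside the container, or from outside via docker exec) can be sketched as a small helper that prints the command for review. This is an illustration only: pq_ingest_command is a hypothetical name, and SCHOLARSPACE_CONTAINER is an assumed variable holding your application container's name, which this page does not specify.

```shell
# Hypothetical helper; not part of the ScholarSpace codebase.
# Prints the ingest command for a given zip directory so it can be reviewed first.
# If SCHOLARSPACE_CONTAINER (an assumed variable) is set, the command is
# wrapped in docker exec; otherwise it is meant to be run inside the container.
pq_ingest_command() {
  local zips_dir="$1"
  local task="gwss:ingest_pq_etds[$zips_dir]"
  if [ -n "${SCHOLARSPACE_CONTAINER:-}" ]; then
    printf "docker exec %s bundle exec rails '%s'\n" "$SCHOLARSPACE_CONTAINER" "$task"
  else
    printf "bundle exec rails '%s'\n" "$task"
  fi
}
```

Call pq_ingest_command with the zip directory as its argument, check the printed command, then paste it into the appropriate shell.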

The Bulkrax manifest will be written to a bulkrax_zip directory, inside the directory given by the TEMP_FILE_BASE environment variable (typically set in .env). The manifest directory contains:

  • a metadata.csv Bulkrax-compliant manifest file
  • a files directory, containing a directory for each ETD zip, which itself contains:
    • the ProQuest XML file
    • the main ETD PDF
    • optionally, a folder containing additional attachments for the ETD
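Before importing, the layout described above can be checked with a quick script. The following is a minimal sketch under stated assumptions: check_manifest is a hypothetical helper, and it assumes the XML and PDF files carry lowercase .xml and .pdf extensions, which this page does not confirm.

```shell
# Hypothetical helper; not part of the ScholarSpace codebase.
# Verifies that a manifest directory has metadata.csv, a files directory,
# and at least one .xml and one .pdf in each per-ETD directory.
# Assumption: lowercase .xml / .pdf extensions.
check_manifest() {
  local dir="$1"
  local ok=0
  local etd
  [ -f "$dir/metadata.csv" ] || { echo "missing: $dir/metadata.csv" >&2; ok=1; }
  [ -d "$dir/files" ] || { echo "missing: $dir/files" >&2; ok=1; }
  for etd in "$dir"/files/*/; do
    [ -d "$etd" ] || continue   # no per-ETD directories matched
    ls "$etd"*.xml >/dev/null 2>&1 || { echo "no XML in $etd" >&2; ok=1; }
    ls "$etd"*.pdf >/dev/null 2>&1 || { echo "no PDF in $etd" >&2; ok=1; }
  done
  return "$ok"
}
```

Pass the manifest directory as the only argument; a non-zero exit status means at least one expected item was not found.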

Import the Bulkrax manifest

Within the GW ScholarSpace web application, log in as an administrative user. On the Dashboard, click on Importers. Create a New importer with the following values:

  • Name = any name
  • Administrative Set = ETDs
  • Frequency = Once (on save)
  • Limit = leave blank
  • Parser = CSV - Comma Separated Values
  • Visibility = Public
  • Rights Statement = leave blank
  • Add CSV File to Import: Specify a Path on the Server. Import file path = {TEMP_FILE_BASE}/bulkrax_zip/metadata.csv

Before starting the import, open a tab to the Sidekiq administrator (at /sidekiq) so that you can watch progress of the queues and monitor for any problems.

Then proceed and click Create and Import.

Note: If you wish to re-run the task that generates the Bulkrax-ready metadata and files, first clear out the results of the previous run: rm -r {TEMP_FILE_BASE}/bulkrax_zip

Clean up

  1. Remove all files downloaded to /opt/scholarspace/scholarspace-ingest/etd_zips (so that they won't be re-loaded next time).

  2. Remove the Bulkrax metadata.csv and files directory: rm -r {TEMP_FILE_BASE}/bulkrax_zip
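Because rm -r with an unset or empty TEMP_FILE_BASE would point at /bulkrax_zip, a small guard can make the second step safer. The following is a sketch only; remove_bulkrax_manifest is a hypothetical helper name.

```shell
# Hypothetical helper; not part of the ScholarSpace codebase.
# Removes <base>/bulkrax_zip, refusing to run if no base directory is given.
remove_bulkrax_manifest() {
  local base="${1:-}"
  if [ -z "$base" ]; then
    echo "Base directory (TEMP_FILE_BASE) is empty; nothing removed" >&2
    return 1
  fi
  local target="$base/bulkrax_zip"
  [ -e "$target" ] || return 0   # already clean
  rm -r -- "$target"
}
```

For example: remove_bulkrax_manifest "$TEMP_FILE_BASE"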

Importing other works using Bulkrax

You will need to create a zip file containing:

  • metadata.csv (TODO: provide example metadata.csv). Column names should be: "model", "title", "creator", "contributor", "language", "description", "keyword", "degree", "resource_type", "advisor", "gw_affiliation", "date_created", "committee_member", "rights_statement", "license", "proquest_zipfile", "bulkrax_identifier", "file", "parents", "visibility", "visibility_during_embargo", "visibility_after_embargo", "embargo_release_date"
  • a files directory containing attachments referenced in metadata.csv
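As a starting point, the skeleton of such an import could be created as follows. This is a sketch, not an example metadata.csv: it writes only the header row with the column names listed above, and init_import_dir and METADATA_COLUMNS are hypothetical names.

```shell
# Hypothetical helper; not part of the ScholarSpace codebase.
# Column names copied from the list above, in the same order.
METADATA_COLUMNS='model,title,creator,contributor,language,description,keyword,degree,resource_type,advisor,gw_affiliation,date_created,committee_member,rights_statement,license,proquest_zipfile,bulkrax_identifier,file,parents,visibility,visibility_during_embargo,visibility_after_embargo,embargo_release_date'

# Creates <dir>/files and a <dir>/metadata.csv containing only the header row.
# Overwrites an existing metadata.csv in <dir>.
init_import_dir() {
  local dir="$1"
  mkdir -p "$dir/files"
  printf '%s\n' "$METADATA_COLUMNS" > "$dir/metadata.csv"
}
```

After adding one row per work and copying the attachments into files, the directory contents can be zipped. Placing metadata.csv and files at the top level of the zip is an assumption here, since this page does not state the expected layout.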

Troubleshooting FAQ

Q: When I create an importer, the administrative set that I wish to import to isn't showing up in the dropdown list.

A: This can occur when your user has the admin role, and can therefore access /importers, but does not have the contentadmin role. Users with the contentadmin role can import to any admin set. Try adding the contentadmin role to your administrative user.