
[Metadata Improvement]: Address duplication between DDE and other sources #162

Open · 2 of 17 tasks
gtsueng opened this issue Aug 21, 2024 · 4 comments

gtsueng (Contributor) commented Aug 21, 2024

Issue Name

Address duplication between DDE and other sources

Issue Description

Metadata submitted via the DDE come in through program/project-specific portals. For many of these submissions, the metadata is meant to supplement a record that has already been submitted elsewhere. Because it would be very difficult to merge a DDE record with a metadata record held elsewhere, an interim solution may be to link the records using the sameAs field. This field accepts an array of URLs, so it could be used within the NDE Discovery Portal to link two records that describe the same resource but cannot readily be merged.
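As a rough illustration of the proposed interim linking, a DDE-derived record could carry the URL of its counterpart in sameAs (a standard schema.org property that accepts an array of URLs). A minimal sketch in Python, assuming schema.org-style Dataset metadata; the identifiers and URLs below are placeholders, not real records:

```python
import json

# Hypothetical DDE-derived record; the @id and the GEO URL are placeholders,
# not actual NDE or DDE identifiers.
dde_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://example.org/dde/dataset-123",
    "name": "Supplementary metadata submitted via the DDE",
    "sameAs": [
        # URL of the duplicate record held in another repository
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE00000"
    ],
}

print(json.dumps(dde_record, indent=2))
```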

Issue Discussion

Duplicate records between the DDE and other resources have come up multiple times in earlier de-duplication discussions (OMICS-DI vs. SRA, OMICS-DI vs. GEO, and within Zenodo).

Please select the type of metadata improvement

  • Standardization (normalizing free text to an ontology)
  • Augmentation (adding values to metadata fields that are missing them)
  • Clean up (addressing redundancy or messy metadata)
  • Structure (changing the structure of the metadata to support front-end UI features)

Meta URL

No response

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/14

For internal use only. Assignee, please select the status of this issue

  • Not yet started
  • In progress
  • Blocked
  • Will not address

Status Description

No response

Request status check list

  • This metadata improvement has yet to be discussed between NIAID, Scripps, Leidos
  • This metadata improvement does not need to be discussed between NIAID, Scripps, Leidos
  • This metadata improvement has been discussed/reported between NIAID, Scripps, Leidos
  • This metadata improvement has been implemented locally to generate data for review
  • This metadata improvement has been implemented on Dev
  • This metadata improvement has been implemented on Dev and the results have been reviewed and approved for staging
  • This metadata improvement has been implemented on Staging
  • This page/documentation/change has been approved for Production
  • This page/documentation/change has been implemented on Production
@gtsueng added the enhancement (New feature or request) label on Aug 21, 2024
gtsueng (Contributor, Author) commented Sep 30, 2024

Related to this issue (received 2024.09.30 from CREID):

During the process of creating the bulk-upload Excel file, I was able to identify a set of 58 dataset records in the NDE that need to be flagged as CREID assets. The limitation of the current flagging system is that it does not account for assets that come in from other repositories. There needs to be a way to use grant numbers to indicate that a dataset should be flagged as part of CREID; otherwise we limit ourselves to only the assets registered on the DDE.

To get the list of 58, I searched PubMed for all of the CREID grant numbers. There are currently almost 700 PubMed IDs that list a CREID grant number. I ran a query in the NDE and found 58 datasets associated with these PMIDs. Is it possible for your group to manually add the CREID tag to these? The Excel file I am attaching has the NIAID links for these dataset records. It also shows the PMIDs that we can keep querying against the NDE that DON'T show a dataset yet but DO list a CREID grant number on PubMed.
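The PMID-to-dataset lookup described above could be repeated periodically rather than done by hand. A minimal sketch, assuming the NDE query API accepts a Lucene-style `q` parameter and that `citation.pmid` is a queryable field; the endpoint path and field name are assumptions and should be checked against the NDE API documentation:

```python
import requests

# NOTE: the endpoint path and field name below are assumptions for illustration;
# verify them against the NDE API documentation before use.
NDE_API = "https://api.data.niaid.nih.gov/v1/query"  # assumed endpoint


def datasets_for_pmids(pmids):
    """Return NDE dataset hits whose citations list one of the given PMIDs."""
    hits = []
    for pmid in pmids:
        resp = requests.get(
            NDE_API,
            params={"q": f'citation.pmid:"{pmid}"', "size": 100},  # assumed field name
        )
        resp.raise_for_status()
        hits.extend(resp.json().get("hits", []))
    return hits


creid_pmids = ["12345678", "23456789"]  # placeholders for PMIDs from the grant-number search
print(len(datasets_for_pmids(creid_pmids)), "candidate datasets to flag as CREID assets")
```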

gtsueng (Contributor, Author) commented Feb 5, 2025

@hartwickma, @rshabman, @lisa-mml, @sudvenk

This issue was discussed at the bi-weekly meeting dated 2025.02.04 and is paused awaiting internal discussion by NIAID/ODSET. I am marking it as Review Requested-refinement to make it easier for you to find/respond to this issue.

For context, the interim solution we proposed to implement was inspired by how PubMed handles duplicate entries coming from a preprint server and the official publisher:
[image: screenshot of a PubMed record linking its preprint and published versions]

During the discussion, a potential solution for handling duplicates between DRYAD and Zenodo would be to delete the Zenodo record; however, this solution would not apply to DDE duplicates, as the corresponding duplicate record may have less metadata (especially critical funding data).

rshabman commented Feb 5, 2025

@gtsueng here is our recommendation for next steps based on our internal conversation:

  • For duplicate records that can be collapsed automatically, please proceed.
  • For duplicates that cannot be collapsed automatically and require linking via the DDE, please estimate the number of records and the time required to link them in the DDE.
  • Also, can the team brainstorm any other approaches that would provide a single solution for removing duplicate records? Is there a solution that can be applied in a uniform way? At this stage, an approach that is generalizable will benefit the portal as additional duplicates appear.

gtsueng (Contributor, Author) commented Feb 5, 2025

@rshabman, @sudvenk,

  • Regarding duplicate records that can be collapsed automatically: these are already merged automatically by default.

  • Regarding DDE duplicate records:

    • The initial list of duplicate mappings is already done. A rebuild will be required to update the metadata for each record; updating the post-build process and running the rebuild when triggered is estimated to take a day, BUT there are currently other builds in progress (so we can't trigger this immediately). Note that this list is probably already out of date, since CREID appears to be actively adding records to the DDE.
    • The code implementing a simple link between records on the UI side has already been written; it only needs to be reviewed and pushed, so it would probably take 1-2 days.
  • Regarding a singular solution for all duplicates: We'd love that too, but from what we've observed, the issue is too complex to address with a single solution. So far, we have different types of duplicates, and our ability to address them depends on metadata availability and consistency (a rough sketch of this classification follows the list):

    • Same identifier at the repository level: can be merged automatically.
    • Different identifiers at the repository level, but a shared DOI AND one repository consistently has better metadata than the other: can be linked automatically.
      • This means we can potentially copy the data-access information and delete the record from the repository with poorer metadata (so the pair would look like an automatically merged record).
    • Different identifiers at the repository level, but some other shared identifier AND it is unclear or inconsistent which repository has better metadata: can potentially be linked pending semi-automated or manual review (this is the case for the DDE); here it is better to keep both records and link them.
    • Different identifiers at the repository level and no other shared identifiers, BUT the same title and authors: these can't be identified automatically with precision (short species-based titles are frequently reused for different datasets, and author-name formatting can make the same person appear as different people).
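To make the branching above concrete, here is a minimal Python sketch of the classification. The record fields (identifier, doi, other_ids, title, authors) are illustrative, not the actual NDE schema, and the "which repository has richer metadata" check for the shared-DOI case is omitted for brevity:

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Illustrative fields only; not the actual NDE metadata schema.
    repo: str
    identifier: str
    doi: str | None = None
    other_ids: set[str] = field(default_factory=set)
    title: str = ""
    authors: tuple[str, ...] = ()


def classify_pair(a: Record, b: Record) -> str:
    """Rough decision logic for a candidate duplicate pair, per the list above."""
    if a.identifier == b.identifier:
        return "merge automatically"        # same repository-level identifier
    if a.doi and a.doi == b.doi:
        return "link automatically"         # shared DOI; keep the richer record
    if a.other_ids & b.other_ids:
        return "link after review"          # DDE case: keep both, link via sameAs
    if a.title == b.title and set(a.authors) == set(b.authors):
        return "flag for manual review"     # title/author match is too imprecise to act on
    return "not a duplicate"
```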
