
[Metadata Improvement]: Address duplication between DDE and other sources #162

Open · 2 of 17 tasks
gtsueng opened this issue Aug 21, 2024 · 4 comments

gtsueng (Contributor) commented Aug 21, 2024

Issue Name

Address duplication between DDE and other sources

Issue Description

Metadata submitted via the DDE come in through program/project-specific portals. For many of these submissions, the metadata is meant to supplement a record that has already been submitted elsewhere. Because it would be very difficult to merge a DDE record with a metadata record held elsewhere, an interim solution may be to link the records using the sameAs field. This field accepts an array of URLs, so it could be used within the NDE Discovery Portal to link two records that describe the same resource but cannot readily be merged.
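As a rough illustration of the proposed interim linking, a DDE-derived record could carry the URL of its counterpart in sameAs (a standard schema.org property that accepts an array of URLs). A minimal sketch in Python, assuming schema.org-style Dataset metadata; the identifiers and URLs below are placeholders, not real records:

```python
import json

# Hypothetical DDE-derived record; the @id and the GEO URL are placeholders,
# not actual NDE or DDE identifiers.
dde_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://example.org/dde/dataset-123",
    "name": "Supplementary metadata submitted via the DDE",
    "sameAs": [
        # URL of the duplicate record held in another repository
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE00000"
    ],
}

print(json.dumps(dde_record, indent=2))
```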

Issue Discussion

Duplicate records between the DDE and other resources have come up multiple times in earlier de-duplication discussions (OMICS-DI vs. SRA, OMICS-DI vs. GEO, and within Zenodo).

Please select the type of metadata improvement

  • Standardization (normalizing free text to an ontology)
  • Augmentation (adding values to metadata fields that are missing them)
  • Clean up (addressing redundancy or messy metadata)
  • Structure (changing the structure of the metadata to support front-end UI features)

Meta URL

No response

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/14

For internal use only. Assignee, please select the status of this issue

  • Not yet started
  • In progress
  • Blocked
  • Will not address

Status Description

No response

Request status check list

  • This metadata improvement has yet to be discussed between NIAID, Scripps, Leidos
  • This metadata improvement does not need to be discussed between NIAID, Scripps, Leidos
  • This metadata improvement has been discussed/reported between NIAID, Scripps, Leidos
  • This metadata improvement has been implemented locally to generate data for review
  • This metadata improvement has been implemented on Dev
  • This metadata improvement has been implemented on Dev and the results have been reviewed and approved for staging
  • This metadata improvement has been implemented on Staging
  • This page/documentation/change has been approved for Production
  • This page/documentation/change has been implemented on Production
@gtsueng added the enhancement (New feature or request) label on Aug 21, 2024
gtsueng (Contributor, Author) commented Sep 30, 2024

Related to this issue (received 2024.09.30 from CREID):

During the process of creating the bulk-upload Excel file, I was able to identify a set of 58 dataset records in the NDE that need to be flagged as CREID assets. The limitation of the current flagging system is that it does not account for assets that come in from other repositories. There needs to be a way to use grant numbers to indicate that a dataset should be flagged as part of CREID; otherwise we limit ourselves to only the assets registered on the DDE.

To get the list of 58, I searched PubMed for all of the CREID grant numbers. There are currently almost 700 PubMed IDs that list a CREID grant number. I ran a query in the NDE and found 58 datasets associated with these PMIDs. Is it possible for your group to manually add the CREID tag to these? The Excel file I am attaching has the NIAID links for these dataset records. It also shows the PMIDs that we can keep querying against the NDE that DON'T show a dataset yet but DO list a CREID grant number on PubMed.
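The PMID-to-dataset lookup described above could be repeated periodically rather than done by hand. A minimal sketch, assuming the NDE query API accepts a Lucene-style `q` parameter and that `citation.pmid` is a queryable field; the endpoint path and field name are assumptions and should be checked against the NDE API documentation:

```python
import requests

# NOTE: the endpoint path and field name below are assumptions for illustration;
# verify them against the NDE API documentation before use.
NDE_API = "https://api.data.niaid.nih.gov/v1/query"  # assumed endpoint


def datasets_for_pmids(pmids):
    """Return NDE dataset hits whose citations list one of the given PMIDs."""
    hits = []
    for pmid in pmids:
        resp = requests.get(
            NDE_API,
            params={"q": f'citation.pmid:"{pmid}"', "size": 100},  # assumed field name
        )
        resp.raise_for_status()
        hits.extend(resp.json().get("hits", []))
    return hits


creid_pmids = ["12345678", "23456789"]  # placeholders for PMIDs from the grant-number search
print(len(datasets_for_pmids(creid_pmids)), "candidate datasets to flag as CREID assets")
```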

gtsueng (Contributor, Author) commented Feb 5, 2025

@hartwickma, @rshabman, @lisa-mml, @sudvenk

This issue was discussed at the bi-weekly meeting dated 2025.02.04 and is paused awaiting internal discussion by NIAID/ODSET. I am marking it as Review Requested-refinement to make it easier for you to find/respond to this issue.

For context, the interim solution we proposed to implement was inspired by how PubMed handles duplicate entries coming from a preprint server and the official publisher:
[image: screenshot of a PubMed record linking its preprint and published versions]

During the discussion, a potential solution for handling duplicates between DRYAD and Zenodo would be to delete the Zenodo record; however, this solution would not apply to DDE duplicates, as the corresponding duplicate record may have less metadata (especially critical funding data).

rshabman commented Feb 5, 2025

@gtsueng here is our recommendation for next steps based on our internal conversation:

  • For duplicate records that can be collapsed automatically, please proceed.
  • For duplicates that cannot be collapsed automatically and require linking via the DDE, please estimate the number of records and the time required to link them in the DDE.
  • Also, can the team brainstorm any other approaches that would provide a single solution for removing duplicate records? Is there a solution that can be applied in a uniform way? At this stage, an approach that is generalizable will benefit the portal as additional duplicates appear.

gtsueng (Contributor, Author) commented Feb 5, 2025

@rshabman, @sudvenk,

  • Regarding duplicate records that can be collapsed automatically: these are already merged automatically by default.

  • Regarding DDE duplicate records:

    • The initial list of duplicate mappings is already done. A rebuild will be required to update the metadata for each record; updating the post-build process and running the rebuild when triggered is estimated to take a day, BUT there are currently other builds in progress (so we can't trigger this immediately). Note that this list is probably already out of date, since CREID appears to be actively adding records to the DDE.
    • The code implementing a simple link between records on the UI side has already been written; it only needs to be reviewed and pushed, so it would probably take 1-2 days.
  • Regarding a singular solution for all duplicates: We'd love that too, but from what we've observed, the issue is too complex to address with a single solution. So far, we have different types of duplicates, and our ability to address them depends on metadata availability and consistency (a rough sketch of this classification follows the list):

    • Same identifier at the repository level: can be merged automatically.
    • Different identifiers at the repository level, but a shared DOI AND one repository consistently has better metadata than the other: can be linked automatically.
      • This means we can potentially copy the data-access information and delete the record from the repository with poorer metadata (so the pair would look like an automatically merged record).
    • Different identifiers at the repository level, but some other shared identifier AND it is unclear or inconsistent which repository has better metadata: can potentially be linked pending semi-automated or manual review (this is the case for the DDE); here it is better to keep both records and link them.
    • Different identifiers at the repository level and no other shared identifiers, BUT the same title and authors: these can't be identified automatically with precision (short species-based titles are frequently reused for different datasets, and author-name formatting can make the same person appear as different people).
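To make the branching above concrete, here is a minimal Python sketch of the classification. The record fields (identifier, doi, other_ids, title, authors) are illustrative, not the actual NDE schema, and the "which repository has richer metadata" check for the shared-DOI case is omitted for brevity:

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Illustrative fields only; not the actual NDE metadata schema.
    repo: str
    identifier: str
    doi: str | None = None
    other_ids: set[str] = field(default_factory=set)
    title: str = ""
    authors: tuple[str, ...] = ()


def classify_pair(a: Record, b: Record) -> str:
    """Rough decision logic for a candidate duplicate pair, per the list above."""
    if a.identifier == b.identifier:
        return "merge automatically"        # same repository-level identifier
    if a.doi and a.doi == b.doi:
        return "link automatically"         # shared DOI; keep the richer record
    if a.other_ids & b.other_ids:
        return "link after review"          # DDE case: keep both, link via sameAs
    if a.title == b.title and set(a.authors) == set(b.authors):
        return "flag for manual review"     # title/author match is too imprecise to act on
    return "not a duplicate"
```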
