[Metadata Improvement]: GPT topicCategory hallucinations #154

Open
3 of 19 tasks
gtsueng opened this issue Jul 12, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

gtsueng commented Jul 12, 2024

Issue Name

GPT topicCategory hallucinations

Issue Description

Records with topicCategory hallucinations | hallucinated topicCategory
https://data-staging.niaid.nih.gov/resources?id=zenodo_8112069 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_7574524 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_5001210 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_7104970 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_6139140 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_6787653 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_8009444 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_2629148 | Human biology
https://data-staging.niaid.nih.gov/resources?id=zenodo_7226308 | Human biology
https://data-staging.niaid.nih.gov/resources?id=mendeley_89d2d85g9s | Human biology
https://data-staging.niaid.nih.gov/resources?id=dataverse_10.7910_dvn_meudnp | Human biology
https://data-staging.niaid.nih.gov/resources?id=dataverse_10.7910_dvn_mh3osr | Human biology
https://data-staging.niaid.nih.gov/resources?id=dataverse_10.7910_dvn_qupdyi | Human biology
https://data-staging.niaid.nih.gov/resources?id=mendeley_gbypg9zky3 | Human biology

It appears that GPT over-assigns the Human biology topicCategory to any document that mentions humans, including documents about human impact on the environment.

To do:

  • Identify other frequently hallucinated terms
  • Develop heuristics for filtering out excess hallucinated terms
    • The heuristic may be length-based (using the length of the name and description provided to GPT)
    • Can be based on the other topicCategories GPT found (e.g., remove 'Human biology' when 'Ecology' and 'Biodiversity' are also topicCategories)
    • Can be restricted to specific repositories (only apply the heuristics to Mendeley and Zenodo)
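
To make the three heuristic ideas above concrete, here is a rough sketch of how they might be combined. Everything about the record shape is an assumption: it treats each record as a dict with `_id`, `name`, `description`, and a flat list of `topicCategory` strings, and infers the repository from the `_id` prefix (as in `zenodo_8112069`). The `MIN_TEXT_LENGTH` threshold is an arbitrary placeholder, and the idea that short inputs are the risky ones is itself unverified.

```python
# Hypothetical sketch only: record shape, names, and threshold are assumptions.
CONFLICTING = {"Ecology", "Biodiversity"}   # co-occurring categories from the example above
TARGET_REPOS = {"zenodo", "mendeley"}       # repositories the heuristics are limited to
MIN_TEXT_LENGTH = 200                       # placeholder; would need tuning on real data


def repo_of(record_id):
    """Infer the source repository from the id prefix, e.g. 'zenodo_8112069'."""
    return record_id.split("_", 1)[0].lower()


def should_drop_human_biology(record):
    """Return True if 'Human biology' looks like a likely hallucination."""
    cats = set(record.get("topicCategory", []))
    if "Human biology" not in cats:
        return False
    # Repository restriction: leave records from other sources untouched.
    if repo_of(record["_id"]) not in TARGET_REPOS:
        return False
    # Length-based check (assumes short inputs are more error-prone).
    text = f"{record.get('name', '')} {record.get('description', '')}".strip()
    too_short = len(text) < MIN_TEXT_LENGTH
    # Co-occurrence check: both conflicting categories are present.
    conflicting = CONFLICTING <= cats
    return too_short or conflicting
```

Each check is independent, so any one of them could be dropped or swapped out once we know which signal actually correlates with the bad assignments.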

Related issues

NIAID-Data-Ecosystem/nde-metadata-corrections#20

Issue Discussion

No response

Please select the type of metadata improvement

  • Standardization (normalizing free text to an ontology)
  • Augmentation (adding values for metadata fields missing values)
  • Clean up (addressing redundancy or messy metadata)
  • Structure (changing the structuring of the metadata to support front end UI features)

Meta URL

No response

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/4

For internal use only. Assignee, please select the status of this issue

  • Not yet started
  • In progress
  • Blocked
  • Will not address

Status Description

No response

Request status check list

  • This metadata improvement has yet to be discussed between NIAID, Scripps, Leidos
  • This metadata improvement does not need to be discussed between NIAID, Scripps, Leidos
  • This metadata improvement has been discussed/reported between NIAID, Scripps, Leidos
  • This metadata improvement has been implemented locally to generate data for review
  • This metadata improvement has been implemented on Dev
  • This metadata improvement has been implemented on Dev and the results have been reviewed and approved for staging
  • This metadata improvement has been implemented on Staging
  • This page/documentation/change has been approved for Production
  • This page/documentation/change has been implemented on Production

gtsueng commented Aug 6, 2024

At least for Human biology, we can use the following inclusion criteria to reduce the bias:

Include 'Human biology' only if at least one of these other topicCategory values is also present:

  • Anatomy
  • Transcriptomics
  • Developmental biology
  • Oncology
  • Physiology
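
As a minimal sketch of the inclusion rule above, assuming the topicCategory values for a record are available as a plain list of strings (the function name and that input shape are hypothetical, not taken from the pipeline):

```python
# Hypothetical sketch: assumes topicCategory values arrive as a list of strings.
SUPPORTING = frozenset({
    "Anatomy",
    "Transcriptomics",
    "Developmental biology",
    "Oncology",
    "Physiology",
})


def filter_human_biology(categories):
    """Drop 'Human biology' unless a supporting category is also present."""
    if "Human biology" in categories and not SUPPORTING.intersection(categories):
        return [c for c in categories if c != "Human biology"]
    return list(categories)
```

For example, `filter_human_biology(["Human biology", "Ecology"])` would return only `["Ecology"]`, while a list containing both `"Human biology"` and `"Oncology"` would be returned unchanged.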
