Incomplete text (abstract or title) #2

Closed
iacopy opened this issue Feb 25, 2022 · 1 comment
Labels
bug Something isn't working

Comments

iacopy commented Feb 25, 2022

Describe the bug

See gijswobben#23

While iterating over the articles returned by a PubMed query, I also noticed that the abstract is sometimes incomplete.

For instance, the query:
((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))

returns (among the first 10 results printed):
pubmed_id = '31015971'
abstract = 'Bald eagle ('

iacopy added the bug label Feb 25, 2022
iacopy added a commit that referenced this issue Feb 25, 2022
In some cases the returned title and/or abstract was incomplete
(issue #2; issue 23 in the original pymed repo).

This happens when the text contains inline HTML markup tags
(<b>, <i>, <sub>, <sup>, ...).

Example: PMID 31689885
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Before the fix the returned title was just: 'Gamma Irradiated '
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Before the fix the returned abstract was just: 'The effect of '
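
The truncation can be reproduced with the standard library alone. The sketch below assumes the text is read through the `.text` attribute of an `xml.etree.ElementTree` element, which is an assumption about the cause and is not confirmed in this thread. The XML string is a shortened, illustrative version of the title above, not the full PubMed record.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative stand-in for the ArticleTitle of PMID 31689885.
raw_title = (
    "<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> "
    "Extract</ArticleTitle>"
)

element = ET.fromstring(raw_title)

# .text holds only the text that precedes the first child element (<i>),
# so everything from the markup onward is lost.
title = element.text
print(repr(title))       # 'Gamma Irradiated '

# itertext() walks the element and all of its descendants, yielding each
# text and tail fragment in document order.
full_title = "".join(element.itertext())
print(repr(full_title))  # 'Gamma Irradiated Rhodiola sachalinensis Extract'
```

The same behavior would explain the abstract 'Bald eagle (' reported above if the species name that follows is wrapped in an inline tag.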

Fastest solution found: strip the frequently used HTML markup tags (<b>, <i>, <sub>, <sup>) from the text.
This seems to fix most papers correctly, at least for the tags mentioned above.
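
A minimal sketch of this kind of cleanup, assuming the tags are removed from the raw XML string before it is parsed. The helper name `strip_inline_tags` and the regular expression are hypothetical and are not taken from the actual fix; the input is a shortened, illustrative version of the abstract above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: opening and closing forms of the four inline tags
# named above. Tags with attributes, or other inline tags, are not covered.
INLINE_TAGS = re.compile(r"</?(?:b|i|sub|sup)>")


def strip_inline_tags(xml_text: str) -> str:
    """Remove <b>, <i>, <sub>, <sup> tags while keeping the text they wrap."""
    return INLINE_TAGS.sub("", xml_text)


# Shortened, illustrative stand-in for the AbstractText of PMID 31689885.
raw_abstract = (
    "<AbstractText>The effect of <i>Rhodiola sachalinensis</i> "
    "Boriss extract irradiated</AbstractText>"
)

cleaned = strip_inline_tags(raw_abstract)
abstract = ET.fromstring(cleaned).text
print(repr(abstract))
# 'The effect of Rhodiola sachalinensis Boriss extract irradiated'
```

Stripping tags this way discards the formatting itself, so subscripts and superscripts end up inline with the surrounding text (for example, `H<sub>2</sub>O` becomes `H2O`).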
iacopy commented Feb 25, 2022

Fixed in PR #5

iacopy closed this as completed Feb 25, 2022