Skip to content

Conversation

MFreidank
Copy link
Contributor

@MFreidank MFreidank commented Apr 25, 2022

This PR implements a dataloader for BioASQ Task A (text task) in an attempt to close issue #261 . I've attempted to stay close to examples/bioasq_task_b.py whereever possible. Please let me know if any changes are required.

Tagging @jason-fries as we discussed previously on the issue thread and I noticed he made some recent changes to bioasq_task_b that I also tried to match with my PR.

I have been able to confirm that my data loader works across years and also got unit test runs for individual years to pass (I tried 2022 and 2013). However, it's hard for me to do a single clean unittest run across all configurations as dataset sizes are very large (>>10 GB for some of the files) and individual tests take a very long time to run on the machine I have access to.
Could someone help with testing?

If the following information is NOT present in the issue, please populate:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@hakunanatasha
Copy link
Collaborator

@MFreidank happy to help you run it; do you know if the full dataset is an aggregate of the individual years? If the individual years pass via --subset_id in the unit tests, then this is fine.

I noticed this also requires a package that is not default ijson; is this a hard requirement? If so, i'll need to update the requirements file too

@MFreidank
Copy link
Contributor Author

Hi @hakunanatasha

Thank you for offering your help.

Yes, essentially the full dataset would be the aggregate over all individual years.
My challenge is that due to dataset size running tests for any single --subset_id takes time on my machine (>>8h for the subsets I tried). This makes a "full" iteration over all implemented subsets difficult/slow and I was only able to unittest individual subsets (for years 2013 and 2022) and verify that dataset loading works across all datasets.

Regarding ijson: I believe it would at least make our code a lot simpler.
My reason for using it was that loading data via json.load unfortunately attempts to load 20+ GB data into memory in one shot for most of the subsets. This is unlikely to work well on most end user machines (my own included).
ijson parses the JSON files one object at a time and therefore never reads the whole file into memory at any given point.
This allows users on (nearly) any machine to load this data.
The alternative would be to essentially implement an on-the-fly object by object JSON parser inline (more or less replicating ijson functionality within our repository) as unfortunately the python standard library does not currently have direct means to do this.
I felt that would add enough complexity that having this dependency is warranted.

Please let me know if any of the above is unclear.

@hakunanatasha
Copy link
Collaborator

@MFreidank good answers - 8h is a bit tough. Let me see if my machine can handle it.

an on-the-fly json parser is overkill; your rationale is plenty enough to warrant a new package in the requirements!

@MFreidank
Copy link
Contributor Author

Hi @hakunanatasha - any updates?
I read some things on discord around major changes being under way.
Please let me know when I should update this PR to ensure it stays up to date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants