
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays #972

jiejie1993 opened this issue Dec 11, 2024 · 1 comment
When tokenizing a dataset locally, the following error is raised once the dataset gets large:
```
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 183246.29it/s]
Generating train split: 1024239 examples [01:56, 8772.38 examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1989, in _prepare_split_single
    writer.write_table(table)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 583, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 4289, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/disk3/gyj/xtuner/xtuner/tools/process_untokenized_datasets.py", line 55, in <module>
    processed_dataset = process_untokenized_dataset(cfg)
  File "/mnt/disk3/gyj/xtuner/xtuner/tools/process_untokenized_datasets.py", line 46, in process_untokenized_dataset
    dataset = BUILDER.build(config.train_dataloader.dataset)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/xtuner/dataset/huggingface.py", line 298, in process_hf_dataset
    return process(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xtuner/dataset/huggingface.py", line 167, in process
    dataset = build_origin_dataset(dataset, split)
  File "/usr/local/lib/python3.10/dist-packages/xtuner/dataset/huggingface.py", line 30, in build_origin_dataset
    dataset = BUILDER.build(dataset)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
```

@jiejie1993 (Author)

If the data is in jsonl format, it can be read normally with load_dataset. How should the code be modified to use jsonl-format data?
