When pre-tokenizing a dataset locally, the following error is raised once the dataset is large:
```
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 183246.29it/s]
Generating train split: 1024239 examples [01:56, 8772.38 examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1989, in _prepare_split_single
    writer.write_table(table)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 583, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 4289, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/disk3/gyj/xtuner/xtuner/tools/process_untokenized_datasets.py", line 55, in <module>
    processed_dataset = process_untokenized_dataset(cfg)
  File "/mnt/disk3/gyj/xtuner/xtuner/tools/process_untokenized_datasets.py", line 46, in process_untokenized_dataset
    dataset = BUILDER.build(config.train_dataloader.dataset)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/xtuner/dataset/huggingface.py", line 298, in process_hf_dataset
    return process(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xtuner/dataset/huggingface.py", line 167, in process
    dataset = build_origin_dataset(dataset, split)
  File "/usr/local/lib/python3.10/dist-packages/xtuner/dataset/huggingface.py", line 30, in build_origin_dataset
    dataset = BUILDER.build(dataset)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
```
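
For context, the underlying `ArrowInvalid: offset overflow while concatenating arrays` is raised by PyArrow, not by xtuner itself: Arrow's default `string` type stores offsets as 32-bit integers, so a single array is capped at roughly 2 GiB of character data, and `Table.combine_chunks()` merges every chunk of a column into one contiguous buffer, which overflows that cap on very large text datasets. Below is a minimal sketch of a common workaround at the PyArrow level, casting the column to `large_string` (64-bit offsets) before combining. This is an illustration of the failure mode and the cast technique only, not a patch to `datasets` or xtuner:

```python
import pyarrow as pa

# Arrow's default `string` type uses int32 offsets, so one combined array
# may hold at most ~2 GiB of character data. combine_chunks() merges all
# chunks of a column into a single buffer, which is where the overflow
# surfaces on very large datasets.
chunks = [pa.array(["x" * 1024] * 1024) for _ in range(4)]  # small demo chunks
table = pa.table({"text": pa.chunked_array(chunks)})

# Workaround sketch: cast to large_string (int64 offsets) before combining,
# which lifts the per-array 2 GiB offset limit.
large = table.cast(pa.schema([pa.field("text", pa.large_string())]))
combined = large.combine_chunks()
print(combined.schema)  # text: large_string
```

Since the overflow happens inside `datasets`' own Arrow writer here, another practical option may be to split the raw data files into smaller shards before running `process_untokenized_datasets.py`, so that no single generated batch approaches the 32-bit offset limit.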