This repository has been archived by the owner on Feb 24, 2024. It is now read-only.

sample data for local development testing #2

Open

htthYjh opened this issue Jul 6, 2022 · 4 comments

Comments

@htthYjh commented Jul 6, 2022

Hi, this is a great project. Can you provide some sample data for local development testing? I want to try it out. Thank you very much!

@conceptofmind (Owner) commented Jul 6, 2022

Hi @htthYjh,

The repository initially consisted of just the pre-training architecture, but I am actively updating it on a daily basis. When completed, the full repository will allow for scalable and distributed pre-training.

I am working on reproducing the OPT and GODEL pre-training corpora and will upload both of them to Hugging Face Datasets. Currently, a Hugging Face streaming data loader is implemented, which lets you use EleutherAI's Pile dataset for pre-training. I will be updating the repository to include a local-environment data loader to go along with the streaming one.
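
For reference, a minimal sketch of what streaming The Pile through Hugging Face datasets looks like; this is an illustration only, not the repository's exact loader:

# Minimal sketch of streaming The Pile with Hugging Face datasets (illustration only).
from datasets import load_dataset

# streaming=True iterates over the data without downloading the full ~1TB dataset first
pile = load_dataset("the_pile", split="train", streaming=True)

for example in pile.take(3):         # take(n) yields the first n streamed records
    print(example["text"][:200])     # each record has a "text" field plus "meta"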

Best,

Enrico

@htthYjh (Author) commented Jul 8, 2022

Thank you so much, great work. Looking forward to your progress.

@conceptofmind (Owner) commented

Hi @htthYjh,

I rebuilt the data loader to work locally: https://github.com/conceptofmind/LaMDA-pytorch/blob/main/lamda_pytorch/build_dataloader.py
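
Roughly, the local (non-streaming) path amounts to downloading the dataset, tokenizing it, and wrapping it in a PyTorch DataLoader. The sketch below is illustrative only; see the linked build_dataloader.py for the actual implementation:

# Rough sketch of a local (non-streaming) pipeline; names and details are illustrative.
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

dataset = load_dataset("the_pile", split="train")   # downloads and caches locally (~1TB+)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # gpt2 has no pad token by default

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "meta"])
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])

train_loader = DataLoader(tokenized, batch_size=16, shuffle=True)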

A few things you will need to take into consideration if you are going to use the provided Pile dataset:

  • The Pile dataset is over 1TB of data. You will likely need a storage device with up to 2TB of space for everything, including the tokenizer and saved models.
  • If you want to use a different dataset, I provided a configuration file with fields that can be adjusted (see the example after the configuration below). You can find many different text generation datasets on the Hugging Face Datasets hub. I will still be uploading the GODEL conversational dataset as well.

The configuration for the data loader looks like this:

"""
Configuration for data loader.
"""

use_huggingface: bool = field(
    default = True,
    metadata = {'help': 'Whether to use huggingface datasets'}
)

train_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face training dataset."}
)

eval_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face validation dataset."}
)

choose_train_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face training dataset split."}
)

choose_eval_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face validation dataset split."}
)

remove_train_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Train dataset columns to remove."}
)

remove_eval_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Validation dataset columns to remove."}
)

seed: Optional[int] = field(
    default=42, 
    metadata={"help": "Random seed used for reproducibility."}
)

tokenizer_name: Optional[str] = field(
    default="gpt2",
    metadata={"help": "Tokenizer name."}
)

tokenizer_seq_length: Optional[int] = field(
    default=512, 
    metadata={"help": "Sequence lengths used for tokenizing examples."}
)

select_input_string: Optional[str] = field(
    default="text", 
    metadata={"help": "Select the key to used as the input string column."}
)

batch_size: Optional[int] = field(
    default=16, 
    metadata={"help": "Batch size for training and validation."}
)

save_to_path: Optional[str] = field(
    default="''", 
    metadata={"help": "Save the dataset to local disk."}
)
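
For example, if you wanted to point the loader at a different Hugging Face dataset, you would override the relevant fields. This is just a sketch against the snippet above; the dataset name and splits are illustrative, and the ClassVar column lists would need to be edited in the class itself:

# Sketch only: overriding the config above for a different Hugging Face dataset.
# "openwebtext" is an illustrative choice; it has a single "text" column, so the
# remove_*_columns ClassVars (which default to ['meta']) would need editing too.
cfg = CFG(
    train_dataset_name="openwebtext",
    eval_dataset_name="openwebtext",
    choose_train_split="train",
    choose_eval_split="train",      # openwebtext only ships a "train" split
    select_input_string="text",     # column holding the raw text
    tokenizer_name="gpt2",
    tokenizer_seq_length=512,
    batch_size=16,
)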

Let me know if this solves your issue.

Best,

Enrico

@htthYjh (Author) commented Aug 12, 2022

OK, let me check. Thank you very much!
