This repository has been archived by the owner on Feb 24, 2024. It is now read-only.

Add requirement.txt #4

Open
biofoolgreen opened this issue Jul 7, 2022 · 8 comments

Comments

@biofoolgreen

Hi, is there any plan to add a requirements.txt that would let us install the needed packages with pip? Thanks.

@conceptofmind
Owner

Hi @biofoolgreen ,

Package management with pip will be implemented in the future. See the TODO and this PR.

Best,

Enrico

@biofoolgreen
Author

Thanks @conceptofmind!

I generated a requirements.txt by manually installing all the packages and then exporting the list with pipreqs. See below:

colossalai==0.1.3
datasets==2.2.2
einops==0.4.1
sentencepiece==0.1.96
torch==1.11.0
transformers==4.11.3
wandb==0.12.21

Hopefully it's useful for someone else. However, I can't run the model successfully when running $ python train.py. The error message:

Traceback (most recent call last):
  File "train.py", line 9, in <module>
    from lamda_pytorch.config.config import CFG
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/__init__.py", line 1, in <module>
    from lamda_pytorch.lamda_pytorch import LaMDA
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/lamda_pytorch.py", line 9, in <module>
    from lamda_pytorch.config.config import CFG
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/config/config.py", line 5, in <module>
    class CFG:
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/config/config.py", line 65, in CFG
    train_columns: ClassVar[list[str]] = field(
TypeError: 'type' object is not subscriptable

Any ideas?

@conceptofmind
Owner

Hi @biofoolgreen ,

I am not currently receiving this error on my end when running a few test cases. I will work through a minimal reproducible example to see if I can get a matching error. I will test with local environment data loaders as well.

The parts of the code related to that error are:

# Remove unused columns from the training dataset
load_train_data = load_train_data.remove_columns(args.train_columns)

And:

train_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Train dataset columns to remove."}
)

Additionally, it seems there is a bug in Hugging Face datasets unrelated to your specific error; I will have to open an issue with them to get it resolved.

A few other notes:

  • You are not able to run train.py directly. Colossal-AI requires something like colossalai run --nproc_per_node 1 train.py
  • SentencePiece is not necessarily required or even currently implemented. You would have to train the SentencePiece tokenizer on the dataset you are using and then convert it to one that can be managed by Hugging Face tokenizers. I will be adding a default Hugging Face gpt2 tokenizer which you can use as well. A tokenizer is required to run the data loader and the model; see the sketch after this list.
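For reference, here is a minimal sketch of loading the stock pretrained gpt2 tokenizer from Hugging Face (illustrative only; the repo's eventual default setup may differ):

# Minimal sketch: load the stock pretrained GPT-2 tokenizer.
# Illustrative only; the repo's default tokenizer setup may differ.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257 for the stock GPT-2 vocabulary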

Best,

Enrico

@conceptofmind
Owner

I set up a minimal reproducible example in a Jupyter Notebook and it seems to be working fine. I will have to do a further review.

from itertools import chain
import copy

from datasets import load_dataset
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer(vocab_file='/token/vocab.json', merges_file='/token/merges.txt')
print(tokenizer.vocab_size)

load_train_data = load_dataset("the_pile", split="train", streaming=True)
load_train_data = load_train_data.remove_columns(['meta'])
print(next(iter(load_train_data)))

shuffled_train_files = load_train_data.shuffle(seed=42, buffer_size=10_000)
print(next(iter(shuffled_train_files)))

def tokenize(examples):
    seq_length = 2048
    examples = tokenizer(examples["text"])
    # Concatenate all tokenized sequences into one long stream per key
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every chunk is exactly seq_length tokens
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length

    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }

    # For causal language modeling, the labels are a copy of the input ids
    result["labels"] = copy.deepcopy(result["input_ids"])

    return result

tokenized_train_dataset = shuffled_train_files.map(tokenize, batched=True, remove_columns = ['text'])
print(next(iter(tokenized_train_dataset)))
print(len(next(iter(tokenized_train_dataset))['input_ids']))

@Aleksandar1932

If you decide to use Poetry, a requirements.txt is not needed, since all of the dependencies are defined in pyproject.toml (and a requirements.txt can still be generated from it with poetry export if someone needs one).

@conceptofmind
Owner

Hi @biofoolgreen ,

I rebuilt the data loader to work locally: https://github.com/conceptofmind/LaMDA-pytorch/blob/main/lamda_pytorch/build_dataloader.py

A few things to take into consideration if you are going to use the provided Pile dataset:

  • The Pile dataset is over 1TB of data. You will likely need a storage device with up to 2TB of space for everything including the tokenizer and saved models.
  • If you want to use a different dataset, I provided a configuration file with different fields that can be adjusted. You can find many different text-generation datasets on the Hugging Face Datasets Hub. I will still be uploading the GODEL conversational dataset as well.

The configuration for the data loader looks like this:

"""
Configuration for data loader.
"""

use_huggingface: bool = field(
    default = True,
    metadata = {'help': 'Whether to use huggingface datasets'}
)

train_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face training dataset."}
)

eval_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face validation dataset."}
)

choose_train_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face training dataset split."}
)

choose_eval_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face validation dataset split."}
)

remove_train_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Train dataset columns to remove."}
)

remove_eval_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Validation dataset columns to remove."}
)

seed: Optional[int] = field(
    default=42, 
    metadata={"help": "Random seed used for reproducibility."}
)

tokenizer_name: Optional[str] = field(
    default="gpt2",
    metadata={"help": "Tokenizer name."}
)

tokenizer_seq_length: Optional[int] = field(
    default=512, 
    metadata={"help": "Sequence lengths used for tokenizing examples."}
)

select_input_string: Optional[str] = field(
    default="text", 
    metadata={"help": "Select the key to used as the input string column."}
)

batch_size: Optional[int] = field(
    default=16, 
    metadata={"help": "Batch size for training and validation."}
)

save_to_path: Optional[str] = field(
    default="''", 
    metadata={"help": "Save the dataset to local disk."}
)
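For anyone adapting these fields, here is a rough sketch of how they might map onto Hugging Face datasets calls. This is an illustration using only the public datasets/transformers APIs, not the repo's exact build_dataloader.py code:

# Rough sketch of how the config fields might drive the data loader.
# Illustrative only; see build_dataloader.py for the actual implementation.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # tokenizer_name

# train_dataset_name and choose_train_split; streaming avoids
# downloading all of The Pile up front
train_data = load_dataset("the_pile", split="train", streaming=True)

train_data = train_data.remove_columns(["meta"])              # remove_train_columns
train_data = train_data.shuffle(seed=42, buffer_size=10_000)  # seed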

Let me know if you are still getting the previous error.

Best,

Enrico

@msaidbilgehan

I guess @biofoolgreen is facing the "subscript for class list will generate runtime exception" error in lamda_pytorch\config\config.py at lines 84 and 88. This is a PEP 563 (Postponed Evaluation of Annotations) topic, which is fixed by importing annotations from the __future__ module.

I also faced the same issue and fixed it with the solution above (adding the missing import).
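For illustration, a minimal sketch of that fix, assuming Python 3.8 (where built-in generics such as list[str] are not subscriptable at runtime) and a dataclass-based CFG as the traceback suggests; on Python 3.9+ the import is unnecessary because PEP 585 makes list[str] subscriptable natively:

# Sketch of the fix, assuming Python 3.8 and a dataclass-based CFG.
# PEP 563 turns annotations into strings, so ClassVar[list[str]] is
# never evaluated at class-creation time and the TypeError disappears.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import ClassVar

@dataclass
class CFG:
    train_columns: ClassVar[list[str]] = field(
        default = ['meta'],
        metadata = {"help": "Train dataset columns to remove."}
    )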

@conceptofmind
Owner

conceptofmind commented Aug 20, 2022

Hi @msaidbilgehan ,

What version of Python are you using? I have been reading more into the error: built-in generics like list[str] only became subscriptable at runtime in Python 3.9 (PEP 585), so on earlier versions that annotation fails unless evaluation is postponed (PEP 563). I may have to put a note or remove that part of the configuration completely, although that may make it more difficult for others who are not familiar with Hugging Face datasets.

Thank you,

Enrico
