This repository has been archived by the owner on Feb 24, 2024. It is now read-only.

Add requirement.txt #4

Open
biofoolgreen opened this issue Jul 7, 2022 · 8 comments

Comments

@biofoolgreen

Hi, is there any plan to add a requirements.txt that would let us install the needed packages with pip? Thanks.

@conceptofmind
Owner

Hi @biofoolgreen ,

Package management with pip will be implemented in the future. See the TODO and this PR.

Best,

Enrico

@biofoolgreen
Author

Thanks @conceptofmind!

I generated a requirements.txt by manually installing all the packages and then exporting the list with pipreqs. See below:

colossalai==0.1.3
datasets==2.2.2
einops==0.4.1
sentencepiece==0.1.96
torch==1.11.0
transformers==4.11.3
wandb==0.12.21

Hopefully it's useful for someone else. However, I can't run the model successfully when running $ python train.py. The error message:

Traceback (most recent call last):
  File "train.py", line 9, in <module>
    from lamda_pytorch.config.config import CFG
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/__init__.py", line 1, in <module>
    from lamda_pytorch.lamda_pytorch import LaMDA
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/lamda_pytorch.py", line 9, in <module>
    from lamda_pytorch.config.config import CFG
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/config/config.py", line 5, in <module>
    class CFG:
  File "/localdata/liguoying/LaMDA-pytorch/lamda_pytorch/config/config.py", line 65, in CFG
    train_columns: ClassVar[list[str]] = field(
TypeError: 'type' object is not subscriptable

Any ideas?

@conceptofmind
Owner

Hi @biofoolgreen ,

I am not currently receiving this error on my end when running a few test cases. I will work through a minimal reproducible example to see if I can get a matching error. I will test with local environment data loaders as well.

The parts of the code related to that error are:

# Remove unused columns from the training dataset
load_train_data = load_train_data.remove_columns(args.train_columns)

And:

train_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Train dataset columns to remove."}
)

Additionally, it seems there is a bug in Hugging Face datasets unrelated to your specific error; I will have to open an issue with them to get it resolved.

A few other notes:

  • You are not able to run train.py directly. Colossal-AI requires something like colossalai run --nproc_per_node 1 train.py
  • SentencePiece is not necessarily required or even currently implemented. You would have to train the SentencePiece tokenizer on the dataset you are using and then convert it to one that can be managed by Hugging Face tokenizers. I will be adding a default Hugging Face gpt2 tokenizer which you can use as well. A tokenizer is required to run the data loader and the model; see the sketch after this list.
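For reference, here is a minimal sketch of loading the stock pretrained gpt2 tokenizer from Hugging Face (illustrative only; the repo's eventual default setup may differ):

# Minimal sketch: load the stock pretrained GPT-2 tokenizer.
# Illustrative only; the repo's default tokenizer setup may differ.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257 for the stock GPT-2 vocabulary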

Best,

Enrico

@conceptofmind
Owner

I set up a minimal reproducible example in a Jupyter Notebook and it seems to be working fine. I will have to do a further review.

from itertools import chain
import copy

from datasets import load_dataset
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer(vocab_file='/token/vocab.json', merges_file='/token/merges.txt')
print(tokenizer.vocab_size)

load_train_data = load_dataset("the_pile", split="train", streaming=True)
load_train_data = load_train_data.remove_columns(['meta'])
print(next(iter(load_train_data)))

shuffled_train_files = load_train_data.shuffle(seed=42, buffer_size=10_000)
print(next(iter(shuffled_train_files)))

def tokenize(examples):
    seq_length = 2048
    examples = tokenizer(examples["text"])
    # Concatenate all tokenized sequences into one long stream per key
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every chunk is exactly seq_length tokens
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length

    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }

    # For causal language modeling, the labels are a copy of the input ids
    result["labels"] = copy.deepcopy(result["input_ids"])

    return result

tokenized_train_dataset = shuffled_train_files.map(tokenize, batched=True, remove_columns = ['text'])
print(next(iter(tokenized_train_dataset)))
print(len(next(iter(tokenized_train_dataset))['input_ids']))

@Aleksandar1932

If you decide to use Poetry, a requirements.txt is not needed, since all of the dependencies are defined in pyproject.toml (and a requirements.txt can still be generated from it with poetry export if someone needs one).

@conceptofmind
Owner

Hi @biofoolgreen ,

I rebuilt the data loader to work locally: https://github.com/conceptofmind/LaMDA-pytorch/blob/main/lamda_pytorch/build_dataloader.py

A few things to take into consideration if you are going to use the provided Pile dataset:

  • The Pile dataset is over 1TB of data. You will likely need a storage device with up to 2TB of space for everything including the tokenizer and saved models.
  • If you want to use a different dataset, I provided a configuration file with different fields that can be adjusted. You can find many different text-generation datasets on the Hugging Face Datasets Hub. I will still be uploading the GODEL conversational dataset as well.

The configuration for the data loader looks like this:

"""
Configuration for data loader.
"""

use_huggingface: bool = field(
    default = True,
    metadata = {'help': 'Whether to use huggingface datasets'}
)

train_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face training dataset."}
)

eval_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face validation dataset."}
)

choose_train_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face training dataset split."}
)

choose_eval_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face validation dataset split."}
)

remove_train_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Train dataset columns to remove."}
)

remove_eval_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Validation dataset columns to remove."}
)

seed: Optional[int] = field(
    default=42, 
    metadata={"help": "Random seed used for reproducibility."}
)

tokenizer_name: Optional[str] = field(
    default="gpt2",
    metadata={"help": "Tokenizer name."}
)

tokenizer_seq_length: Optional[int] = field(
    default=512, 
    metadata={"help": "Sequence lengths used for tokenizing examples."}
)

select_input_string: Optional[str] = field(
    default="text", 
    metadata={"help": "Select the key to used as the input string column."}
)

batch_size: Optional[int] = field(
    default=16, 
    metadata={"help": "Batch size for training and validation."}
)

save_to_path: Optional[str] = field(
    default="''", 
    metadata={"help": "Save the dataset to local disk."}
)
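For anyone adapting these fields, here is a rough sketch of how they might map onto Hugging Face datasets calls. This is an illustration using only the public datasets/transformers APIs, not the repo's exact build_dataloader.py code:

# Rough sketch of how the config fields might drive the data loader.
# Illustrative only; see build_dataloader.py for the actual implementation.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # tokenizer_name

# train_dataset_name and choose_train_split; streaming avoids
# downloading all of The Pile up front
train_data = load_dataset("the_pile", split="train", streaming=True)

train_data = train_data.remove_columns(["meta"])              # remove_train_columns
train_data = train_data.shuffle(seed=42, buffer_size=10_000)  # seed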

Let me know if you are still getting the previous error.

Best,

Enrico

@msaidbilgehan

I guess @biofoolgreen is facing the "subscript for class list will generate runtime exception" error in lamda_pytorch\config\config.py at lines 84 and 88. This is a PEP 563 (Postponed Evaluation of Annotations) topic, which is fixed by importing annotations from the __future__ module.

I also faced the same issue and fixed it with the solution above (adding the missing import).
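For illustration, a minimal sketch of that fix, assuming Python 3.8 (where built-in generics such as list[str] are not subscriptable at runtime) and a dataclass-based CFG as the traceback suggests; on Python 3.9+ the import is unnecessary because PEP 585 makes list[str] subscriptable natively:

# Sketch of the fix, assuming Python 3.8 and a dataclass-based CFG.
# PEP 563 turns annotations into strings, so ClassVar[list[str]] is
# never evaluated at class-creation time and the TypeError disappears.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import ClassVar

@dataclass
class CFG:
    train_columns: ClassVar[list[str]] = field(
        default = ['meta'],
        metadata = {"help": "Train dataset columns to remove."}
    )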

@conceptofmind
Owner

conceptofmind commented Aug 20, 2022

Hi @msaidbilgehan ,

What version of Python are you using? I have been reading more into the error: built-in generics like list[str] only became subscriptable at runtime in Python 3.9 (PEP 585), so on earlier versions that annotation fails unless evaluation is postponed (PEP 563). I may have to put a note or remove that part of the configuration completely, although that may make it more difficult for others who are not familiar with Hugging Face datasets.

Thank you,

Enrico
