Add requirement.txt
#4

Hi, is there any plan to add a requirements.txt that allows us to install the needed packages with pip? Thanks.

Comments
Hi @biofoolgreen,

A package manager with pip will be implemented in the future. See the TODO and this PR.

Best, Enrico
Thanks @conceptofmind! I've generated requirements.txt by manually installing all the packages and then exporting it with pipreqs. See below:

Hopefully, it's useful for someone else. However, I can't run the model successfully when running:

Any idea?
Hi @biofoolgreen,

I am not currently receiving this error on my end when running a few test cases. I will work through a minimal reproducible example to see if I can get a matching error. I will test with local environment data loaders as well.

The parts of the code related to that error are:

# Remove unused columns from the training dataset
load_train_data = load_train_data.remove_columns(args.train_columns)

And:

train_columns: ClassVar[list[str]] = field(
    default = ['meta'],
    metadata={"help": "Train dataset columns to remove."}
)

Additionally, it seems as if there is a bug in Hugging Face datasets unrelated specifically to your error, and I will have to open an issue with them to get it resolved.

A few other notes:

Best, Enrico
I set up a minimal reproducible example in a Jupyter Notebook and it seems to be working fine. I will have to do a further review.

# Imports required by the snippet below
import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer(vocab_file='/token/vocab.json', merges_file='/token/merges.txt')
print(tokenizer.vocab_size)

load_train_data = load_dataset("the_pile", split="train", streaming=True)
load_train_data = load_train_data.remove_columns(['meta'])
print(next(iter(load_train_data)))

shuffled_train_files = load_train_data.shuffle(seed=42, buffer_size=10_000)
print(next(iter(shuffled_train_files)))

def tokenize(examples):
    # Tokenize a batch of texts, concatenate everything, and split the
    # result into fixed-length blocks of seq_length tokens.
    seq_length = 2048
    examples = tokenizer(examples["text"])
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= seq_length:
        # Drop the remainder so every block has exactly seq_length tokens.
        total_length = (total_length // seq_length) * seq_length
    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = copy.deepcopy(result["input_ids"])
    return result

tokenized_train_dataset = shuffled_train_files.map(tokenize, batched=True, remove_columns=['text'])
print(next(iter(tokenized_train_dataset)))
print(len(next(iter(tokenized_train_dataset))['input_ids']))
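For completeness, here is a minimal sketch of feeding the tokenized streaming dataset into a PyTorch DataLoader. This is an illustration rather than code from the repository; it assumes a recent datasets release where an IterableDataset can be passed directly to torch.utils.data.DataLoader, and it reuses the variable names from the notebook above.

from torch.utils.data import DataLoader

# Return torch tensors instead of Python lists so the default collate
# function can stack examples into [batch, seq_length] tensors.
torch_train_dataset = tokenized_train_dataset.with_format("torch")

train_loader = DataLoader(torch_train_dataset, batch_size=4)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # expected: torch.Size([4, 2048])
print(batch["labels"].shape)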
If you decide to use Poetry:
Hi @biofoolgreen,

I rebuilt the data loader to work locally: https://github.com/conceptofmind/LaMDA-pytorch/blob/main/lamda_pytorch/build_dataloader.py

A few things you are going to have to take into consideration if you are going to use the provided Pile dataset:
The configuration for the data loader looks like this:

"""
Configuration for data loader.
"""
use_huggingface: bool = field(
    default = True,
    metadata = {'help': 'Whether to use huggingface datasets'}
)
train_dataset_name: Optional[str] = field(
    default="the_pile",
    metadata={"help": "Path to Hugging Face training dataset."}
)
eval_dataset_name: Optional[str] = field(
    default="the_pile",
    metadata={"help": "Path to Hugging Face validation dataset."}
)
choose_train_split: Optional[str] = field(
    default="train",
    metadata={"help": "Choose Hugging Face training dataset split."}
)
choose_eval_split: Optional[str] = field(
    default="train",
    metadata={"help": "Choose Hugging Face validation dataset split."}
)
remove_train_columns: ClassVar[list[str]] = field(
    default = ['meta'],
    metadata={"help": "Train dataset columns to remove."}
)
remove_eval_columns: ClassVar[list[str]] = field(
    default = ['meta'],
    metadata={"help": "Validation dataset columns to remove."}
)
seed: Optional[int] = field(
    default=42,
    metadata={"help": "Random seed used for reproducibility."}
)
tokenizer_name: Optional[str] = field(
    default="gpt2",
    metadata={"help": "Tokenizer name."}
)
tokenizer_seq_length: Optional[int] = field(
    default=512,
    metadata={"help": "Sequence lengths used for tokenizing examples."}
)
select_input_string: Optional[str] = field(
    default="text",
    metadata={"help": "Select the key to be used as the input string column."}
)
batch_size: Optional[int] = field(
    default=16,
    metadata={"help": "Batch size for training and validation."}
)
save_to_path: Optional[str] = field(
    default="''",
    metadata={"help": "Save the dataset to local disk."}
)

Let me know if you are still getting the previous error.

Best, Enrico
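For anyone less familiar with Hugging Face datasets, here is a rough sketch of how fields like these could be consumed when building the training set. It is illustrative only: cfg stands for an instance of the configuration dataclass, and the actual implementation lives in build_dataloader.py linked above.

from datasets import load_dataset
from transformers import AutoTokenizer

# cfg is assumed to be an instance of the configuration dataclass above.
train_data = load_dataset(cfg.train_dataset_name, split=cfg.choose_train_split, streaming=True)
train_data = train_data.remove_columns(cfg.remove_train_columns)

tokenizer = AutoTokenizer.from_pretrained(cfg.tokenizer_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(examples):
    # Truncate or pad each input string to the configured sequence length.
    return tokenizer(
        examples[cfg.select_input_string],
        max_length=cfg.tokenizer_seq_length,
        truncation=True,
        padding="max_length",
    )

train_data = train_data.map(tokenize, batched=True, remove_columns=[cfg.select_input_string])

The chunk-and-concatenate tokenization from the notebook earlier in this thread could be swapped in for this simple truncate-and-pad scheme.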
I guess @biofoolgreen is facing the "subscript for class list will generate runtime exception" error at lines 84 and 88 of "lamda_pytorch\config\config.py". This is a PEP 563 (Postponed Evaluation of Annotations) issue, which is fixed by adding "from __future__ import annotations" at the top of the file. I also faced the same issue and fixed it with the solution above (adding the missing import).
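For reference, a minimal sketch of that fix. The surrounding dataclass is abridged and its exact contents are assumed from the snippets quoted in this thread, not copied from the repository.

# Postpone evaluation of annotations so that built-in generics like
# list[str] also work on Python 3.7/3.8 (PEP 563).
from __future__ import annotations

from dataclasses import dataclass, field
from typing import ClassVar, Optional

@dataclass
class CFG:
    # With the __future__ import, this annotation is stored as a string
    # and never evaluated at runtime, so list[str] no longer raises.
    remove_train_columns: ClassVar[list[str]] = field(
        default = ['meta'],
        metadata={"help": "Train dataset columns to remove."}
    )
    tokenizer_seq_length: Optional[int] = field(
        default=512,
        metadata={"help": "Sequence lengths used for tokenizing examples."}
    )

print(CFG())  # instantiates without error on Python 3.7+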
Hi @msaidbilgehan,

What version of Python are you using? I have been reading more into the error, and it seems that runtime handling of typing annotations in dataclasses changed after 3.8 (built-in generics such as list[str] are only subscriptable at runtime from Python 3.9 onward). I may have to add a note or remove that part of the configuration completely, although that may make it more difficult for others who are not familiar with Hugging Face datasets.

Thank you, Enrico