commitgen_dataset.zip contains the data needed to train the model. unzip it before training, or training will not work properly
run:
python scripts/harvest_commits.py
to gather data (or add to the dataset) from the github repos listed in REPO_LIST inside that file; you can modify the list at will
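A harvesting script like this typically walks each repo's history and splits the raw log into per-commit (message + diff) chunks. The helper below is a hypothetical sketch of that splitting step, assuming the script parses `git log -p` output; the actual scripts/harvest_commits.py may work differently.

```python
import re

# Hypothetical helper: split raw `git log -p` output into per-commit chunks
# (commit message followed by its diff). Each commit in the default log
# format starts with "commit <40-hex-sha>" on its own line.
def split_commits(log_text):
    chunks = re.split(r"(?m)^commit [0-9a-f]{40}\n", log_text)
    return [c.strip() for c in chunks if c.strip()]
```

Each returned chunk still contains the author header, the indented message, and the `diff --git ...` body, which the harvester can then separate into a (diff, message) training pair.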
after data gathering run:
python data/prepare_bpe.py
to regenerate the meta and train/val bin files (tokenizer + detokenizer) if you want to run the model at the BPE level, or run data/prepare_char.py if you want to run the model at the char level
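In nanoGPT-style projects, a char-level prepare step builds a character vocabulary, encodes the corpus as uint16 ids, and splits it into train/val portions (the bin files), saving the encode/decode tables in a meta file. The sketch below shows that idea; the function name and the 90/10 split are assumptions, not necessarily what data/prepare_char.py does.

```python
from array import array  # stdlib stand-in for the uint16 arrays written to .bin files

# Sketch of a char-level prepare step (hypothetical; the real script would
# also write train.bin / val.bin to disk and pickle the meta dict).
def prepare_char(text):
    chars = sorted(set(text))                     # character vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}  # encoder: char -> id
    itos = {i: ch for ch, i in stoi.items()}      # decoder: id -> char
    ids = array("H", (stoi[c] for c in text))     # uint16 token ids
    n = int(0.9 * len(ids))                       # assumed 90/10 train/val split
    meta = {"vocab_size": len(chars), "stoi": stoi, "itos": itos}
    return ids[:n], ids[n:], meta
```

The meta dict is what lets generation decode sampled ids back into text, which is why resetting it together with the bin files keeps the tokenizer and detokenizer in sync.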
to run training run:
python train.py config/train_vm_bpe.py
or any other training config in the config folder that you want to use
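nanoGPT configs are plain Python files of top-level assignments that override the defaults in train.py. The fragment below is an illustrative sketch only; the actual hyperparameters in config/train_vm_bpe.py may differ.

```python
# Hypothetical config sketch in nanoGPT's override style.
# Every value here is an assumption, not the project's real settings.
out_dir = "out-commitgen-bpe"  # where checkpoints are written
dataset = "commitgen"          # points train.py at data/commitgen
batch_size = 12
block_size = 1024              # max context length in tokens
n_layer = 12                   # GPT-2-small-ish scale (~100M+ params)
n_head = 12
n_embd = 768
learning_rate = 6e-4
max_iters = 5000
```

Because train.py executes the config to override its defaults, swapping configs is just a matter of passing a different file path on the command line.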
to test model with git diffs run:
python generate_commit.py
and you will be prompted to enter a diff; it will give you back a commit message. you can also hardcode a diff in the script and test that instead
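Generation scripts like this usually wrap the entered diff in the same format the model saw at training time before sampling. The sketch below illustrates that step; the delimiter strings and the truncation limit are assumptions, not the actual format used by generate_commit.py.

```python
# Hypothetical prompt builder; the real generate_commit.py loads the trained
# checkpoint and samples a message after a prompt like this.
def build_prompt(diff, max_chars=2000):
    diff = diff[:max_chars]  # small model, small context: truncate long diffs
    return f"<diff>\n{diff}\n</diff>\n<msg>\n"
```

Truncating the diff up front matters here because, as noted below, the model only handles fairly small diffs within its context window.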
-
note: the model only works for fairly small diffs, since it only has around 110M parameters when run on the commitgen dataset; it trains in about 2.5 hours on a single Tesla T4 GPU
-
credit: model.py and train.py come from the nanoGPT repo: https://github.com/karpathy/nanoGPT/tree/master