Seminar Software Engineering, University of Bern & University of Fribourg, Fall 2022
Project by: Nicolas Maier
Supervised by: Sandra Greiner
Course professor: Timo Kehrer
Note: Python 3.8.10 was used
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Edit the .env
file, and set the OUTPUT_PATH
variable to an arbitrary path where the dataset and model will be stored
You can also change the default cache directory for the HF datasets library (it will become huge):
export HF_DATASETS_CACHE="/<path-to-cache-directory>"
wget -O dataset/java.zip https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
unzip -d dataset dataset/java.zip
The collection of Java methods from CodeSearchNet is downloaded; the raw compressed data is available in dataset/java/final/jsonl/<split>/java_<split>_<part>.jsonl.gz
cd dataset
python decompress.py
cd ..
The dataset is decompressed, and the resulting jsonl files are available in dataset/codesearchnet-java-decompressed/java_<split>_<part>.jsonl
This step was performed on a local computer in order to use Eclipse's GUI; resulting files were then transferred on the server
Note: Eclipse Modeling Tools 2022-09 was used
2 Eclipse Workspaces are provided:
Target workspace
- Open the
eclipse_workspaces/workspace_target
workspace in Eclipse - Use
File > Open Projects from File System...
to open theTargetProject
project - Close this instance of Eclipse
Plugin workspace
- In
Help > Install New Software...
install theMoDisco SDK
plugin - Use
File > Open Projects from File System...
to open theDiscoverDataset
project
Use the provided launch configuration (discover_dataset
) to perform the discovery
The resulting XMI is available in dataset/codesearchnet-java-discovered/java_<split>_<part>.jsonl
cd prepare_dataset
python prepare_dataset.py
cd ..
The HF dataset is now available in the <OUTPUT_PATH>/dataset/seq_dataset_filtered
directory
cd train_model
python finetune_codet5.py
cd ..
The model is now available in the <OUTPUT_PATH>/model
directory
cd evaluate
python eval_generate.py
python eval_compute.py
cd ..
Results:
Exact match score: 0.9825...
CodeBLEU score: 0.9988...
You can edit the visualization/test_code.txt
file to change the code that will be sent to the model
Example code:
public void test(int x, String y) {
A a = new A();
a.foo1();
a.foo2(42);
while (a.foo3()) {
B b = new B();
b.bar1();
if (b.bar2() && xyz) {
b.bar3(42);
} else {
b.bar4();
}
}
}
cd visualization
python generate_seq_diagram.py
cd ..
Open the generated sequence diagram (visualization/test_output.svg
file)
Example output diagram: