FiCoVuL is a state-of-the-art system for fine-grained code vulnerability detection and localization. Leveraging Graph Attention Networks (GAT) and Program Dependency Graphs (PDG), FiCoVuL accurately identifies security vulnerabilities in C code and precisely locates the affected code positions.
Code vulnerability detection is critical for software security, but existing methods often struggle with fine-grained localization. FiCoVuL addresses this challenge by:
- Converting C code into Program Dependency Graphs (PDG) using Joern
- Employing Graph Attention Networks (GAT) to capture complex code semantics
- Providing both graph-level vulnerability classification and node-level vulnerability localization
- Supporting multiple datasets and preprocessing methods
- Graph-Based Representation: Converts code to PDG to preserve structural and semantic information
- Attention Mechanism: Uses GAT to focus on relevant code components
- Fine-Grained Localization: Identifies specific vulnerable nodes within the code
- Extensible Design: Supports multiple datasets and preprocessing techniques
- User-Friendly Interface: Easy-to-use scripts for training, testing, and custom code analysis
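To make the graph-based representation concrete, the sketch below builds a toy program dependency graph by hand with `networkx`. The node and edge attributes are purely illustrative and do not mirror Joern's actual PDG schema.

```python
# Toy illustration of a program dependency graph (PDG) for a short C function.
# The node/edge attribute names are illustrative, not Joern's actual schema.
import networkx as nx

pdg = nx.DiGraph()
# Nodes represent statements of a hypothetical function.
pdg.add_node(0, code="char buf[8];")
pdg.add_node(1, code="int n = read_len();")
pdg.add_node(2, code="if (n > 0)")
pdg.add_node(3, code="memcpy(buf, src, n);")  # potential overflow site

# Data-dependency edges: a statement uses a value defined earlier.
pdg.add_edge(1, 2, kind="data")     # `n` defined at 1, used at 2
pdg.add_edge(1, 3, kind="data")     # `n` used as the copy length at 3
pdg.add_edge(0, 3, kind="data")     # `buf` defined at 0, written at 3
# Control-dependency edge: statement 3 executes only if the branch at 2 is taken.
pdg.add_edge(2, 3, kind="control")

print(pdg.number_of_nodes(), "nodes,", pdg.number_of_edges(), "edges")
```

FiCoVuL operates on graphs of this general shape (statements as nodes, data/control dependencies as edges) and predicts labels for both the whole graph and individual nodes.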
- Python: 3.8 or higher
- Conda: For environment management
- Java/Scala: Required for Joern tool
```bash
git clone https://github.com/Netsec-SJTU/FiCoVuL.git
cd FiCoVuL

conda create -n ficovul python=3.8
conda activate ficovul

pip install -r requirements.txt
```

FiCoVuL uses Joern to convert C code into Program Dependency Graphs:
```bash
# Navigate to the Doxygen directory (in the project root)
cd Doxygen

# Download Joern CLI
wget https://github.com/joernio/joern/releases/download/v1.1.190/joern-cli.zip
unzip joern-cli.zip

# Make Joern executable
chmod +x joern-cli/joern
chmod +x joern-cli/joern-parse
```

- Input data should be C language source code files (`.c` extension)
- Each file should ideally contain a single function for optimal processing and localization
```bash
# Run from project root directory
cd Doxygen
python run_joern.py --dataset MY_DATASET
```

Convert Joern-generated graphs to the model-compatible format:
```bash
# Run from project root directory
cd FiCoVuL
python dataset/data_preprocess.py
```

The preprocessed data is written to `data/datasets/[DATASET_NAME]_RAW/`:

- `train.json`, `valid.json`, `test.json`: Split datasets
- `node_type_dict.pkl`, `word_lexical_dict.pkl`, `word_value_dict.pkl`: Vocabulary mappings
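A quick way to sanity-check the preprocessed output is to load the split files and vocabularies. This is a minimal sketch: the internal structure of the JSON and pickle files is an assumption here, not a documented schema.

```python
# Minimal sanity check of the preprocessing output (file schema assumed, not documented).
import json
import pickle
from pathlib import Path

root = Path("data/datasets/MY_DATASET_RAW")  # substitute your dataset name

for split in ("train", "valid", "test"):
    with open(root / f"{split}.json") as f:
        samples = json.load(f)
    print(f"{split}: {len(samples)} entries")

with open(root / "node_type_dict.pkl", "rb") as f:
    node_types = pickle.load(f)
print("node types in vocabulary:", len(node_types))
```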
```bash
# Run from project root directory
cd FiCoVuL
python train3.py --task_name my_task
```

Edit `configs/config.json5` to customize training parameters:
```json5
{
"task_name": "my_task",
"dataset_name": "CroVul",
"preprocess": {
"method": "RAW",
"proportion": [8, 1, 1] // [train, validation, test] split ratio
},
"model": {
"num_of_gat_layers": 4,
"num_heads_per_gat_layer": [8, 4, 4, 6],
"num_features_per_gat_layer": [64, 32, 64, 64, 64]
},
"run": {
"num_of_epochs": 200,
"batch_size": 128,
"lr": 0.0005
}
}
```

Training results are stored in three locations:
- Model Binaries
  - Path: `data/models/binaries/`
  - Format: `gat_{dataset_name}_{task_name}_{number}.pth`
  - Example: `gat_CroVul_my_task_000000.pth`
- Model Checkpoints
  - Path: `data/models/checkpoints/`
  - Best model: `gat_{dataset_name}_{task_name}_best.pth`
  - Epoch checkpoints: `gat_{dataset_name}_ckpt_{task_name}_epoch_{epoch+1}.pth`
- TensorBoard Logs
  - Path: `data/runs/{task_name}/`
  - View: `tensorboard --logdir data/runs/`
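Checkpoints can be inspected directly with PyTorch. A minimal sketch; whether the `.pth` file holds a bare `state_dict` or a richer checkpoint dictionary depends on how `train3.py` saves it, so the code only prints what it finds.

```python
# Inspect a saved checkpoint (contents depend on how train3.py writes it).
import torch

ckpt_path = "data/models/checkpoints/gat_CroVul_my_task_best.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")

if isinstance(ckpt, dict):
    # Could be a bare state_dict or a wrapper dict with extra training metadata.
    for key in list(ckpt)[:10]:
        print(key)
else:
    print(type(ckpt))
```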
- Configure test settings in `configs/config.json5`:

```json5
{
  "run": {
    "load_model": "data/models/binaries/gat_CroVul_my_task_000000.pth",
    "test_only": true,
    "loader_batch_size": 1
  }
}
```

- Run batch testing:
```bash
# Run from project root directory
cd FiCoVuL
python test3.py
```

Test results are saved in `data/results/`:

- Filename format: `test_results_{model_name}_{timestamp}.json`
- Contains graph-level predictions and node-level vulnerability localization
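The snippet below picks up the most recent results file and prints its top-level structure; the exact JSON layout of the predictions is an assumption here.

```python
# Load the most recent test results file (JSON layout assumed, not documented).
import json
from pathlib import Path

results_dir = Path("data/results")
latest = max(results_dir.glob("test_results_*.json"), key=lambda p: p.stat().st_mtime)

with open(latest) as f:
    results = json.load(f)

print("loaded", latest.name)
# Show the top-level keys (or the entry count if the file is a list).
print(list(results)[:10] if isinstance(results, dict) else len(results))
```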
| Parameter | Description | Default |
|---|---|---|
| `dataset_name` | Dataset identifier | `CroVul` |
| `preprocess.method` | Preprocessing technique | `RAW` |
| `model.num_of_gat_layers` | Number of GAT layers | `4` |
| `model.num_heads_per_gat_layer` | Attention heads per GAT layer | `[8, 4, 4, 6]` |
| `run.num_of_epochs` | Training epochs | `200` |
| `run.batch_size` | Batch size | `128` |
| `run.lr` | Learning rate | `0.0005` |
| `run.classification_threshold` | Vulnerability classification threshold | `0.5` |
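Since the configuration file uses JSON5 (comments are allowed, as in the example above), it can be read with the `json5` package for quick inspection. A minimal sketch; the project may load the file differently internally.

```python
# Read configs/config.json5 (requires `pip install json5`).
import json5

with open("configs/config.json5") as f:
    config = json5.load(f)

print("dataset:", config["dataset_name"])
print("GAT layers:", config["model"]["num_of_gat_layers"])
print("learning rate:", config["run"]["lr"])
```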
```
├── configs/                                  # Configuration files
│   └── config.json5
├── dataset/                                  # Data processing module
│   ├── data_loader.py                        # Data loading utilities
│   ├── data_preprocess.py                    # Data preprocessing pipeline
│   └── metrics.py                            # Evaluation metrics
├── models/                                   # Model definitions
│   ├── layers/                               # Neural network layers
│   │   ├── embedding_layer.py                # Node embedding
│   │   ├── gat_layer.py                      # Graph Attention Layer
│   │   └── nodes_to_graph_representation.py  # Graph-level representation
│   └── GATClassification.py                  # GAT model for vulnerability detection
├── utils/                                    # Utility functions
├── data/                                     # Data and results storage
│   ├── datasets/                             # Processed datasets
│   ├── models/                               # Model files
│   │   ├── binaries/                         # Final model binaries
│   │   └── checkpoints/                      # Training checkpoints
│   └── results/                              # Test results
├── train3.py                                 # Training script
├── test3.py                                  # Testing script
└── README.md                                 # Project documentation
```
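For orientation, the sketch below shows the general shape of a model that produces both graph-level and node-level predictions from a stack of GAT layers. It uses `torch_geometric` for brevity and is only a conceptual stand-in for `models/GATClassification.py`, not the repository's actual implementation; the layer sizes and pooling choice are assumptions.

```python
# Conceptual sketch of a GAT that predicts both a graph label and per-node labels.
# NOT the repository's GATClassification.py; it only illustrates the dual-level idea.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool


class DualLevelGAT(nn.Module):
    def __init__(self, in_dim: int = 64, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)       # output: hidden * heads
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)   # output: hidden
        self.node_head = nn.Linear(hidden, 1)    # node level: is this node vulnerable?
        self.graph_head = nn.Linear(hidden, 1)   # graph level: is this function vulnerable?

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        node_logits = self.node_head(h).squeeze(-1)                              # one logit per node
        graph_logits = self.graph_head(global_mean_pool(h, batch)).squeeze(-1)   # one logit per graph
        return graph_logits, node_logits
```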
Train on the custom dataset:

```bash
# Edit config.json5 to set dataset_name to "MY_DATASET"
python train3.py --task_name my_custom_train
```

Then test the trained model:

```bash
# Edit config.json5 to set load_model path
python test3.py
```

FiCoVuL reports the following metrics:
- Graph-level: Precision, Recall, Accuracy, F1-score
- Node-level: Precision, Recall, Accuracy, F1-score
- Weighted: Combined metrics with graph-level (70%) and node-level (30%) weights
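The weighted score above is a simple convex combination of the two levels. A minimal sketch of that combination (the per-level metrics themselves come from the evaluation code, e.g. `dataset/metrics.py`):

```python
# Combine graph-level and node-level metrics with the 70% / 30% weighting.
def weighted_metric(graph_value: float, node_value: float,
                    graph_weight: float = 0.7, node_weight: float = 0.3) -> float:
    return graph_weight * graph_value + node_weight * node_value


# Example: graph-level F1 = 0.90, node-level F1 = 0.60 -> weighted F1 = 0.81
print(weighted_metric(0.90, 0.60))
```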
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or collaboration:
- GitHub: https://github.com/Netsec-SJTU/FiCoVuL
- Issues: https://github.com/Netsec-SJTU/FiCoVuL/issues
- Special thanks to the Joern development team for their powerful code analysis tool
Disclaimer: This tool is for research and educational purposes only. Always verify results with manual code review before making security decisions.