FiCoVuL: Fine-grained Code Vulnerability Localization with Graph Attention Networks

Abstract

FiCoVuL is a system for fine-grained code vulnerability detection and localization. It applies Graph Attention Networks (GAT) to Program Dependency Graphs (PDG) to identify security vulnerabilities in C code and to locate the affected code positions.

Overview

Code vulnerability detection is critical for software security, but existing methods often struggle with fine-grained localization. FiCoVuL addresses this challenge by:

Converting C code into Program Dependency Graphs (PDG) using Joern
Employing Graph Attention Networks (GAT) to capture complex code semantics
Providing both graph-level vulnerability classification and node-level vulnerability localization
Supporting multiple datasets and preprocessing methods

Key Features

Graph-Based Representation: Converts code to PDG to preserve structural and semantic information
Attention Mechanism: Uses GAT to focus on relevant code components
Fine-Grained Localization: Identifies specific vulnerable nodes within the code
Extensible Design: Supports multiple datasets and preprocessing techniques
User-Friendly Interface: Easy-to-use scripts for training, testing, and custom code analysis
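
To make the attention mechanism concrete, here is a minimal single-head sketch of how GAT-style attention weights can be computed over a tiny dependency graph. It is not FiCoVuL's implementation: the three-node graph, the feature vectors, the attention vectors, and the function names are all invented for illustration, and the linear projection that normally precedes attention is assumed to have been applied already.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with the slope used in the original GAT formulation."""
    return x if x > 0 else slope * x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_weights(features, neighbors, attn_src, attn_dst):
    """Softmax-normalized attention of each node over its neighbors.

    features:  node id -> projected feature vector
    neighbors: node id -> list of neighbor ids (self-loop included)
    attn_src / attn_dst: the two halves of the attention vector, so that
    score(i, j) = LeakyReLU(attn_src . h_i + attn_dst . h_j)
    """
    weights = {}
    for i, nbrs in neighbors.items():
        scores = [
            leaky_relu(dot(attn_src, features[i]) + dot(attn_dst, features[j]))
            for j in nbrs
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return weights


# Hypothetical toy graph: node 2 (e.g. a memcpy call) depends on node 0
# (a buffer declaration) and node 1 (a length computation).
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [0], 1: [1], 2: [0, 1, 2]}
weights = attention_weights(
    features, neighbors, attn_src=[0.5, -0.5], attn_dst=[1.0, 0.2]
)
for node, row in weights.items():
    print(node, {j: round(w, 3) for j, w in row.items()})
```

Each row of weights sums to 1, so a node's updated representation is a convex combination of its neighbors' features. Nodes that receive high attention are natural candidates for node-level localization, although how FiCoVuL derives its node-level predictions is not described here.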

Prerequisites

Python: 3.8 or higher
Conda: For environment management
Java/Scala: Required by the Joern tool

Installation

1. Clone the Repository

git clone https://github.com/Netsec-SJTU/FiCoVuL.git
cd FiCoVuL

2. Create and Activate Conda Environment

conda create -n ficovul python=3.8
conda activate ficovul

3. Install Python Dependencies

pip install -r requirements.txt

4. Set Up the Joern Tool

FiCoVuL uses Joern to convert C code into Program Dependency Graphs:

# Navigate to the Doxygen directory (in the project root)
cd Doxygen
# Download Joern CLI
wget https://github.com/joernio/joern/releases/download/v1.1.190/joern-cli.zip
unzip joern-cli.zip

# Make Joern executable
chmod +x joern-cli/joern
chmod +x joern-cli/joern-parse
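
Before generating graphs, it can help to confirm that both Joern launchers are present and executable. The following is a small sketch, not part of the repository; the function name is hypothetical, and the default path assumes the archive was unpacked into Doxygen/joern-cli as in the commands above.

```python
import os
from pathlib import Path

JOERN_BINARIES = ("joern", "joern-parse")


def missing_joern_binaries(cli_dir="Doxygen/joern-cli"):
    """Return the names of Joern launchers that are absent or not executable."""
    root = Path(cli_dir)
    missing = []
    for name in JOERN_BINARIES:
        path = root / name
        if not (path.is_file() and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_joern_binaries()
    if problems:
        print(f"Missing or not executable: {problems}")
    else:
        print("Joern launchers look ready")
```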

Data Processing

Input Format

Input data should be C language source code files (.c extension)
Each file should ideally contain a single function for optimal processing and localization

1. Generate Program Dependency Graphs

# Run from project root directory
cd Doxygen
python run_joern.py --dataset MY_DATASET

2. Preprocess Data

Convert Joern-generated graphs to model-compatible format:

# Run from project root directory
cd FiCoVuL
python dataset/data_preprocess.py

Preprocessed Data Location

data/datasets/[DATASET_NAME]_RAW/

train.json, valid.json, test.json: Split datasets
node_type_dict.pkl, word_lexical_dict.pkl, word_value_dict.pkl: Vocabulary mappings
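
A quick way to confirm that preprocessing produced all six files is to load each one and report its type and size. The sketch below is an assumption-laden helper, not project code: it assumes each split is a single JSON document and each vocabulary is a standard pickle, and it makes no claim about the records inside them.

```python
import json
import pickle
from pathlib import Path

SPLITS = ("train.json", "valid.json", "test.json")
VOCABS = ("node_type_dict.pkl", "word_lexical_dict.pkl", "word_value_dict.pkl")


def summarize_dataset_dir(dataset_dir):
    """Map each expected file name to 'missing' or (type name, length)."""
    root = Path(dataset_dir)
    summary = {}
    for name in SPLITS + VOCABS:
        path = root / name
        if not path.exists():
            summary[name] = "missing"
            continue
        if path.suffix == ".json":
            with path.open(encoding="utf-8") as fh:
                obj = json.load(fh)
        else:
            # Only unpickle files you generated yourself.
            with path.open("rb") as fh:
                obj = pickle.load(fh)
        size = len(obj) if hasattr(obj, "__len__") else None
        summary[name] = (type(obj).__name__, size)
    return summary
```

For example, `summarize_dataset_dir("data/datasets/CroVul_RAW")` would report which of the expected files exist for the CroVul dataset, assuming the directory follows the naming pattern shown above.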

Model Training

Basic Training

# Run from project root directory
cd FiCoVuL
python train3.py --task_name my_task

Configuration

Edit configs/config.json5 to customize training parameters:

{
  "task_name": "my_task",
  "dataset_name": "CroVul",
  "preprocess": {
    "method": "RAW",
    "proportion": [8, 1, 1]  // [train, validation, test] split ratio
  },
  "model": {
    "num_of_gat_layers": 4,
    "num_heads_per_gat_layer": [8, 4, 4, 6],
    "num_features_per_gat_layer": [64, 32, 64, 64, 64]
  },
  "run": {
    "num_of_epochs": 200,
    "batch_size": 128,
    "lr": 0.0005
  }
}
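
The list-valued model settings have to agree with the layer count, which is easy to break when editing by hand. The sketch below checks a configuration that has already been parsed into a Python dict. The rule that the feature list has one more entry than the number of layers is inferred from the sample above (4 layers, 5 feature sizes) and is an assumption, as are the function names.

```python
def check_config(cfg):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    model = cfg["model"]
    n_layers = model["num_of_gat_layers"]
    heads = model["num_heads_per_gat_layer"]
    feats = model["num_features_per_gat_layer"]
    if len(heads) != n_layers:
        problems.append(f"expected {n_layers} head counts, got {len(heads)}")
    # Assumption: one input size plus one output size per layer.
    if len(feats) != n_layers + 1:
        problems.append(f"expected {n_layers + 1} feature sizes, got {len(feats)}")
    proportion = cfg["preprocess"]["proportion"]
    if len(proportion) != 3 or any(p <= 0 for p in proportion):
        problems.append("proportion must hold three positive numbers")
    return problems


def split_fractions(proportion):
    """Turn a ratio such as [8, 1, 1] into fractions that sum to 1."""
    total = sum(proportion)
    return [p / total for p in proportion]


# The sample configuration, written as a plain dict. The standard json
# module rejects the // comments allowed in .json5 files, so reading the
# real file needs a JSON5-aware parser.
sample = {
    "preprocess": {"method": "RAW", "proportion": [8, 1, 1]},
    "model": {
        "num_of_gat_layers": 4,
        "num_heads_per_gat_layer": [8, 4, 4, 6],
        "num_features_per_gat_layer": [64, 32, 64, 64, 64],
    },
}
print(check_config(sample), split_fractions(sample["preprocess"]["proportion"]))
```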

Training Results

Training results are stored in three locations:

Model Binaries
- Path: data/models/binaries/
- Format: gat_{dataset_name}_{task_name}_{number}.pth
- Example: gat_CroVul_my_task_000000.pth
Model Checkpoints
- Path: data/models/checkpoints/
- Best model: gat_{dataset_name}_{task_name}_best.pth
- Epoch checkpoints: gat_{dataset_name}_ckpt_{task_name}_epoch_{epoch+1}.pth
TensorBoard Logs
- Path: data/runs/{task_name}/
- View: tensorboard --logdir data/runs/
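
The naming patterns above can be turned into small path helpers, which is convenient when filling in load_model for testing. This is a sketch with hypothetical function names. The six-digit zero padding is inferred from the single example file name, and the epoch argument is assumed to be zero-based because the pattern uses epoch+1.

```python
from pathlib import Path

BINARIES_DIR = Path("data/models/binaries")
CHECKPOINTS_DIR = Path("data/models/checkpoints")


def binary_path(dataset_name, task_name, number):
    """Path of a final model binary, e.g. gat_CroVul_my_task_000000.pth."""
    return BINARIES_DIR / f"gat_{dataset_name}_{task_name}_{number:06d}.pth"


def best_checkpoint_path(dataset_name, task_name):
    """Path of the best-model checkpoint."""
    return CHECKPOINTS_DIR / f"gat_{dataset_name}_{task_name}_best.pth"


def epoch_checkpoint_path(dataset_name, task_name, epoch):
    """Path of the checkpoint written after zero-based epoch `epoch`."""
    name = f"gat_{dataset_name}_ckpt_{task_name}_epoch_{epoch + 1}.pth"
    return CHECKPOINTS_DIR / name


print(binary_path("CroVul", "my_task", 0).as_posix())
```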

Model Testing

Batch Testing

Configure test settings in configs/config.json5:

{
  "run": {
    "load_model": "data/models/binaries/gat_CroVul_my_task_000000.pth",
    "test_only": true,
    "loader_batch_size": 1
  }
}

Run batch testing:

# Run from project root directory
cd FiCoVuL
python test3.py

Test Results

Test results are saved in:

data/results/

Filename format: test_results_{model_name}_{timestamp}.json
Contains graph-level predictions and node-level vulnerability localization
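
Because each run writes a new timestamped file, a helper that picks the most recent one saves some manual searching. The sketch below relies only on the file name pattern given above; it selects by modification time rather than by parsing the timestamp, since the timestamp format is not specified, and it reports only the top-level structure because the JSON schema is not documented here. The function names are hypothetical.

```python
import json
from pathlib import Path


def latest_result(results_dir="data/results"):
    """Return the most recently modified test_results_*.json, or None."""
    candidates = list(Path(results_dir).glob("test_results_*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def describe_result(path):
    """Return sorted top-level keys for a JSON object, else the type name."""
    with Path(path).open(encoding="utf-8") as fh:
        payload = json.load(fh)
    if isinstance(payload, dict):
        return sorted(payload.keys())
    return type(payload).__name__
```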

Configuration Parameters

Parameter	Description	Default
`dataset_name`	Dataset identifier	`CroVul`
`preprocess.method`	Preprocessing technique	`RAW`
`model.num_of_gat_layers`	Number of GAT layers	`4`
`model.num_heads_per_gat_layer`	Attention heads per GAT layer	`[8, 4, 4, 6]`
`run.num_of_epochs`	Training epochs	`200`
`run.batch_size`	Batch size	`128`
`run.lr`	Learning rate	`0.0005`
`run.classification_threshold`	Vulnerability classification threshold	`0.5`

Project Structure

├── configs/              # Configuration files
│   └── config.json5
├── dataset/              # Data processing module
│   ├── data_loader.py    # Data loading utilities
│   ├── data_preprocess.py # Data preprocessing pipeline
│   └── metrics.py        # Evaluation metrics
├── models/               # Model definitions
│   ├── layers/           # Neural network layers
│   │   ├── embedding_layer.py          # Node embedding
│   │   ├── gat_layer.py                # Graph Attention Layer
│   │   └── nodes_to_graph_representation.py # Graph-level representation
│   └── GATClassification.py # GAT model for vulnerability detection
├── utils/                # Utility functions
├── data/                 # Data and results storage
│   ├── datasets/         # Processed datasets
│   ├── models/           # Model files
│   │   ├── binaries/     # Final model binaries
│   │   └── checkpoints/  # Training checkpoints
│   └── results/          # Test results
├── train3.py             # Training script
├── test3.py              # Testing script
└── README.md             # Project documentation

Usage Examples

Example 1: Train on Custom Dataset

# Edit config.json5 to set dataset_name to "MY_DATASET"
python train3.py --task_name my_custom_train

Example 2: Test with Pretrained Model

# Edit config.json5 to set load_model path
python test3.py

Performance Metrics

FiCoVuL reports the following metrics:

Graph-level: Precision, Recall, Accuracy, F1-score
Node-level: Precision, Recall, Accuracy, F1-score
Weighted: Combined metrics with graph-level (70%) and node-level (30%) weights
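
As a worked illustration of these metrics, the sketch below thresholds predicted probabilities, computes the four scores for binary labels, and combines graph-level and node-level scores with the 70/30 weights. It is not the code in dataset/metrics.py. Treating the combination as a linear weighted average and using `>=` at the threshold are assumptions, and all function names are hypothetical.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Label a sample vulnerable (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


def weighted_metrics(graph, node, graph_weight=0.7, node_weight=0.3):
    """Assumed linear combination of graph-level and node-level scores."""
    return {k: graph_weight * graph[k] + node_weight * node[k] for k in graph}


# Made-up labels and probabilities, for illustration only.
graph_scores = binary_metrics([1, 0, 1, 1], apply_threshold([0.9, 0.2, 0.4, 0.7]))
node_scores = binary_metrics([0, 1, 1, 0], apply_threshold([0.1, 0.8, 0.6, 0.3]))
combined = weighted_metrics(graph_scores, node_scores)
print({k: round(v, 3) for k, v in combined.items()})
```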

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions, issues, or collaboration:

GitHub: https://github.com/Netsec-SJTU/FiCoVuL
Issues: https://github.com/Netsec-SJTU/FiCoVuL/issues

Acknowledgments

Special thanks to the Joern development team for their powerful code analysis tool.

Disclaimer: This tool is for research and educational purposes only. Always verify results with manual code review before making security decisions.
