blackboxo/SplitGNN

 
 

SplitGNN

This is the PyTorch implementation for the CIKM 2023 paper:

SplitGNN: Spectral Graph Neural Network for Fraud Detection against Heterophily

Bin Wu, Xinyu Yao, Boyan Zhang, Kuo-Ming Chao, Yinsheng Li

[Figure: SplitGNN model]

Dependencies

  • Python >= 3.8
  • numpy >= 1.22.4
  • scipy >= 1.4.1
  • scikit-learn >= 1.1.2
  • PyTorch >= 1.11.0
  • DGL >= 0.9.1

Usage

  • src/: includes all code scripts.
  • data/: includes original datasets:
    • YelpChi.zip: The original dataset of YelpChi, which contains hotel and restaurant reviews filtered (spam) and recommended (legitimate) by Yelp.
    • Amazon.zip: The original dataset of Amazon, which contains product reviews under the Musical Instruments category.
    • FDCompCN.zip: The processed dataset of FDCompCN, which contains financial statement fraud of companies in China from CSMAR database.
  • config/: includes the parameter settings for the three datasets.
    • yelp.yaml: The general parameters of YelpChi.
    • amazon.yaml: The general parameters of Amazon.
    • comp.yaml: The general parameters of FDCompCN.
  • result/: includes the results of models.

Model Training

We take YelpChi as an example to illustrate how to use this repository.

# Unzip the dataset
unzip ./data/YelpChi.zip -d ./data/

# Move to src/
cd src/

# Convert the original dataset to dgl graph
# The generated dgl graph contains the features, graph structure and edge labels.
python data_preprocess.py --dataset yelp

# Train and test the dataset
# To change the training parameters, modify the corresponding YAML file in config/.
python train.py --dataset yelp 
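
Where the unzip command is not available (for example on some Windows setups), the extraction step could also be done with Python's standard library. This is a minimal sketch and not part of the repository; the function name extract_dataset is hypothetical, and the paths in the usage comment mirror the ones used above.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, out_dir):
    """Extract a dataset archive into out_dir and return the archive's member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Example call from the repository root (same paths as the unzip step):
# extract_dataset("./data/YelpChi.zip", "./data/")
```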

Data Description of FDCompCN

Dataset    #Nodes   #Fraud (%)     #Features   Relation   #Edges
FDCompCN   5,317    559 (10.5%)    57          C-I-C      5,686
                                               C-S-C      760
                                               C-P-C      1,043
                                               Homo       7,407

FDCompCN is a new fraud detection dataset for detecting financial statement fraud of companies in China. We construct a multi-relation graph from the supplier, customer, shareholder, and financial information disclosed in the financial statements of Chinese companies. These data are obtained from the China Stock Market and Accounting Research (CSMAR) database. We select samples between 2020 and 2023, covering 5,317 publicly listed Chinese companies traded on the Shanghai, Shenzhen, and Beijing Stock Exchanges.

FDCompCN has three relations:

  1. C-I-C that connects companies that have investment relationships.
  2. C-P-C that connects companies and their disclosed customers.
  3. C-S-C that connects companies and their disclosed suppliers.

Each company contains basic information and financial statement information. The basic information includes registered capital, currency, operating status, company type, industry, city, personnel size, and the number of insured individuals. The financial statement information includes indicators such as long-term accounts receivable, long-term liabilities, and total assets. The original financial statement information contains 149 indicators, of which we retain 49. Finally, we process the basic information and financial statement information into 57-dimensional features.

Financial statement fraud covers seven types of violations disclosed by Chinese regulators: inflated profits, inflated assets, false statements, delay in disclosure, omission of significant information, fraudulent disclosures, and general accounting irregularities. Companies with more than three violations are labeled as fraudulent samples, while all other companies are labeled as benign. This yields 559 fraud samples and 4,758 benign samples, with fraud samples accounting for 10.51% of the total.
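
The labeling rule and the resulting class balance can be checked with a few lines of Python. This is an illustrative sketch only: the function names label_company and fraud_ratio and the toy violation counts are hypothetical, and the threshold simply restates the "more than three violations" rule described above.

```python
FRAUD_THRESHOLD = 3  # "more than three violations" => fraudulent


def label_company(num_violations):
    """Return 1 (fraud) if the company has more than three violations, else 0 (benign)."""
    return 1 if num_violations > FRAUD_THRESHOLD else 0


def fraud_ratio(num_fraud, num_benign):
    """Share of fraud samples among all samples, in percent."""
    return 100.0 * num_fraud / (num_fraud + num_benign)


# Toy violation counts (made up for illustration, not taken from the dataset).
toy_counts = [0, 3, 4, 7]
toy_labels = [label_company(c) for c in toy_counts]  # [0, 0, 1, 1]

# Sample counts reported for FDCompCN: 559 fraud, 4,758 benign.
ratio = fraud_ratio(559, 4758)  # about 10.51
```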

Reproduce Results

The hyperparameters of SplitGNN used to reproduce the results:

Parameter       YelpChi   Amazon    FDCompCN
learning rate   0.01      0.1       0.001
weight decay    0.00005   0.00005   0.00005
gamma           0.6       0.4       0.8
n_hidden        8         8         8
order C         2         2         3
tunable K       1         1         2
dropout         0.1       0.1       0.1
max epoch       1000      1000      1000
early stop      100       50        100
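
The interaction between "max epoch" and "early stop" can be sketched as follows. This assumes that "early stop" is a patience value, i.e. the number of epochs without validation improvement after which training halts; that reading is an assumption, and train.py may implement it differently. The function train_with_early_stopping and its evaluate_epoch callback are hypothetical names, not part of the repository.

```python
def train_with_early_stopping(evaluate_epoch, max_epoch=1000, patience=100):
    """Run up to max_epoch epochs and stop once the validation score has not
    improved for `patience` consecutive epochs.

    evaluate_epoch(epoch) is a caller-supplied function (hypothetical here)
    that trains for one epoch and returns a validation score, higher is better.
    Returns (best_score, best_epoch, epochs_run).
    """
    best_score, best_epoch = float("-inf"), -1
    epochs_run = 0
    for epoch in range(max_epoch):
        score = evaluate_epoch(epoch)
        epochs_run = epoch + 1
        if score > best_score:
            # New best validation score: remember it and reset the patience window.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop early.
            break
    return best_score, best_epoch, epochs_run
```

With the YelpChi settings from the table this would be called with max_epoch=1000 and patience=100, and with patience=50 for Amazon.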

Run on your Datasets

To run SplitGNN on your datasets, you need to prepare the following data:

  • A homogeneous or multi-relation graph
  • Node labels
  • Node features

Transform the data into DGL format using the code in data_preprocess.py as a reference.
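
Before converting, it may help to sanity-check the three inputs listed above. The following is a minimal sketch in plain Python, not code from the repository: the function name check_graph_inputs and the input layout (edge lists per relation, binary labels, per-node feature lists) are assumptions made for illustration. The actual DGL conversion should follow data_preprocess.py.

```python
def check_graph_inputs(num_nodes, relations, labels, features):
    """Validate raw inputs before converting them to a DGL graph.

    relations: dict mapping a relation name to a list of (src, dst) node-id pairs;
               a homogeneous graph is the special case of a single relation.
    labels:    list of 0/1 node labels (assumed: 1 = fraud), one per node.
    features:  list of per-node feature vectors, all of the same length.
    Returns the feature dimension.
    """
    if len(labels) != num_nodes or len(features) != num_nodes:
        raise ValueError("labels and features need one entry per node")
    if any(y not in (0, 1) for y in labels):
        raise ValueError("labels must be binary (0 = benign, 1 = fraud)")
    dims = {len(x) for x in features}
    if len(dims) != 1:
        raise ValueError("all feature vectors must have the same length")
    for name, edges in relations.items():
        for src, dst in edges:
            if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
                raise ValueError(f"relation {name!r} has an edge outside the node range")
    return dims.pop()


# Toy multi-relation graph with 3 nodes and 2-dimensional features (made-up data).
toy_relations = {"rel_a": [(0, 1), (1, 2)], "rel_b": [(0, 2)]}
toy_dim = check_graph_inputs(3, toy_relations, [0, 1, 0], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
```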

Citation

Please cite the paper if you use our code or data.

@inproceedings{10.1145/3583780.3615067,
  author = {Wu, Bin and Yao, Xinyu and Zhang, Boyan and Chao, Kuo-Ming and Li, Yinsheng},
  title = {SplitGNN: Spectral Graph Neural Network for Fraud Detection against Heterophily},
  year = {2023},
  url = {https://doi.org/10.1145/3583780.3615067},
  booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
  series = {CIKM '23}
}
