An end-to-end machine learning notebook project for detecting potentially fraudulent vehicle insurance claims. The project combines traditional claim attributes with synthetic telematics signals, then trains multiple complementary models and fuses their learned embeddings into a final fraud-risk classifier.
The pipeline is intentionally split into notebooks so each modeling branch can be inspected, rerun, and evaluated independently:
- Preprocess raw claim and telematics data.
- Train a tabular CatBoost fraud model.
- Train a temporal LSTM over ordered claim/telematics sequences.
- Train a graph neural network over claim relationship edges.
- Merge all branch embeddings and train a final fusion DNN.
Vehicle insurance fraud is a rare-event classification problem: most claims are legitimate, while fraudulent claims are costly and difficult to identify. This project approaches that imbalance by combining several views of the same claim:
- Tabular view: structured insurance fields such as vehicle category, policy type, claimant demographics, accident metadata, and claim history.
- Telematics view: driving behavior signals such as speed, braking, acceleration, night driving, distance, and idle time.
- Temporal view: recent neighboring claims ordered by available time fields, modeled as short sequences.
- Graph view: relationships between claims that share meaningful insurance attributes, modeled with a graph convolutional network.
- Fusion view: a final neural network that learns from all branch embeddings and model probabilities.
The result is a research-style fraud detection pipeline that emphasizes interpretability of stages, feature engineering, embedding generation, and threshold tuning for high-recall fraud screening.
.
|-- data/
| |-- fraud_oracle_with_telematics.csv # Raw source dataset with claim and telematics fields
| |-- preprocessed.csv # Cleaned, encoded, scaled canonical dataset
| |-- tabular_embeddings.csv # CatBoost branch embeddings and probabilities
| |-- temporal_embeddings.csv # LSTM branch embeddings and probabilities
| `-- graph_embeddings.csv # GCN branch embeddings and probabilities
|-- notebooks/
| |-- 01_preprocessing.ipynb # Raw data cleaning and feature engineering
| |-- 02_tabular_model.ipynb # CatBoost or sklearn fallback model
| |-- 03_temporal_model.ipynb # LSTM sequence model
| |-- 04_graph_model.ipynb # PyTorch Geometric GCN model
| `-- 05_fusion.ipynb # Final embedding fusion DNN and threshold tuning
|-- requirements.txt # Python dependencies
|-- LICENSE # MIT license
`-- README.md
The local environment directory fraud_env/ is ignored by .gitignore and should not be treated as source code.
The raw file is data/fraud_oracle_with_telematics.csv.
Current raw dataset shape:
- Rows: 15,420
- Columns: 43
- Target column: FraudFound_P, normalized during preprocessing to fraudfound_p
Important raw feature groups include:
- Claim timing: Month, WeekOfMonth, DayOfWeek, MonthClaimed, WeekOfMonthClaimed
- Claim and policy details: Fault, PolicyType, VehicleCategory, VehiclePrice, BasePolicy
- Driver and policyholder fields: Sex, MaritalStatus, Age, AgeOfPolicyHolder
- Claim history: PastNumberOfClaims, NumberOfSuppliments, Days_Policy_Accident, Days_Policy_Claim
- Investigation indicators: PoliceReportFiled, WitnessPresent, AgentType
- Telematics fields: avg_speed_kmph, max_speed_kmph, hard_brakes_per_trip, rapid_acceleration_events, trip_duration_minutes, distance_km, night_driving_ratio, urban_driving_ratio, harsh_cornering_events, idle_time_minutes
After preprocessing, the target distribution in the checked notebook output is:
legitimate claims: 14,497
fraud claims: 923
This means fraud is only about 6 percent of the dataset, so accuracy alone is not a useful success metric. The notebooks report ROC-AUC, PR-AUC, precision, recall, F1, and confusion matrices.
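To make the accuracy caveat concrete, the arithmetic below uses only the two class counts reported above. The variable names are illustrative and are not taken from the notebooks.

```python
# Class counts reported in the checked notebook output.
legitimate = 14_497
fraud = 923
total = legitimate + fraud

fraud_rate = fraud / total
# A trivial model that labels every claim "legitimate" is right on every
# legitimate claim and wrong on every fraud.
majority_accuracy = legitimate / total
majority_recall = 0 / fraud  # it never flags a fraud

print(f"total claims:        {total}")
print(f"fraud rate:          {fraud_rate:.4f}")
print(f"all-legit accuracy:  {majority_accuracy:.4f}")
print(f"all-legit recall:    {majority_recall:.4f}")
```

A model that never flags fraud would score roughly 94 percent accuracy while catching nothing, which is why recall, precision, and PR-AUC carry more weight here.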
Notebook: notebooks/01_preprocessing.ipynb
Input:
data/fraud_oracle_with_telematics.csv
Output:
data/preprocessed.csv
Main steps:
- Standardizes column names and text values.
- Creates a stable claim_id from PolicyNumber.
- Drops administrative identifiers such as PolicyNumber and RepNumber from model features.
- Fills numeric missing values with medians.
- Fills categorical missing values with modes.
- Converts range-like insurance fields into numeric values.
- Engineers telematics and claim-risk features:
  - speeding_risk
  - speed_volatility
  - harsh_braking_risk
  - harsh_acceleration_risk
  - harsh_cornering_risk
  - harsh_driving_index
  - high_night_driving
  - high_urban_driving
  - fast_claim
  - high_claim_history
  - claim_driving_risk
- Encodes categorical columns with LabelEncoder.
- Scales numeric columns with StandardScaler, while preserving claim_id, fraudfound_p, and year.
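Two of the steps above, range conversion and median filling, can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names `range_to_number` and `fill_with_median` are hypothetical, and the conversion rules (midpoint for "A to B", the stated bound for "more than N", zero for "none") are assumptions that may differ from the actual mapping.

```python
import re
from statistics import median


def range_to_number(value):
    """Map a range-like label to one number (illustrative rules only).

    Assumed rules, not confirmed by the notebook:
      "1 to 7"        -> midpoint of the two bounds
      "more than 30"  -> the stated bound
      "none"          -> 0
    Returns None when no number can be extracted.
    """
    text = str(value).strip().lower()
    if text == "none":
        return 0.0
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fallback = median(observed)
    return [fallback if v is None else v for v in values]


# Made-up labels in the style of range-like insurance fields.
raw = ["1 to 7", "more than 30", "none", "unknown", "8 to 15"]
converted = [range_to_number(v) for v in raw]
print(converted)
print(fill_with_median(converted))
```

In a pandas workflow the same idea would usually be applied column by column before encoding and scaling.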
Checked output shape:
data/preprocessed.csv: 15,420 rows x 53 columns
Notebook: notebooks/02_tabular_model.ipynb
Input:
data/preprocessed.csv
Output:
data/tabular_embeddings.csv
Model:
- Primary: CatBoostClassifier
- Fallback: HistGradientBoostingClassifier if CatBoost is unavailable
Main steps:
- Splits data with stratification using test_size=0.2 and random_state=42.
- Trains a balanced classifier for the rare fraud target.
- Uses CatBoost leaf indexes as compact tabular embeddings.
- Normalizes embedding columns for fusion stability.
- Saves 16 embedding dimensions, raw score, tabular fraud probability, and target.
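The steps above do not spell out how the classifier is balanced. One common scheme, shown here as an assumption, is inverse-frequency weighting, the formula scikit-learn uses for class_weight="balanced": n_samples / (n_classes * class_count). The function name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Full-dataset counts from the checked output; a real run would use the
# counts of the training split only.
weights = balanced_class_weights({0: 14_497, 1: 923})
print({label: round(w, 3) for label, w in weights.items()})
print(f"fraud/legitimate weight ratio: {weights[1] / weights[0]:.1f}")
```

With these counts each fraud example weighs about 15.7 times as much as a legitimate one, which pushes the model toward recall at the expense of precision.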
Checked output shape:
data/tabular_embeddings.csv: 15,420 rows x 20 columns
Representative checked metrics from the notebook:
ROC-AUC: 0.8330
PR-AUC: 0.2192
fraud recall at 0.5 threshold: 0.79
fraud precision at 0.5 threshold: 0.15
Notebook: notebooks/03_temporal_model.ipynb
Input:
data/preprocessed.csv
Output:
data/temporal_embeddings.csv
Model:
- PyTorch LSTM
Main steps:
- Selects temporal and telematics columns.
- Sorts claims by available time fields: year, month, weekofmonth, dayofweek, and claim_id.
- Builds fixed-length rolling windows of length 5.
- Trains an LSTM with weighted cross entropy.
- Saves 16 temporal embedding dimensions and temporal fraud probability.
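The sorting and windowing steps can be sketched as follows. This is an illustration under stated assumptions: `build_windows` is a hypothetical helper, the example sorts on a subset of the listed time fields, and the padding rule for the earliest claims (repeat the first available row) is a guess, since the README only says that every claim receives a length-5 sequence.

```python
def build_windows(rows, sort_keys, feature_keys, window=5):
    """Return {claim_id: list of `window` feature vectors}.

    Rows are sorted by `sort_keys`; each claim's sequence ends at the claim
    itself and is left-padded by repeating its earliest available row
    (the padding rule is an assumption).
    """
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in sort_keys))
    vectors = [[r[k] for k in feature_keys] for r in ordered]
    sequences = {}
    for i, row in enumerate(ordered):
        start = max(0, i - window + 1)
        seq = vectors[start : i + 1]
        padding = [seq[0]] * (window - len(seq))
        sequences[row["claim_id"]] = padding + seq
    return sequences


# Tiny synthetic example: 7 claims with one made-up feature each.
claims = [
    {"claim_id": cid, "year": 1994, "month": m, "avg_speed_kmph": float(s)}
    for cid, m, s in [(3, 2, 50), (1, 1, 40), (2, 1, 45), (5, 3, 70),
                      (4, 2, 60), (7, 4, 90), (6, 3, 80)]
]
windows = build_windows(
    claims,
    sort_keys=["year", "month", "claim_id"],
    feature_keys=["avg_speed_kmph"],
)
print(windows[1])  # first claim: padded with copies of itself
print(windows[7])  # last claim: itself plus the four claims before it
```

Each value is a 5 x n_features sequence, the shape an LSTM expects per sample once the sequences are stacked into a batch.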
Checked output shape:
data/temporal_embeddings.csv: 15,420 rows x 19 columns
The current temporal model is useful as a branch signal, but its checked standalone metrics are weak compared with the tabular model. This is expected because the source data has one row per claim, so the notebook constructs local sequence context from neighboring claims rather than true per-policy time series.
Notebook: notebooks/04_graph_model.ipynb
Input:
data/preprocessed.csv
Output:
data/graph_embeddings.csv
Model:
- PyTorch Geometric GCN
Main steps:
- Treats each claim as a graph node.
- Builds edges between claims that share important relationship attributes.
- Uses a bounded number of edges per group to avoid giant fully connected components.
- Trains a graph convolutional network with class-weighted loss.
- Saves 32 graph embedding dimensions and graph fraud probability.
Relationship columns used when available:
- make
- accidentarea
- fault
- policytype
- vehiclecategory
- basepolicy
- ageofvehicle
- pastnumberofclaims
- high_claim_history
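A bounded edge builder over shared attributes might look like the sketch below. It is not the notebook's implementation: `build_bounded_edges` and its `max_neighbors` parameter are hypothetical, and the capping rule (link each claim only to the next few claims in its group) is one assumed way to keep large groups from becoming fully connected.

```python
from collections import defaultdict


def build_bounded_edges(nodes, relation_keys, max_neighbors=2):
    """Link claims that share a value in any relation column.

    Within each group, a node is connected only to the next `max_neighbors`
    nodes in the group (an assumed capping rule), so a group of size n
    contributes at most n * max_neighbors edges instead of n * (n - 1) / 2.
    """
    edges = set()
    for key in relation_keys:
        groups = defaultdict(list)
        for node_id, attrs in nodes.items():
            groups[attrs[key]].append(node_id)
        for members in groups.values():
            members.sort()
            for i, src in enumerate(members):
                for dst in members[i + 1 : i + 1 + max_neighbors]:
                    edges.add((src, dst))
    return sorted(edges)


# Six synthetic claims; the values are made up for illustration.
nodes = {
    0: {"make": "honda", "fault": "policy holder"},
    1: {"make": "honda", "fault": "third party"},
    2: {"make": "honda", "fault": "policy holder"},
    3: {"make": "honda", "fault": "policy holder"},
    4: {"make": "toyota", "fault": "third party"},
    5: {"make": "honda", "fault": "policy holder"},
}
edges = build_bounded_edges(nodes, ["make", "fault"], max_neighbors=2)
print(len(edges), edges)
```

For PyTorch Geometric, an edge list like this would typically be converted to an edge_index tensor of shape [2, num_edges], with both directions included when the graph is undirected.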
Checked graph size:
nodes: 15,420
edges: 906,604
Checked output shape:
data/graph_embeddings.csv: 15,420 rows x 35 columns
Notebook: notebooks/05_fusion.ipynb
Inputs:
- data/tabular_embeddings.csv
- data/temporal_embeddings.csv
- data/graph_embeddings.csv
Output:
- The notebook trains and evaluates the final model in memory. It does not currently save a model artifact.
Model:
- PyTorch feed-forward DNN
Main steps:
- Merges all branch outputs on claim_id and fraudfound_p.
- Drops IDs, target columns, and raw-score leakage columns.
- Creates additional fusion features:
  - avg_prob
  - max_prob
  - min_prob
  - weighted_score
  - tab_x_graph
  - tab_x_temp
  - tab_graph_diff
  - tab_temp_diff
  - final_hint
- Scales fusion features.
- Trains a weighted binary classifier with early stopping on ROC-AUC.
- Evaluates default threshold behavior.
- Runs threshold tuning for:
  - precision-first selection under a minimum recall target
  - minimum business cost selection
  - optional focal-loss comparisons
Checked merged shape:
15,420 rows x 70 columns before dropping/feature expansion
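The 70-column figure is consistent with the three branch shapes reported earlier: each file carries the two join keys, and the merged table keeps them once. The second half of the sketch gives plausible definitions for some of the listed fusion features. Those formulas are assumptions, since the README names the features without defining them, and weighted_score and final_hint are left out for that reason.

```python
# Column counts reported for each branch file, join keys included.
branch_columns = {"tabular": 20, "temporal": 19, "graph": 35}
join_keys = ["claim_id", "fraudfound_p"]

# Each file contributes its non-key columns; the keys appear once.
merged_columns = len(join_keys) + sum(
    n - len(join_keys) for n in branch_columns.values()
)
print(merged_columns)


def fusion_features(tab_prob, temp_prob, graph_prob):
    """Assumed definitions for some of the listed fusion features."""
    probs = [tab_prob, temp_prob, graph_prob]
    return {
        "avg_prob": sum(probs) / len(probs),
        "max_prob": max(probs),
        "min_prob": min(probs),
        "tab_x_graph": tab_prob * graph_prob,
        "tab_x_temp": tab_prob * temp_prob,
        "tab_graph_diff": tab_prob - graph_prob,
        "tab_temp_diff": tab_prob - temp_prob,
    }


print(fusion_features(tab_prob=0.8, temp_prob=0.5, graph_prob=0.6))
```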
Representative checked final metrics:
Best ROC-AUC: 0.8291
Test Accuracy @0.5: 0.6848
Precision @0.5: 0.1406
Recall @0.5: 0.8324
F1 @0.5: 0.2406
PR-AUC: 0.2259
The threshold analysis shows why recall-focused tuning matters for fraud detection. For example, the checked run selected a precision-first threshold of 0.54 while maintaining recall at 0.80.
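Precision-first selection under a minimum recall target can be sketched as a simple grid search. The function names, the threshold grid, and the synthetic scores below are all illustrative; the notebook's own selector may differ in its details.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when claims with score >= threshold are flagged."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def precision_first_threshold(y_true, scores, min_recall, grid):
    """Pick the grid threshold with the best precision among those that
    keep recall >= min_recall; return None if no threshold qualifies."""
    best = None
    for t in grid:
        precision, recall = precision_recall_at(y_true, scores, t)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best


# Synthetic labels and scores, not taken from the notebook.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.55, 0.60, 0.70,
          0.75, 0.85, 0.95]
grid = [i / 100 for i in range(5, 100, 5)]
best = precision_first_threshold(y_true, scores, min_recall=0.8, grid=grid)
print(best)
```

On this toy data the selector settles on the highest threshold that still meets the recall floor, because raising the threshold further would drop a true fraud.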
Install the following before running the notebooks:
- Python 3.10 or newer
- pip
- Jupyter Notebook or JupyterLab
- A C/C++ build toolchain only if your platform cannot install prebuilt wheels for PyTorch, CatBoost, or PyTorch Geometric
The repository was inspected in an environment with Python 3.13.5, but some ML packages may have better wheel availability on Python 3.10, 3.11, or 3.12. If installation fails on Python 3.13, create a Python 3.11 environment and retry.
git clone git@github.com:hineni26/vehicle_fraud_detection.git
cd vehicle_fraud_detection
Using Python venv:
python3 -m venv .venv
source .venv/bin/activate
On Windows PowerShell:
py -m venv .venv
.\.venv\Scripts\Activate.ps1
Using Conda:
conda create -n vehicle-fraud python=3.11 -y
conda activate vehicle-fraud
Upgrade the packaging tools, then install the dependencies:
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
The dependency file includes:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- catboost
- torch
- torch-geometric
- imblearn
- ipykernel
Register the environment as a Jupyter kernel:
python -m ipykernel install --user --name vehicle-fraud --display-name "Vehicle Fraud Detection"
Then select the Vehicle Fraud Detection kernel inside Jupyter.
Start Jupyter:
jupyter lab
or:
jupyter notebook
Open the notebooks from the notebooks/ directory and run them in order.
torch-geometric depends on your Python version, PyTorch version, CPU/GPU setup, and operating system. If the normal requirements install fails, install PyTorch first from the official selector for your platform, then install PyTorch Geometric.
CPU-only example:
pip install torch
pip install torch-geometric
If you use CUDA, install the PyTorch build matching your CUDA version before installing torch-geometric.
After installation, verify the graph dependency:
python -c "import torch; import torch_geometric; print(torch.__version__, torch_geometric.__version__)"
Run the notebooks in this exact order:
1. notebooks/01_preprocessing.ipynb
2. notebooks/02_tabular_model.ipynb
3. notebooks/03_temporal_model.ipynb
4. notebooks/04_graph_model.ipynb
5. notebooks/05_fusion.ipynb
The dependency chain is:
fraud_oracle_with_telematics.csv
|
v
01_preprocessing.ipynb
|
v
preprocessed.csv
|
+--> 02_tabular_model.ipynb --> tabular_embeddings.csv
+--> 03_temporal_model.ipynb --> temporal_embeddings.csv
+--> 04_graph_model.ipynb --> graph_embeddings.csv
|
v
05_fusion.ipynb
You can rerun an individual branch notebook after preprocessing, but the fusion notebook requires all three embedding files to exist.
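A small pre-flight check can confirm that the three branch files exist before the fusion notebook is opened. This is a convenience sketch, not part of the repository; `missing_fusion_inputs` is a hypothetical helper and the default data directory is assumed to be data/ relative to the working directory.

```python
from pathlib import Path

REQUIRED_FOR_FUSION = [
    "tabular_embeddings.csv",
    "temporal_embeddings.csv",
    "graph_embeddings.csv",
]


def missing_fusion_inputs(data_dir="data"):
    """Return the branch files that are absent from `data_dir`."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FOR_FUSION if not (root / name).is_file()]


missing = missing_fusion_inputs("data")
if missing:
    print("Run the branch notebooks first; missing:", ", ".join(missing))
else:
    print("All fusion inputs are present.")
```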
The current project includes generated CSV files in data/ so the fusion workflow can be inspected immediately. If you rerun notebooks, these files may be overwritten:
- data/preprocessed.csv
- data/tabular_embeddings.csv
- data/temporal_embeddings.csv
- data/graph_embeddings.csv
The .gitignore also excludes common training outputs such as:
- notebooks/catboost_info/
- models/
- *.pkl
- *.joblib
- *.pt
- *.pth
- notebook checkpoints
The notebooks use fixed random seeds where practical:
- random_state=42 for train/test splits
- random_seed=42 for CatBoost
PyTorch models can still vary slightly across runs because of initialization, backend behavior, hardware, and package versions. For stricter reproducibility, add explicit numpy and torch seeds at the top of the PyTorch notebooks and configure deterministic PyTorch behavior.
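One way to apply that advice is a single seeding helper called at the top of each PyTorch notebook. The sketch below is an assumption about how this could be done, not code from the repository; `seed_everything` is a hypothetical name, and the optional imports let it run even where NumPy or PyTorch is not installed.

```python
import random


def seed_everything(seed=42):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed.

    Returns the names of the libraries that were seeded.
    """
    random.seed(seed)
    seeded = ["random"]

    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Prefer deterministic kernels; warn instead of raising when an
        # operation has no deterministic implementation.
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.benchmark = False
        seeded.append("torch")
    except ImportError:
        pass

    return seeded


print(seed_everything(42))
first_draw = random.random()
seed_everything(42)
print(first_draw == random.random())  # same seed, same draw
```

Even with these settings, results can still differ across hardware and package versions, so exact metric matches are not guaranteed.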
Because the fraud class is rare, use these metrics instead of relying on accuracy:
- ROC-AUC: broad ranking quality across thresholds.
- PR-AUC: more informative for rare fraud labels.
- Recall: percentage of actual frauds caught.
- Precision: percentage of flagged claims that are actually fraud.
- F1 score: balance between precision and recall.
- Confusion matrix: operational view of false alarms and missed frauds.
- Business cost: custom threshold selector in the fusion notebook using false-positive and false-negative costs.
In fraud screening, a lower decision threshold that raises recall may be preferred even if precision is low, because missing an actual fraud can be more expensive than reviewing a legitimate claim.
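The cost trade-off can be made explicit with a threshold search that minimizes total cost. The sketch below is illustrative: `min_cost_threshold` is a hypothetical function, and the cost values and synthetic scores are placeholders rather than figures from the fusion notebook.

```python
def min_cost_threshold(y_true, scores, fp_cost, fn_cost, grid):
    """Return (threshold, total_cost) minimizing
    fp_cost * false_positives + fn_cost * false_negatives."""
    best = None
    for t in grid:
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        cost = fp_cost * fp + fn_cost * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Synthetic labels and scores; the cost values are placeholders.
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.65, 0.80, 0.90]
grid = [i / 10 for i in range(1, 10)]

# A missed fraud costs 10x a manual review: a low threshold wins.
costly_miss = min_cost_threshold(y_true, scores, fp_cost=1, fn_cost=10, grid=grid)
# Equal costs: the selected threshold moves up.
equal_cost = min_cost_threshold(y_true, scores, fp_cost=1, fn_cost=1, grid=grid)
print(costly_miss, equal_cost)
```

The more expensive a missed fraud is relative to a review, the further the selected threshold drops, accepting more false alarms in exchange for fewer misses.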
If the python command is not found, use python3 instead:
python3 -m venv .venv
python3 -m pip install -r requirements.txt
If CatBoost fails to install, try upgrading packaging tools first:
python -m pip install --upgrade pip setuptools wheel
pip install catboost
The tabular notebook has a sklearn fallback, but CatBoost is recommended for the intended workflow.
If torch-geometric fails to install, install PyTorch and PyTorch Geometric in separate steps and verify compatible versions:
pip install torch
pip install torch-geometric
python -c "import torch_geometric; print('ok')"
If the kernel does not appear in Jupyter, register the environment as a kernel:
python -m ipykernel install --user --name vehicle-fraud --display-name "Vehicle Fraud Detection"
Restart Jupyter after registering the kernel.
If the fusion notebook cannot find the embedding files, run these notebooks first:
01_preprocessing.ipynb
02_tabular_model.ipynb
03_temporal_model.ipynb
04_graph_model.ipynb
Then rerun 05_fusion.ipynb.
Good next engineering improvements include:
- Save trained models and scalers to a models/ directory.
- Convert notebook logic into reusable Python modules.
- Add a command-line training pipeline.
- Use a validation split for threshold selection instead of selecting thresholds on the test split.
- Add model cards or experiment logs for each branch.
- Add SHAP or CatBoost feature importance reporting for the tabular model.
- Replace synthetic neighboring-claim sequences with true policyholder or vehicle-level time series if available.
- Add unit tests for preprocessing and data validation.
This project is licensed under the MIT License. See LICENSE for details.