Xueli Zhang, Wei Miao, Junhong Chu, Ivan Png
This repository contains the replication code for the paper “The Design of Centralized Matching Systems on Two-Sided Platforms: Evidence from the Ride-Hailing Market” by Xueli Zhang, Wei Miao, Junhong Chu, and Ivan Png, forthcoming in Marketing Science.
The raw data used in this study were provided by a taxi company under a non-disclosure agreement (NDA) and cannot be shared publicly. However, we document the data structure below to facilitate understanding of the replication code.
The project structure is as follows. All raw data are stored in
`data/raw data/`.
CDGDispatch/
├── code/ # R scripts
│ ├── 00-setup.R # Environment setup
│ ├── 01-clean_data.R # Data cleaning
│ ├── 02-estimate_matching_function.R
│ ├── 03-demand_estimation.R # BLP demand estimation
│ ├── 04-supply_estimation.R # Prepare data for Julia
│ ├── 05-counterfactual.R # Counterfactual data prep
│ ├── 06-figures_tables.R # Tables and figures
│ ├── main.R # Main entry point
│ └── utils/ # Helper functions
├── julia/ # Julia scripts
│ ├── main.jl # Main entry point
│ ├── data.jl # Data loading
│ ├── demand_update.jl # Demand side updates
│ ├── taxi_equilibrium.jl # Equilibrium solver
│ ├── estimation.jl # Supply estimation
│ ├── simulation.jl # Shift simulation
│ ├── counterfactual.jl # Counterfactual analysis
│ ├── figures.jl # Figure generation
│ ├── table.jl # Table generation
│ ├── utils.jl # Utilities
│ ├── shell/ # HPC job scripts
│ └── data_from_R/ # Data from R pipeline
├── data/
│ ├── raw data/ # Input data (not shared)
│ │ ├── street hails/ # Street-hail trip CSVs
│ │ ├── booking jobs/ # E-hail trip CSVs
│ │ ├── vehicle location fst/ # GPS data
│ │ ├── datamall_download/ # Public transport data
│ │ └── map_of_singapore/ # Shapefiles
│ ├── cleaned data/ # Processed data (generated)
│ └── interim data/ # Intermediate files
├── Project.toml # Julia dependencies
├── Manifest.toml # Julia dependency versions
└── readme.qmd # This file
Street-hail trips (data/raw data/street hails/*.csv):
| Column | Description |
|---|---|
| job_no | Unique trip identifier |
| vehicle_id | Vehicle identifier |
| driver_id | Driver identifier |
| pickup_postcode | Pickup location postal code |
| dest_postcode | Destination postal code |
| total_trip_fare | Trip fare in SGD |
| distance | Trip distance |
| trip_start_dt | Trip start datetime (DD/MM/YYYY HH:MM:SS) |
| trip_end_dt | Trip end datetime |
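The trip datetime columns use a day-first format, which is easy to mis-parse with month-first defaults. A minimal parsing sketch (Python for illustration only; the repository's own pipeline is in R):

```python
from datetime import datetime

# Trip datetimes are day-first, e.g. 9 April 2017, 18:30:00
fmt = "%d/%m/%Y %H:%M:%S"
ts = datetime.strptime("09/04/2017 18:30:00", fmt)
print(ts.isoformat())  # 2017-04-09T18:30:00
```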
Booking (e-hail) trips (data/raw data/booking jobs/*.csv):
| Column | Description |
|---|---|
| job_no | Unique trip identifier |
| rider_id | Rider identifier (some files) |
| booking_dt | Booking request datetime |
| booking_channel | Booking channel (app, phone, etc.) |
| product | Product type |
| req_pickup_dt | Requested pickup datetime |
| pickup_postcode | Pickup postal code |
| dest_postcode | Destination postal code |
| vehicle_id | Vehicle identifier |
| driver_id | Driver identifier |
| job_status | Job status (completed, failed, etc.) |
| total_trip_fare | Trip fare in SGD |
| distance | Trip distance |
| trip_start_dt | Trip start datetime |
| trip_end_dt | Trip end datetime |
Vehicle location logs (data/raw data/vehicle location fst/*.fst):
| Column | Description |
|---|---|
| vehicle_id | Vehicle identifier |
| log_dt | Log datetime |
| veh_long | Vehicle longitude |
| veh_lat | Vehicle latitude |
| veh_status | Vehicle status (FREE, POB, ONCALL, ARRIVED, NOSHOW, BUSY, BREAK, OFFLINE, POWEROFF, STC, PAYMENT) |
Driver master files (data/raw data/driver_master.csv,
driver_master_august.csv):
| Column | Description |
|---|---|
| driver_id | Driver identifier |
| work_since_dt | Date driver started working |
| driver_birth_dt | Driver birth date |
| driver_gender | Driver gender |
| driver_race | Driver race |
| driver_type | Driver type (full-time, part-time, etc.) |
Files in data/raw data/:
Postcode geocoordinates (Singapore_postcode_geocoordinates.csv):
| Column | Description |
|---|---|
| postal | Singapore postal code |
| longitude | Longitude coordinate |
| latitude | Latitude coordinate |
Singapore planning areas shapefile (map_of_singapore/MasterPlan/):
File: MP14_PLNG_AREA_WEB_PL.shp. Singapore Urban Redevelopment
Authority (URA) Master Plan 2014 planning area boundaries. Available
from data.gov.sg.
Files in data/raw data/datamall_download/. Data from Singapore Land
Transport Authority (LTA) DataMall, available at
datamall.lta.gov.sg.
Bus stop information (bus_stops_info.csv):
| Column | Description |
|---|---|
| BusStopCode | Bus stop code |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |
Train station information (mrtsg.csv):
| Column | Description |
|---|---|
| STN_NO | Station code |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |
Bus OD data (origin_destination_bus_2021{08,09,10}.csv):
| Column | Description |
|---|---|
| ORIGIN_PT_CODE | Origin bus stop code |
| DESTINATION_PT_CODE | Destination bus stop code |
| TIME_PER_HOUR | Hour of day |
| DAY_TYPE | Day type (WEEKDAY/WEEKEND) |
| TOTAL_TRIPS | Total trips |
Train OD data (origin_destination_train_2021{08,09,10}.csv):
| Column | Description |
|---|---|
| ORIGIN_PT_CODE | Origin station code |
| DESTINATION_PT_CODE | Destination station code |
| TIME_PER_HOUR | Hour of day |
| DAY_TYPE | Day type (WEEKDAY/WEEKEND) |
| TOTAL_TRIPS | Total trips |
This repository uses R for data cleaning / matching / demand
estimation, and Julia for supply estimation and counterfactual
analysis. The Julia dependencies are pinned via Project.toml and
Manifest.toml. The R scripts install required CRAN packages on demand
(see 00-setup.R).
Supply estimation (Julia) is computationally intensive and was run on the UCL Myriad HPC cluster using SGE job scheduling.
All code was tested in the following environments:
Local machine (R data preparation):
OS: macOS 26.2 (Build 25C56)
CPU: Intel(R) Xeon(R) W-3265M CPU @ 2.70GHz
RAM: 824633720832 bytes (~768 GiB)
R: 4.5.2 (2025-10-31)
Julia: 1.12.3
UCL Myriad HPC cluster (Julia supply estimation):
R: 4.4.2 (OpenBLAS build, module r/4.4.2-openblas/gnu-10.2.0)
Julia: 1.12.3
- Install R (tested on R 4.5.2 locally; R 4.4.2-OpenBLAS on UCL cluster).
- Install system dependencies needed by geospatial packages (notably
  sf and lwgeom). On macOS with Homebrew:

  ```shell
  brew install gdal geos proj pkg-config
  ```

  On UCL Myriad, load the R module instead:

  ```shell
  module load r/4.4.2-openblas/gnu-10.2.0
  ```

- Install Julia (tested on Julia 1.12.3).
- Instantiate the Julia project environment (uses the pinned
Manifest.toml):
```shell
cd /path/to/CDGDispatch
julia --project -e 'using Pkg; Pkg.instantiate()'
```

The R replication pipeline is orchestrated by code/main.R (entry
point). It sets up the R environment (00-setup.R), cleans the data
(01-clean_data.R), estimates the matching function
(02-estimate_matching_function.R), estimates the demand
(03-demand_estimation.R), prepares data for supply estimation
(04-supply_estimation.R), prepares data for counterfactual analysis
(05-counterfactual.R), and generates figures and tables
(06-figures_tables.R).
The Julia replication pipeline is orchestrated by julia/main.jl (entry
point).
To replicate the results in the paper, run in the following order:
```r
# load libraries and global variables
source(file.path("code", "00-setup.R"))
force_cache <- FALSE
```

- Installs and loads required R packages (if a package is missing, it
  calls `install.packages(..., dependencies = TRUE)`).
- Sets global options (e.g., `sf_use_s2(FALSE)`).
- Defines global constants used across scripts, including:
  - `data_end_date`: the last date of data to use. We choose 2017-04-09
    because a new pricing structure, flat fare, was implemented on
    2017-04-10.
  - `publicholiday`: public holidays in Singapore.
  - `period_interval`: the length of each period in minutes (5 minutes
    in this study).
  - `locations_excluded`: locations excluded from the analysis due to
    near-zero demand.
  - `airport_location_id`: the location IDs of the airports.
  - `work_period`: the work period in terms of 5-minute periods.
  - `work_period_h`: the work period in terms of hours.
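To illustrate how `period_interval` discretizes the day, the sketch below maps a timestamp to its 5-minute period of day (Python for illustration only; the `period_index` helper is hypothetical, not a function in the repository):

```python
from datetime import datetime

PERIOD_INTERVAL = 5  # minutes, matching the period_interval constant

def period_index(ts: datetime) -> int:
    """Return the 0-based 5-minute period of the day for a timestamp."""
    return (ts.hour * 60 + ts.minute) // PERIOD_INTERVAL

# A day has 24 * 60 / 5 = 288 periods
print(period_index(datetime(2017, 4, 9, 0, 3)))    # 0
print(period_index(datetime(2017, 4, 9, 18, 30)))  # 222
```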
Data preparation is implemented in code/01-clean_data.R and produces
the cleaned trip-level dataset used in later steps. All output files are
saved to data/cleaned data/ unless otherwise noted.
Run each step in order as follows.
```r
source("code/01-clean_data.R")
```

- `save_data_full()`: Load and combine raw street-hail and booking
  files, apply date filters (`data_end_date`), and save `data_full.fst`.
- `save_data_raw_final()`: Clean `data_full.fst` (imputations/filters)
  and save `data_raw_final.fst` plus cleaning diagnostics
  (`data_cns.fst`, `data_cleaning_process.fst`).
- `save_data_raw_final_with_location_id()`: Assign pickup/destination
  `location_id` (postcode geocoding + GPS + masterplan) and save
  `data_raw_final_with_location_id.fst`.
- `save_data_raw_final_with_location_id_with_shift_id()`: Construct
  driver shifts (`shift_id`, 6-hour gap rule) and save the result as an
  `.fst` file with shift IDs.
- `save_data_trip_final()`: Create the analysis trip dataset (timing
  variables, work-period restriction, weekday/holiday filters) and save
  `data_trip_final.fst`.
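The 6-hour gap rule for shift construction can be sketched as follows (an illustrative Python sketch, not the repository's R implementation; whether the gap is measured from the previous trip's end or start is a detail of the actual code):

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=6)  # a gap longer than this starts a new shift

def assign_shift_ids(trip_starts):
    """Assign a shift ID to each trip of one driver: a new shift begins
    whenever the gap since the previous trip exceeds 6 hours.
    `trip_starts` must be sorted chronologically."""
    shift_ids, shift = [], 0
    for prev, cur in zip([None] + trip_starts[:-1], trip_starts):
        if prev is not None and cur - prev > MAX_GAP:
            shift += 1
        shift_ids.append(shift)
    return shift_ids

starts = [datetime(2017, 4, 1, 8), datetime(2017, 4, 1, 10),
          datetime(2017, 4, 1, 19), datetime(2017, 4, 2, 7)]
print(assign_shift_ids(starts))  # [0, 0, 1, 2]
```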
Matching function estimation is implemented in
code/02-estimate_matching_function.R. It combines (i) GPS-based taxi
availability and (ii) trip counts from data_trip_final.fst to recover
matching parameters by location/time.
```r
source("code/02-estimate_matching_function.R")
```

- `gen_veh_status_time_longlat()`: Process GPS logs into time-location
  taxi availability and save `veh_status_time_longlat.fst`.
- `prepare_data_for_matching()`: Construct the estimation panel and save
  `data_for_matching.fst`.
- `estimate_matching_function()`: Estimate matching parameters (Section
  5.1, Table 5 in the paper) and save
  `recovered_lambdas_time_location.fst`, `alphas_region.csv`, and
  `recovered_alpha.rds`.
Demand estimation is implemented in code/03-demand_estimation.R (BLP
estimation via BLPestimatoR package).
```r
source("code/03-demand_estimation.R")
```

- `get_street_hail_avg_waiting()`: Compute empirical average street-hail
  waiting time and save `street_trips_average_waiting.t.o.fst`.
- `get_public_transports_info()`: Build public-transport OD proportions
  and save `train_proportions_od.fst` and `bus_proportions_od.fst`.
- `get_taxi_trips_info()`: Build taxi OD matrices and save
  `grid_data_mean.pickup_t.o.fst`, `db_mean.fare_t.o.d.fst`,
  `db_trip_count_t.o.d.fst`, `db_trip_dropoff_t.d.fst`, and
  `db_mean.distance.duration_t.o.d.fst`.
- `get_failed_bookings_by_pickup_dest()`: Build failed-booking counts
  and save `failed_booking.t.o.d.fst`.
- `save_neighbor_matrix()`: Build adjacency and ID mapping and save
  `neighbor_matrix.rds`, `location_id_mapping.rds`, and
  `consolidated_masterplan.rds`.
- `prepare_data_for_demand()`: Assemble the demand-estimation dataset
  (uses `recovered_lambdas_time_location.fst`) and save
  `data_for_demand.fst`.
- `demand_estimation_BLP()`: Run BLP estimation (Table 6 in the paper)
  and save `data_for_BLP.rds`, `trips_data.rds`, and `demand_est.rds`
  to `BLP results/`.
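The OD-proportion construction amounts to normalizing trip counts by origin: for each origin, the share of trips ending at each destination. An illustrative sketch (Python, not the repository's R code):

```python
from collections import defaultdict

def od_proportions(trips):
    """Convert OD trip counts into row proportions: for each origin,
    the share of trips ending at each destination."""
    totals = defaultdict(float)
    for (o, d), n in trips.items():
        totals[o] += n
    return {(o, d): n / totals[o] for (o, d), n in trips.items()}

counts = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 5}
props = od_proportions(counts)
print(props[("A", "B")])  # 0.75
```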
```r
source("code/04-supply_estimation.R")
```

- `prepare_data_for_supply()`: Prepare Julia inputs for supply
  estimation and save to `julia/data_from_R/`.
```r
source("code/05-counterfactual.R")
```

- `get_booking_coordinates()`: Attach driver/rider coordinates at
  booking time and save `booking_trips_coordinates.fst`.
- `get_around_drivers()`: For each booking, collect nearby available
  drivers and save per-date `.fst` files to `around drivers/`.
- `counterfactual_nearest_driver()`: Compute nearest-driver line
  distance by location-period and save `nearest_driver_distance_date.fst`
  and `trips_pickup_distance.fst`.
- `counterfactual_estimate_pickupdistance()`: Calibrate line distance to
  driving distance/time and save `speed_pickupdistance_allocation_ot.fst`.
- `prepare_data_for_counterfactual()`: Build Julia pickup-time/cost
  arrays for counterfactuals and save
  `supply_estimation_data_counterfactual.rds`.
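The nearest-driver "line distance" is a great-circle distance between GPS points, which is then calibrated to driving distance and time. An illustrative sketch (Python for illustration only; the helper names are hypothetical):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ('line') distance in km between two GPS points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_driver_km(rider, drivers):
    """Distance from a (lat, lon) rider to the closest available driver."""
    return min(haversine_km(rider[0], rider[1], d[0], d[1]) for d in drivers)

# Rider in central Singapore, two candidate drivers nearby
d = nearest_driver_km((1.3521, 103.8198), [(1.3521, 103.8298), (1.40, 103.90)])
```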
```r
source("code/06-figures_tables.R")
```

- `table_summary_statistics()`: Generate summary statistics (Table 1 in
  the paper).
- `table_matching_inefficiency()`: Display recovered
  matching-inefficiency parameters by region (Table 5 in the paper).
After the R pipeline completes, run the Julia supply estimation on an HPC cluster. The estimation uses a grid search with 64 different starting values (array job).
Submit the array job using the shell script:

```shell
qsub julia/shell/main_julia.sh
```

The shell script (`julia/shell/main_julia.sh`) is configured for SGE
clusters with:

- a 72-hour runtime limit and 40 GB of memory per task
- an array job with 64 tasks (`-t 1-64`)
- each task running `julia --project julia/main.jl "outer-iter-150" 1 false`
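A common way an SGE array job maps `SGE_TASK_ID` to one of 64 starting values is to index into a pre-built grid; an illustrative sketch (Python; the grid values here are made up, and the actual starting values are defined in the Julia code):

```python
import itertools
import os

# A hypothetical 4 x 4 x 4 = 64-point grid of starting values
grid = list(itertools.product([0.1, 0.5, 1.0, 2.0], repeat=3))

# SGE task IDs are 1-based, so subtract 1 when indexing
task_id = int(os.environ.get("SGE_TASK_ID", "1"))
start = grid[task_id - 1]
print(len(grid), start)
```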
After all 64 estimation tasks complete, run the post-estimation analysis to:
- Select the best parameter estimates across starting values
- Compute standard errors (Table 8, Panel A)
- Generate model fit comparison (Table 8, Panel B)
- Plot convergence figures (Figure 3)
- Run counterfactual analysis (Table 9)
- Generate appendix figures
```shell
julia --project julia/main.jl "<JOB_ID>_outer-iter-150" 8 true
```

Replace `<JOB_ID>` with the job ID assigned by the HPC cluster (e.g.,
138534).