Xueli Zhang, Wei Miao, Junhong Chu, Ivan Png
This repository contains the replication code for the paper “The Design of Centralized Matching Systems on Two-Sided Platforms: Evidence from the Ride-Hailing Market” by Xueli Zhang, Wei Miao, Junhong Chu, and Ivan Png, forthcoming in Marketing Science.
The raw data used in this study were provided by a taxi company under a non-disclosure agreement (NDA) and cannot be shared publicly. However, we document the data structure below to facilitate understanding of the replication code.
The project structure is as follows. All raw data are stored in
`data/raw data/`.
CDGDispatch/
├── code/ # R scripts
│ ├── 00-setup.R # Environment setup
│ ├── 01-clean_data.R # Data cleaning
│ ├── 02-estimate_matching_function.R
│ ├── 03-demand_estimation.R # BLP demand estimation
│ ├── 04-supply_estimation.R # Prepare data for Julia
│ ├── 05-counterfactual.R # Counterfactual data prep
│ ├── 06-figures_tables.R # Tables and figures
│ ├── main.R # Main entry point
│ └── utils/ # Helper functions
├── julia/ # Julia scripts
│ ├── main.jl # Main entry point
│ ├── data.jl # Data loading
│ ├── demand_update.jl # Demand side updates
│ ├── taxi_equilibrium.jl # Equilibrium solver
│ ├── estimation.jl # Supply estimation
│ ├── simulation.jl # Shift simulation
│ ├── counterfactual.jl # Counterfactual analysis
│ ├── figures.jl # Figure generation
│ ├── table.jl # Table generation
│ ├── utils.jl # Utilities
│ ├── shell/ # HPC job scripts
│ └── data_from_R/ # Data from R pipeline
├── data/
│ ├── raw data/ # Input data (not shared)
│ │ ├── street hails/ # Street-hail trip CSVs
│ │ ├── booking jobs/ # E-hail trip CSVs
│ │ ├── vehicle location fst/ # GPS data
│ │ ├── datamall_download/ # Public transport data
│ │ └── map_of_singapore/ # Shapefiles
│ ├── cleaned data/ # Processed data (generated)
│ └── interim data/ # Intermediate files
├── Project.toml # Julia dependencies
├── Manifest.toml # Julia dependency versions
└── readme.qmd # This file
Street-hail trips (data/raw data/street hails/*.csv):
| Column | Description |
|---|---|
| job_no | Unique trip identifier |
| vehicle_id | Vehicle identifier |
| driver_id | Driver identifier |
| pickup_postcode | Pickup location postal code |
| dest_postcode | Destination postal code |
| total_trip_fare | Trip fare in SGD |
| distance | Trip distance |
| trip_start_dt | Trip start datetime (DD/MM/YYYY HH:MM:SS) |
| trip_end_dt | Trip end datetime |
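The trip datetime columns use a day-first format, which is easy to mis-parse with month-first defaults. A minimal parsing sketch (Python for illustration only; the repository's own pipeline is in R):

```python
from datetime import datetime

# Trip datetimes are day-first, e.g. 9 April 2017, 18:30:00
fmt = "%d/%m/%Y %H:%M:%S"
ts = datetime.strptime("09/04/2017 18:30:00", fmt)
print(ts.isoformat())  # 2017-04-09T18:30:00
```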
Booking (e-hail) trips (data/raw data/booking jobs/*.csv):
| Column | Description |
|---|---|
| job_no | Unique trip identifier |
| rider_id | Rider identifier (some files) |
| booking_dt | Booking request datetime |
| booking_channel | Booking channel (app, phone, etc.) |
| product | Product type |
| req_pickup_dt | Requested pickup datetime |
| pickup_postcode | Pickup postal code |
| dest_postcode | Destination postal code |
| vehicle_id | Vehicle identifier |
| driver_id | Driver identifier |
| job_status | Job status (completed, failed, etc.) |
| total_trip_fare | Trip fare in SGD |
| distance | Trip distance |
| trip_start_dt | Trip start datetime |
| trip_end_dt | Trip end datetime |
Vehicle location logs (data/raw data/vehicle location fst/*.fst):
| Column | Description |
|---|---|
| vehicle_id | Vehicle identifier |
| log_dt | Log datetime |
| veh_long | Vehicle longitude |
| veh_lat | Vehicle latitude |
| veh_status | Vehicle status (FREE, POB, ONCALL, ARRIVED, NOSHOW, BUSY, BREAK, OFFLINE, POWEROFF, STC, PAYMENT) |
Driver master files (data/raw data/driver_master.csv,
driver_master_august.csv):
| Column | Description |
|---|---|
| driver_id | Driver identifier |
| work_since_dt | Date driver started working |
| driver_birth_dt | Driver birth date |
| driver_gender | Driver gender |
| driver_race | Driver race |
| driver_type | Driver type (full-time, part-time, etc.) |
Files in data/raw data/:
Postcode geocoordinates (Singapore_postcode_geocoordinates.csv):
| Column | Description |
|---|---|
| postal | Singapore postal code |
| longitude | Longitude coordinate |
| latitude | Latitude coordinate |
Singapore planning areas shapefile (map_of_singapore/MasterPlan/):
File: MP14_PLNG_AREA_WEB_PL.shp. Singapore Urban Redevelopment
Authority (URA) Master Plan 2014 planning area boundaries. Available
from data.gov.sg.
Files in data/raw data/datamall_download/. Data from Singapore Land
Transport Authority (LTA) DataMall, available at
datamall.lta.gov.sg.
Bus stop information (bus_stops_info.csv):
| Column | Description |
|---|---|
| BusStopCode | Bus stop code |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |
Train station information (mrtsg.csv):
| Column | Description |
|---|---|
| STN_NO | Station code |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |
Bus OD data (origin_destination_bus_2021{08,09,10}.csv):
| Column | Description |
|---|---|
| ORIGIN_PT_CODE | Origin bus stop code |
| DESTINATION_PT_CODE | Destination bus stop code |
| TIME_PER_HOUR | Hour of day |
| DAY_TYPE | Day type (WEEKDAY/WEEKEND) |
| TOTAL_TRIPS | Total trips |
Train OD data (origin_destination_train_2021{08,09,10}.csv):
| Column | Description |
|---|---|
| ORIGIN_PT_CODE | Origin station code |
| DESTINATION_PT_CODE | Destination station code |
| TIME_PER_HOUR | Hour of day |
| DAY_TYPE | Day type (WEEKDAY/WEEKEND) |
| TOTAL_TRIPS | Total trips |
This repository uses R for data cleaning / matching / demand
estimation, and Julia for supply estimation and counterfactual
analysis. The Julia dependencies are pinned via Project.toml and
Manifest.toml. The R scripts install required CRAN packages on demand
(see 00-setup.R).
Supply estimation (Julia) is computationally intensive and was run on the UCL Myriad HPC cluster using SGE job scheduling.
All code was tested in the following environments:
Local machine (R data preparation):
OS: macOS 26.2 (Build 25C56)
CPU: Intel(R) Xeon(R) W-3265M CPU @ 2.70GHz
RAM: 824633720832 bytes (~768 GiB)
R: 4.5.2 (2025-10-31)
Julia: 1.12.3
UCL Myriad HPC cluster (Julia supply estimation):
R: 4.4.2 (OpenBLAS build, module r/4.4.2-openblas/gnu-10.2.0)
Julia: 1.12.3
- Install R (tested on R 4.5.2 locally; R 4.4.2-OpenBLAS on UCL cluster).
- Install system dependencies needed by geospatial packages (notably
  sf and lwgeom). On macOS with Homebrew:

  ```shell
  brew install gdal geos proj pkg-config
  ```

  On UCL Myriad, load the R module instead:

  ```shell
  module load r/4.4.2-openblas/gnu-10.2.0
  ```

- Install Julia (tested on Julia 1.12.3).
- Instantiate the Julia project environment (uses the pinned
Manifest.toml):
```shell
cd /path/to/CDGDispatch
julia --project -e 'using Pkg; Pkg.instantiate()'
```

The R replication pipeline is orchestrated by code/main.R (entry
point). It sets up the R environment (00-setup.R), cleans the data
(01-clean_data.R), estimates the matching function
(02-estimate_matching_function.R), estimates the demand
(03-demand_estimation.R), prepares data for supply estimation
(04-supply_estimation.R), prepares data for counterfactual analysis
(05-counterfactual.R), and generates figures and tables
(06-figures_tables.R).
The Julia replication pipeline is orchestrated by julia/main.jl (entry
point).
To replicate the results in the paper, run in the following order:
```r
# load libraries and global variables
source(file.path("code", "00-setup.R"))
force_cache <- FALSE
```

- Installs and loads required R packages (if a package is missing, it
  calls `install.packages(..., dependencies = TRUE)`).
- Sets global options (e.g., `sf_use_s2(FALSE)`).
- Defines global constants used across scripts, including:
  - `data_end_date`: the last date of data to use. We choose 2017-04-09
    because a new pricing structure, flat fare, was implemented on
    2017-04-10.
  - `publicholiday`: public holidays in Singapore.
  - `period_interval`: the length of each period in minutes (5 minutes
    in this study).
  - `locations_excluded`: locations excluded from the analysis due to
    near-zero demand.
  - `airport_location_id`: the location IDs of the airports.
  - `work_period`: the work period in terms of 5-minute periods.
  - `work_period_h`: the work period in terms of hours.
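To illustrate how `period_interval` discretizes the day, the sketch below maps a timestamp to its 5-minute period of day (Python for illustration only; the `period_index` helper is hypothetical, not a function in the repository):

```python
from datetime import datetime

PERIOD_INTERVAL = 5  # minutes, matching the period_interval constant

def period_index(ts: datetime) -> int:
    """Return the 0-based 5-minute period of the day for a timestamp."""
    return (ts.hour * 60 + ts.minute) // PERIOD_INTERVAL

# A day has 24 * 60 / 5 = 288 periods
print(period_index(datetime(2017, 4, 9, 0, 3)))    # 0
print(period_index(datetime(2017, 4, 9, 18, 30)))  # 222
```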
Data preparation is implemented in code/01-clean_data.R and produces
the cleaned trip-level dataset used in later steps. All output files are
saved to data/cleaned data/ unless otherwise noted.
Run each step in order as follows.
```r
source("code/01-clean_data.R")
```

- `save_data_full()`: Load and combine raw street-hail and booking
  files, apply date filters (`data_end_date`), and save `data_full.fst`.
- `save_data_raw_final()`: Clean `data_full.fst` (imputations/filters)
  and save `data_raw_final.fst` plus cleaning diagnostics
  (`data_cns.fst`, `data_cleaning_process.fst`).
- `save_data_raw_final_with_location_id()`: Assign pickup/destination
  `location_id` (postcode geocoding + GPS + masterplan) and save
  `data_raw_final_with_location_id.fst`.
- `save_data_raw_final_with_location_id_with_shift_id()`: Construct
  driver shifts (`shift_id`, 6-hour gap rule) and save the result as an
  `.fst` file with shift IDs.
- `save_data_trip_final()`: Create the analysis trip dataset (timing
  variables, work-period restriction, weekday/holiday filters) and save
  `data_trip_final.fst`.
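The 6-hour gap rule for shift construction can be sketched as follows (an illustrative Python sketch, not the repository's R implementation; whether the gap is measured from the previous trip's end or start is a detail of the actual code):

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=6)  # a gap longer than this starts a new shift

def assign_shift_ids(trip_starts):
    """Assign a shift ID to each trip of one driver: a new shift begins
    whenever the gap since the previous trip exceeds 6 hours.
    `trip_starts` must be sorted chronologically."""
    shift_ids, shift = [], 0
    for prev, cur in zip([None] + trip_starts[:-1], trip_starts):
        if prev is not None and cur - prev > MAX_GAP:
            shift += 1
        shift_ids.append(shift)
    return shift_ids

starts = [datetime(2017, 4, 1, 8), datetime(2017, 4, 1, 10),
          datetime(2017, 4, 1, 19), datetime(2017, 4, 2, 7)]
print(assign_shift_ids(starts))  # [0, 0, 1, 2]
```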
Matching function estimation is implemented in
code/02-estimate_matching_function.R. It combines (i) GPS-based taxi
availability and (ii) trip counts from data_trip_final.fst to recover
matching parameters by location/time.
```r
source("code/02-estimate_matching_function.R")
```

- `gen_veh_status_time_longlat()`: Process GPS logs into time-location
  taxi availability and save `veh_status_time_longlat.fst`.
- `prepare_data_for_matching()`: Construct the estimation panel and save
  `data_for_matching.fst`.
- `estimate_matching_function()`: Estimate matching parameters (Section
  5.1, Table 5 in the paper) and save
  `recovered_lambdas_time_location.fst`, `alphas_region.csv`, and
  `recovered_alpha.rds`.
Demand estimation is implemented in code/03-demand_estimation.R (BLP
estimation via BLPestimatoR package).
```r
source("code/03-demand_estimation.R")
```

- `get_street_hail_avg_waiting()`: Compute empirical average street-hail
  waiting time and save `street_trips_average_waiting.t.o.fst`.
- `get_public_transports_info()`: Build public-transport OD proportions
  and save `train_proportions_od.fst` and `bus_proportions_od.fst`.
- `get_taxi_trips_info()`: Build taxi OD matrices and save
  `grid_data_mean.pickup_t.o.fst`, `db_mean.fare_t.o.d.fst`,
  `db_trip_count_t.o.d.fst`, `db_trip_dropoff_t.d.fst`, and
  `db_mean.distance.duration_t.o.d.fst`.
- `get_failed_bookings_by_pickup_dest()`: Build failed-booking counts
  and save `failed_booking.t.o.d.fst`.
- `save_neighbor_matrix()`: Build adjacency and ID mapping and save
  `neighbor_matrix.rds`, `location_id_mapping.rds`, and
  `consolidated_masterplan.rds`.
- `prepare_data_for_demand()`: Assemble the demand-estimation dataset
  (uses `recovered_lambdas_time_location.fst`) and save
  `data_for_demand.fst`.
- `demand_estimation_BLP()`: Run BLP estimation (Table 6 in the paper)
  and save `data_for_BLP.rds`, `trips_data.rds`, and `demand_est.rds`
  to `BLP results/`.
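The OD-proportion construction amounts to normalizing trip counts by origin: for each origin, the share of trips ending at each destination. An illustrative sketch (Python, not the repository's R code):

```python
from collections import defaultdict

def od_proportions(trips):
    """Convert OD trip counts into row proportions: for each origin,
    the share of trips ending at each destination."""
    totals = defaultdict(float)
    for (o, d), n in trips.items():
        totals[o] += n
    return {(o, d): n / totals[o] for (o, d), n in trips.items()}

counts = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 5}
props = od_proportions(counts)
print(props[("A", "B")])  # 0.75
```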
```r
source("code/04-supply_estimation.R")
```

- `prepare_data_for_supply()`: Prepare Julia inputs for supply
  estimation and save to `julia/data_from_R/`.
```r
source("code/05-counterfactual.R")
```

- `get_booking_coordinates()`: Attach driver/rider coordinates at
  booking time and save `booking_trips_coordinates.fst`.
- `get_around_drivers()`: For each booking, collect nearby available
  drivers and save per-date `.fst` files to `around drivers/`.
- `counterfactual_nearest_driver()`: Compute nearest-driver line
  distance by location-period and save `nearest_driver_distance_date.fst`
  and `trips_pickup_distance.fst`.
- `counterfactual_estimate_pickupdistance()`: Calibrate line distance to
  driving distance/time and save `speed_pickupdistance_allocation_ot.fst`.
- `prepare_data_for_counterfactual()`: Build Julia pickup-time/cost
  arrays for counterfactuals and save
  `supply_estimation_data_counterfactual.rds`.
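The nearest-driver "line distance" is a great-circle distance between GPS points, which is then calibrated to driving distance and time. An illustrative sketch (Python for illustration only; the helper names are hypothetical):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ('line') distance in km between two GPS points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_driver_km(rider, drivers):
    """Distance from a (lat, lon) rider to the closest available driver."""
    return min(haversine_km(rider[0], rider[1], d[0], d[1]) for d in drivers)

# Rider in central Singapore, two candidate drivers nearby
d = nearest_driver_km((1.3521, 103.8198), [(1.3521, 103.8298), (1.40, 103.90)])
```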
```r
source("code/06-figures_tables.R")
```

- `table_summary_statistics()`: Generate summary statistics (Table 1 in
  the paper).
- `table_matching_inefficiency()`: Display recovered
  matching-inefficiency parameters by region (Table 5 in the paper).
After the R pipeline completes, run the Julia supply estimation on an HPC cluster. The estimation uses a grid search with 64 different starting values (array job).
Submit the array job using the shell script:

```shell
qsub julia/shell/main_julia.sh
```

The shell script (`julia/shell/main_julia.sh`) is configured for SGE
clusters with:

- a 72-hour runtime limit and 40 GB of memory per task
- an array job with 64 tasks (`-t 1-64`)
- each task running `julia --project julia/main.jl "outer-iter-150" 1 false`
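A common way an SGE array job maps `SGE_TASK_ID` to one of 64 starting values is to index into a pre-built grid; an illustrative sketch (Python; the grid values here are made up, and the actual starting values are defined in the Julia code):

```python
import itertools
import os

# A hypothetical 4 x 4 x 4 = 64-point grid of starting values
grid = list(itertools.product([0.1, 0.5, 1.0, 2.0], repeat=3))

# SGE task IDs are 1-based, so subtract 1 when indexing
task_id = int(os.environ.get("SGE_TASK_ID", "1"))
start = grid[task_id - 1]
print(len(grid), start)
```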
After all 64 estimation tasks complete, run the post-estimation analysis to:
- Select the best parameter estimates across starting values
- Compute standard errors (Table 8, Panel A)
- Generate model fit comparison (Table 8, Panel B)
- Plot convergence figures (Figure 3)
- Run counterfactual analysis (Table 9)
- Generate appendix figures
```shell
julia --project julia/main.jl "<JOB_ID>_outer-iter-150" 8 true
```

Replace `<JOB_ID>` with the job ID assigned by the HPC cluster (e.g.,
138534).