weim-mkt/mksc-replication-optimal-design-dispatch-system

Replication Instructions

Xueli Zhang, Wei Miao, Junhong Chu, Ivan Png

This repository contains the replication code for the paper “The Design of Centralized Matching Systems on Two-Sided Platforms: Evidence from the Ride-Hailing Market” by Xueli Zhang, Wei Miao, Junhong Chu, and Ivan Png, forthcoming in Marketing Science.

Data Availability

The raw data used in this study were provided by a taxi company under a non-disclosure agreement (NDA) and cannot be shared publicly. However, we document the data structure below to facilitate understanding of the replication code.

The project structure is as follows. All raw data is stored in data/raw data/.

CDGDispatch/
├── code/                          # R scripts
│   ├── 00-setup.R                 # Environment setup
│   ├── 01-clean_data.R            # Data cleaning
│   ├── 02-estimate_matching_function.R
│   ├── 03-demand_estimation.R     # BLP demand estimation
│   ├── 04-supply_estimation.R     # Prepare data for Julia
│   ├── 05-counterfactual.R        # Counterfactual data prep
│   ├── 06-figures_tables.R        # Tables and figures
│   ├── main.R                     # Main entry point
│   └── utils/                     # Helper functions
├── julia/                         # Julia scripts
│   ├── main.jl                    # Main entry point
│   ├── data.jl                    # Data loading
│   ├── demand_update.jl           # Demand side updates
│   ├── taxi_equilibrium.jl        # Equilibrium solver
│   ├── estimation.jl              # Supply estimation
│   ├── simulation.jl              # Shift simulation
│   ├── counterfactual.jl          # Counterfactual analysis
│   ├── figures.jl                 # Figure generation
│   ├── table.jl                   # Table generation
│   ├── utils.jl                   # Utilities
│   ├── shell/                     # HPC job scripts
│   └── data_from_R/               # Data from R pipeline
├── data/
│   ├── raw data/                  # Input data (not shared)
│   │   ├── street hails/          # Street-hail trip CSVs
│   │   ├── booking jobs/          # E-hail trip CSVs
│   │   ├── vehicle location fst/  # GPS data
│   │   ├── datamall_download/     # Public transport data
│   │   └── map_of_singapore/      # Shapefiles
│   ├── cleaned data/              # Processed data (generated)
│   └── interim data/              # Intermediate files
├── Project.toml                   # Julia dependencies
├── Manifest.toml                  # Julia dependency versions
└── readme.qmd                     # This file

Trip data (proprietary)

Street-hail trips (data/raw data/street hails/*.csv):

| Column | Description |
| --- | --- |
| job_no | Unique trip identifier |
| vehicle_id | Vehicle identifier |
| driver_id | Driver identifier |
| pickup_postcode | Pickup location postal code |
| dest_postcode | Destination postal code |
| total_trip_fare | Trip fare in SGD |
| distance | Trip distance |
| trip_start_dt | Trip start datetime (DD/MM/YYYY HH:MM:SS) |
| trip_end_dt | Trip end datetime |

Booking (e-hail) trips (data/raw data/booking jobs/*.csv):

| Column | Description |
| --- | --- |
| job_no | Unique trip identifier |
| rider_id | Rider identifier (some files) |
| booking_dt | Booking request datetime |
| booking_channel | Booking channel (app, phone, etc.) |
| product | Product type |
| req_pickup_dt | Requested pickup datetime |
| pickup_postcode | Pickup postal code |
| dest_postcode | Destination postal code |
| vehicle_id | Vehicle identifier |
| driver_id | Driver identifier |
| job_status | Job status (completed, failed, etc.) |
| total_trip_fare | Trip fare in SGD |
| distance | Trip distance |
| trip_start_dt | Trip start datetime |
| trip_end_dt | Trip end datetime |

Vehicle GPS data (proprietary)

Vehicle location logs (data/raw data/vehicle location fst/*.fst):

| Column | Description |
| --- | --- |
| vehicle_id | Vehicle identifier |
| log_dt | Log datetime |
| veh_long | Vehicle longitude |
| veh_lat | Vehicle latitude |
| veh_status | Vehicle status (FREE, POB, ONCALL, ARRIVED, NOSHOW, BUSY, BREAK, OFFLINE, POWEROFF, STC, PAYMENT) |

Driver demographics (proprietary)

Driver master files (data/raw data/driver_master.csv, driver_master_august.csv):

| Column | Description |
| --- | --- |
| driver_id | Driver identifier |
| work_since_dt | Date driver started working |
| driver_birth_dt | Driver birth date |
| driver_gender | Driver gender |
| driver_race | Driver race |
| driver_type | Driver type (full-time, part-time, etc.) |

Geographic data

Files in data/raw data/:

Postcode geocoordinates (Singapore_postcode_geocoordinates.csv):

| Column | Description |
| --- | --- |
| postal | Singapore postal code |
| longitude | Longitude coordinate |
| latitude | Latitude coordinate |

Singapore planning areas shapefile (map_of_singapore/MasterPlan/):

File: MP14_PLNG_AREA_WEB_PL.shp. Singapore Urban Redevelopment Authority (URA) Master Plan 2014 planning area boundaries. Available from data.gov.sg.

Public transport data (from LTA DataMall)

Files in data/raw data/datamall_download/. Data from Singapore Land Transport Authority (LTA) DataMall, available at datamall.lta.gov.sg.

Bus stop information (bus_stops_info.csv):

| Column | Description |
| --- | --- |
| BusStopCode | Bus stop code |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |

Train station information (mrtsg.csv):

| Column | Description |
| --- | --- |
| STN_NO | Station code |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |

Bus OD data (origin_destination_bus_2021{08,09,10}.csv):

| Column | Description |
| --- | --- |
| ORIGIN_PT_CODE | Origin bus stop code |
| DESTINATION_PT_CODE | Destination bus stop code |
| TIME_PER_HOUR | Hour of day |
| DAY_TYPE | Day type (WEEKDAY/WEEKEND) |
| TOTAL_TRIPS | Total trips |

Train OD data (origin_destination_train_2021{08,09,10}.csv):

| Column | Description |
| --- | --- |
| ORIGIN_PT_CODE | Origin station code |
| DESTINATION_PT_CODE | Destination station code |
| TIME_PER_HOUR | Hour of day |
| DAY_TYPE | Day type (WEEKDAY/WEEKEND) |
| TOTAL_TRIPS | Total trips |

Setup

This repository uses R for data cleaning / matching / demand estimation, and Julia for supply estimation and counterfactual analysis. The Julia dependencies are pinned via Project.toml and Manifest.toml. The R scripts install required CRAN packages on demand (see 00-setup.R).

Supply estimation (Julia) is computationally intensive and was run on the UCL Myriad HPC cluster using SGE job scheduling.

All code was tested in the following environments:

Local machine (R data preparation):

OS: macOS 26.2 (Build 25C56)
CPU: Intel(R) Xeon(R) W-3265M CPU @ 2.70GHz
RAM: 824633720832 bytes (~768 GiB)
R: 4.5.2 (2025-10-31)
Julia: 1.12.3

UCL Myriad HPC cluster (Julia supply estimation):

R: 4.4.2 (OpenBLAS build, module r/4.4.2-openblas/gnu-10.2.0)
Julia: 1.12.3

R setup

  1. Install R (tested on R 4.5.2 locally; R 4.4.2-OpenBLAS on UCL cluster).
  2. Install system dependencies needed by geospatial packages (notably sf, lwgeom).

On macOS with Homebrew:

brew install gdal geos proj pkg-config

On UCL Myriad, load the R module:

module load r/4.4.2-openblas/gnu-10.2.0

Julia setup

  1. Install Julia (tested on Julia 1.12.3).
  2. Instantiate the Julia project environment (uses the pinned Manifest.toml):
cd /path/to/CDGDispatch
julia --project -e 'using Pkg; Pkg.instantiate()'

How to run the project

The R replication pipeline is orchestrated by code/main.R (entry point). It sets up the R environment (00-setup.R), cleans the data (01-clean_data.R), estimates the matching function (02-estimate_matching_function.R), estimates the demand (03-demand_estimation.R), prepares data for supply estimation (04-supply_estimation.R), prepares data for counterfactual analysis (05-counterfactual.R), and generates figures and tables (06-figures_tables.R).
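If you prefer a single entry point, the whole R pipeline can presumably be launched by sourcing main.R from the repository root, which runs the same steps in order (a minimal sketch; see the step-by-step instructions below for finer control):

```r
# Run the full R replication pipeline from the repository root.
source(file.path("code", "main.R"))
```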

The Julia replication pipeline is orchestrated by julia/main.jl (entry point).

To replicate the results in the paper, run in the following order:

Setup R environment

# load libraries and global variables
source(file.path("code", "00-setup.R"))
force_cache <- FALSE
  • Installs and loads required R packages (if a package is missing, it calls install.packages(..., dependencies = TRUE)).
  • Sets global options (e.g., sf_use_s2(FALSE)).
  • Defines global constants used across scripts, including
    • data_end_date: the last date of data to use. We choose 2017-04-09 because a new pricing structure, flat fare, was implemented on 2017-04-10.
    • publicholiday: public holidays in Singapore.
    • period_interval: the length of each period in minutes (5 minutes in this study).
    • locations_excluded: locations to exclude from the analysis due to near-zero demand.
    • airport_location_id: the location IDs of the airports.
    • work_period: the work period expressed in 5-minute periods.
    • work_period_h: the work period expressed in hours.
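For concreteness, the globals might be defined along the following lines. This is only an illustrative sketch: data_end_date and period_interval reflect the values documented above, while every other value is a hypothetical placeholder, not the actual content of 00-setup.R.

```r
# Illustrative sketch of the globals in 00-setup.R.
# Only data_end_date and period_interval are documented values;
# the rest are hypothetical placeholders.
data_end_date       <- as.Date("2017-04-09")  # flat fare started 2017-04-10
period_interval     <- 5                      # period length in minutes
publicholiday       <- as.Date(c("2017-01-01", "2017-01-28"))  # placeholders
locations_excluded  <- c(101L, 102L)          # placeholder location IDs
airport_location_id <- c(201L, 202L)          # placeholder location IDs
work_period         <- 1:288                  # placeholder: 5-min periods
work_period_h       <- 0:23                   # placeholder: hours
```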

Data preparation

Data preparation is implemented in code/01-clean_data.R and produces the cleaned trip-level dataset used in later steps. All output files are saved to data/cleaned data/ unless otherwise noted.

Run each step in order as follows.

source("code/01-clean_data.R")
save_data_full()

Load and combine raw street-hail + booking files, apply date filters (data_end_date), and save data_full.fst.

save_data_raw_final()

Clean data_full.fst (imputations/filters) and save data_raw_final.fst plus cleaning diagnostics (data_cns.fst, data_cleaning_process.fst).

save_data_raw_final_with_location_id()

Assign pickup/destination location_id (postcode geocoding + GPS + masterplan) and save data_raw_final_with_location_id.fst.

save_data_raw_final_with_location_id_with_shift_id()

Construct driver shifts (shift_id, 6-hour gap rule) and save the result as an .fst file with shift IDs.
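The 6-hour gap rule can be sketched as follows: sort each driver's trips by start time and open a new shift whenever the idle gap since the previous trip's end exceeds six hours. This is an illustrative helper, not the actual code in 01-clean_data.R; the data frame `trips` and its POSIXct columns are assumptions.

```r
# Sketch of the 6-hour gap rule for constructing shift_id.
# Assumes a data frame `trips` with columns driver_id,
# trip_start_dt, and trip_end_dt (POSIXct).
assign_shift_id <- function(trips, gap_hours = 6) {
  trips <- trips[order(trips$driver_id, trips$trip_start_dt), ]
  prev_end <- c(trips$trip_end_dt[1], head(trips$trip_end_dt, -1))
  gap <- as.numeric(trips$trip_start_dt - prev_end, units = "hours")
  # New shift at every driver change or whenever the gap exceeds 6 hours.
  new_shift <- c(TRUE, diff(as.integer(factor(trips$driver_id))) != 0) |
    gap > gap_hours
  trips$shift_id <- cumsum(new_shift)
  trips
}
```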

save_data_trip_final()

Create the analysis trip dataset (timing variables, work-period restriction, weekday/holiday filters) and save data_trip_final.fst.

Matching function estimation

Matching function estimation is implemented in 02-estimate_matching_function.R. It combines (i) GPS-based taxi availability and (ii) trip counts from data_trip_final.fst to recover matching parameters by location/time.

source("code/02-estimate_matching_function.R")
gen_veh_status_time_longlat()

Process GPS logs into time-location taxi availability and save veh_status_time_longlat.fst.
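Conceptually, this step bins the GPS logs into 5-minute periods and counts available (FREE) taxis by location and period. A simplified sketch, assuming a data frame `gps` with the columns documented above plus a precomputed location_id (the actual script derives locations from coordinates and shapefiles):

```r
# Sketch: count distinct FREE taxis per 5-minute period and location.
# Assumes `gps` has columns vehicle_id, log_dt (POSIXct),
# location_id, and veh_status.
count_free_taxis <- function(gps, period_minutes = 5) {
  # Integer index of the 5-minute period each log falls into.
  gps$period <- floor(as.numeric(gps$log_dt) / (period_minutes * 60))
  free <- gps[gps$veh_status == "FREE", ]
  aggregate(vehicle_id ~ period + location_id, data = free,
            FUN = function(x) length(unique(x)))
}
```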

prepare_data_for_matching()

Construct the estimation panel and save data_for_matching.fst.

estimate_matching_function()

Estimate matching parameters (Section 5.1, Table 5 in paper) and save recovered_lambdas_time_location.fst, alphas_region.csv, and recovered_alpha.rds.

Demand estimation

Demand estimation is implemented in code/03-demand_estimation.R (BLP estimation via BLPestimatoR package).

source("code/03-demand_estimation.R")
get_street_hail_avg_waiting()

Compute empirical average hail waiting time and save street_trips_average_waiting.t.o.fst.

get_public_transports_info()

Build public-transport OD proportions and save train_proportions_od.fst and bus_proportions_od.fst.

get_taxi_trips_info()

Build taxi OD matrices and save:

  • grid_data_mean.pickup_t.o.fst
  • db_mean.fare_t.o.d.fst
  • db_trip_count_t.o.d.fst
  • db_trip_dropoff_t.d.fst
  • db_mean.distance.duration_t.o.d.fst
get_failed_bookings_by_pickup_dest()

Build failed-booking counts and save failed_booking.t.o.d.fst.

save_neighbor_matrix()

Build adjacency + ID mapping and save neighbor_matrix.rds, location_id_mapping.rds, and consolidated_masterplan.rds.

prepare_data_for_demand()

Assemble the demand-estimation dataset (uses recovered_lambdas_time_location.fst) and save data_for_demand.fst.

demand_estimation_BLP()

Run BLP estimation (Table 6 in paper) and save to BLP results/:

  • data_for_BLP.rds
  • trips_data.rds
  • demand_est.rds

Supply estimation (R data preparation)

source("code/04-supply_estimation.R")
prepare_data_for_supply()

Prepare Julia inputs for supply estimation and save to julia/data_from_R/.

Counterfactual analysis

source("code/05-counterfactual.R")
get_booking_coordinates()

Attach driver/rider coordinates at booking time and save booking_trips_coordinates.fst.

get_around_drivers()

For each booking, collect nearby available drivers and save per-date .fst files to around drivers/.

counterfactual_nearest_driver()

Compute nearest-driver line distance by location-period and save nearest_driver_distance_date.fst and trips_pickup_distance.fst.
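The straight-line distance between a rider and nearby drivers can be computed from longitude/latitude pairs with the haversine formula. This is a minimal, self-contained sketch; the function name and example coordinates are illustrative, not taken from the repository.

```r
# Haversine great-circle distance (km) between points in decimal degrees.
# An illustrative helper, not the repository's actual code.
haversine_km <- function(lon1, lat1, lon2, lat2) {
  r <- 6371            # mean Earth radius in km
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))   # pmin guards against rounding above 1
}

# e.g. Raffles Place to Changi Airport, roughly 17-18 km:
# haversine_km(103.8519, 1.2839, 103.9893, 1.3644)
```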

counterfactual_estimate_pickupdistance()

Calibrate line distance to driving distance/time and save speed_pickupdistance_allocation_ot.fst.

prepare_data_for_counterfactual()

Build Julia pickup-time/cost arrays for counterfactuals and save supply_estimation_data_counterfactual.rds.

Figures and tables (R)

source("code/06-figures_tables.R")
table_summary_statistics()

Generate summary statistics (Table 1 in paper).

table_matching_inefficiency()

Display recovered matching inefficiency parameters by region (Table 5 in paper).

Supply estimation and counterfactual (Julia)

After the R pipeline completes, run the Julia supply estimation on an HPC cluster. The estimation uses a grid search with 64 different starting values (array job).

HPC cluster submission

Submit the array job using the shell script:

qsub julia/shell/main_julia.sh

The shell script (julia/shell/main_julia.sh) is configured for SGE clusters with:

  • 72-hour runtime limit, 40GB memory per task
  • Array job with 64 tasks (-t 1-64)
  • Each task runs julia --project julia/main.jl "outer-iter-150" 1 false
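Putting those pieces together, the submission script looks roughly like the following SGE skeleton. The directive values come from the description above; the module line and the use of SGE_TASK_ID to differentiate tasks are assumptions about the cluster setup, and the actual julia/shell/main_julia.sh may differ.

```sh
#!/bin/bash -l
#$ -l h_rt=72:0:0   # 72-hour runtime limit
#$ -l mem=40G       # 40 GB memory per task
#$ -t 1-64          # array job: one task per starting value
#$ -cwd

module load julia   # illustrative; load Julia however the cluster requires
# Each task presumably reads SGE_TASK_ID to select its starting value.
julia --project julia/main.jl "outer-iter-150" 1 false
```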

Post-estimation analysis

After all 64 estimation tasks complete, run the post-estimation analysis to:

  1. Select the best parameter estimates across starting values
  2. Compute standard errors (Table 8, Panel A)
  3. Generate model fit comparison (Table 8, Panel B)
  4. Plot convergence figures (Figure 3)
  5. Run counterfactual analysis (Table 9)
  6. Generate appendix figures
julia --project julia/main.jl "<JOB_ID>_outer-iter-150" 8 true

Replace <JOB_ID> with the job ID assigned by the HPC cluster (e.g., 138534).
