Citibike-Sampler

A Python tool to facilitate work with data from NYC's Citi Bike network.

Why use this?

Data from the 'Citi Bike' system in NYC captures real-world patterns of urban mobility at very high resolution. As such, the data is widely used in research and practical applications.

However, working with the raw source data can be tedious. In a single year, the Citi Bike system records tens of millions of bike rides, amounting to several GB of data. Furthermore, historical trip records are spread over hundreds of CSV files whose archive layout has changed over time (annual bundles before 2024, monthly archives after).

Citibike-Sampler streamlines your workflow by providing:

a convenient data downloader with consistent local caching;
a data loader for accessing the full trip records; and
a random sampler to draw representative subsets of the full Citi Bike data spanning multiple months or years.

Random sampling allows you to quickly explore multi-year trends in the Citi Bike data, without having to load hundreds of millions of records into memory.
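The memory argument above can be illustrated with a minimal sketch that samples one chunk at a time and keeps only the sampled rows. This is not Citibike-Sampler's actual implementation: the helper `sample_in_chunks`, the per-chunk seeding scheme, and the synthetic "monthly" frames are all assumptions made for illustration.

```python
import pandas as pd


def sample_in_chunks(chunks, fraction, seed=None):
    """Draw roughly `fraction` of the rows from each chunk and concatenate.

    `chunks` is any iterable of DataFrames (for example, one per monthly
    file), so only one chunk plus the accumulated sample is in memory.
    Hypothetical helper, not part of the citibike_sampler package.
    """
    parts = []
    for i, chunk in enumerate(chunks):
        # Derive a distinct but reproducible seed for every chunk.
        chunk_seed = None if seed is None else seed + i
        parts.append(chunk.sample(frac=fraction, random_state=chunk_seed))
    return pd.concat(parts, ignore_index=True)


# Synthetic stand-in for three monthly files of 10,000 rides each.
months = (
    pd.DataFrame({"ride_id": range(m * 10_000, (m + 1) * 10_000), "month": m})
    for m in range(3)
)
subset = sample_in_chunks(months, fraction=0.01, seed=42)
print(len(subset), "sampled rows out of 30,000")
```

Because every chunk is sampled at the same fraction, each month contributes to the subset in proportion to its size, which is what keeps multi-month trends recognisable in the sample.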

Installation

pip

Citibike-Sampler is available on PyPI and can be installed using pip:

pip install citibike-sampler

pipx (for CLI use)

If you only need data sampling from the command line, installation is best done using pipx:

pipx install git+https://github.com/lungoruscello/Citibike-Sampler.git

Usage

Python API

from citibike_sampler import sample, load_all, get_cache_dir

# Randomly sample 1% of all trip records from the first half of 2025.
# (Will automatically download data from AWS if not already cached.)
sample_df = sample(start='2025-1', end='2025-6', fraction=0.01, seed=42)

# Plot daily aggregates of sampled trips (assumes matplotlib is available)
sample_df.set_index('ended_at').resample('1D').ride_id.count().plot()

# Load the full dataset (be careful: millions of rides per month!)
full_df = load_all(start='2025-1', end='2025-6') 

print(len(sample_df) / len(full_df))  # check the sampling fraction

print(get_cache_dir())  # inspect the local cache location
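The daily aggregation step can be tried without downloading anything. The sketch below builds a small synthetic frame that reuses the `ride_id` and `ended_at` column names from the snippet above; the data itself is made up. Whether the library returns `ended_at` as parsed datetimes is not stated here, so the `pd.to_datetime` call is a defensive assumption (it is a no-op on columns that are already datetimes).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
start = pd.Timestamp("2025-01-01")

# Random ride end times spread over 30 days (synthetic data).
offsets = pd.to_timedelta(rng.integers(0, 30 * 24 * 3600, size=n), unit="s")
toy_df = pd.DataFrame({
    "ride_id": [f"ride{i}" for i in range(n)],
    "ended_at": start + offsets,
})

# resample() needs a datetime index, so parse the column defensively.
toy_df["ended_at"] = pd.to_datetime(toy_df["ended_at"])

# Count rides per calendar day, as in the plotting one-liner above.
daily = toy_df.set_index("ended_at").resample("1D")["ride_id"].count()
print(daily.head())
```

Calling `daily.plot()` on the result produces the same kind of chart as the one-liner, provided matplotlib is installed.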

CLI

Generate a random sample of Citi Bike data directly from the terminal:

cbike_sampler --start 2025-1 --end 2025-6 --fraction 0.01 --seed 42 --output sampled.csv

This will create a sampled.csv file containing roughly 1% of all trip records from the first half of 2025. To store the sampling result as a Feather or Parquet file, simply change the suffix of the output filename accordingly (e.g., sampled.parquet).
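To read the output back into pandas regardless of the chosen format, a small suffix-based loader may help. This is a sketch only: `read_sample` is a hypothetical helper, and the exact suffixes the CLI recognises (assumed here to be `.csv`, `.parquet` and `.feather`) should be checked against the tool itself. Reading Parquet or Feather requires pyarrow.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Map file suffixes to pandas readers (suffix list is an assumption).
_READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}


def read_sample(path):
    """Load a sampled file, picking the pandas reader from its suffix."""
    path = Path(path)
    try:
        reader = _READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file suffix: {path.suffix!r}") from None
    return reader(path)


# Round-trip a tiny CSV to show the helper in action.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "sampled.csv"
    pd.DataFrame({"ride_id": ["a", "b"]}).to_csv(demo_path, index=False)
    demo_df = read_sample(demo_path)
print(demo_df)
```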

Requirements

Python 3.9 or higher
requests
pandas
tqdm
pyarrow (optional, for Parquet/Feather export)

Licence

MIT Licence. See LICENSE.txt for details.
