
Proseg

Proseg (probabilistic segmentation) is a cell segmentation method for spatial transcriptomics. Xenium, CosMx, MERSCOPE, and Visium HD platforms are currently supported, but it can be easily adapted to others.


Read the paper:

πŸ—Ž Jones, D.C., Elz, A.E., Hadadianpour, A. et al. Cell simulation as cell segmentation. Nat Methods (2025). https://doi.org/10.1038/s41592-025-02697-0

And the Research Brief:

πŸ—Ž Confronting the challenge of cell segmentation in spatial transcriptomics. Nat Methods (2025). https://doi.org/10.1038/s41592-025-02717-z


Installing

Proseg can be built and installed with cargo by running:

cargo install proseg

For most users, the easiest way to install cargo is with rustup.

From source

It can also be built manually from source, which is useful mainly if you want to try a specific revision or make changes:

git clone https://github.com/dcjones/proseg.git
cd proseg
cargo build --release

Proseg can then be run with:

target/release/proseg

Migrating to Proseg 3

Proseg 3 has a few changes that users of earlier versions should be aware of.

  1. By default proseg will output a spatialdata zarr directory that can be read by the spatialdata package in Python. There are still many other output options, but they are disabled by default.
    • This is admittedly a little less convenient for R users. My recommendation is to convert the AnnData part to h5ad in Python like:
      import spatialdata
      sdata = spatialdata.read_zarr("proseg-output.zarr")
      sdata.tables["table"].write_h5ad("proseg-anndata.h5ad")
      Then read this h5ad file into R with zellkonverter.
    • The proseg-to-baysor command now operates on these zarr directories.
  2. By default, count matrices generated by Proseg 3 are integer point estimates, not continuous expected counts, which should make some downstream analysis simpler.
  3. Some simplifications were made to the model and sampling procedure. The sampling schedule is now controlled with four arguments: --burnin-samples and --samples give the numbers of iterations, and --burnin-voxel-size and --voxel-size give the x/y size of the voxels in microns. The burn-in voxel size must be an integer multiple of the final voxel size.
  4. The --nbglayers argument has been removed. There is now just one --voxel-layers argument controlling how many voxels are stacked on the z-axis.
  5. The voxel morphology prior has been changed. Instead of --perimeter-bound and --perimeter-eta, there is one --cell-compactness argument, where smaller numbers lead to more compact (equivalently, more circular) cells.

Usage

Proseg is run on a table of transcript positions, which must include preliminary assignments of transcripts to nuclei or cells in some form. Xenium, CosMx, and MERSCOPE all provide this out of the box.

Proseg is invoked from the command line like:

proseg [arguments...] /path/to/transcripts.csv.gz

The method is general purpose: there are command line arguments to tell it which columns in the csv file to use and how they should be interpreted, but typically one of the presets --xenium, --cosmx, --merfish, or --visiumhd is used for the common platforms.

Proseg is a sampling method, and in its current form is non-deterministic. From run to run, results will vary slightly, but not in a way that should seriously affect the interpretation of the data.

To see a list of command line arguments, run

proseg --help

Most of these can be disregarded. The most relevant ones are described below.

Spatialdata output format

The spatialdata zarr output that proseg generates can be read with

import spatialdata
sdata = spatialdata.read_zarr("proseg-output.zarr")

This object contains the following (a short example of inspecting these components appears after the list):

  • Transcript positions and metadata in sdata.points["transcripts"]. (This can use significant space, so it can be excluded if not needed with --exclude-spatialdata-transcripts.)
  • Cell polygons in sdata.shapes["cell_boundaries"]
  • Cell level information in AnnData format in sdata.tables["table"], which contains:
    • Cell metadata in obs
    • Gene metadata in var
    • Sparse cell-by-gene count matrix in X
    • Cell centroids in obsm["spatial"]
    • Some information about the proseg run in uns["proseg_run"].
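
A short example of inspecting these components in Python (this is only a sketch, using the default proseg-output.zarr path from above):

import spatialdata

sdata = spatialdata.read_zarr("proseg-output.zarr")

adata = sdata.tables["table"]               # AnnData with cell-level results
print(adata.obs.head())                     # per-cell metadata
print(adata.var.head())                     # per-gene metadata
counts = adata.X                            # sparse cell-by-gene count matrix
centroids = adata.obsm["spatial"]           # cell centroid coordinates
run_info = adata.uns["proseg_run"]          # information about this proseg run

cell_polygons = sdata.shapes["cell_boundaries"]    # cell boundary polygons
transcripts = sdata.points["transcripts"]          # transcript positions and metadata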

General arguments

  • --nthreads N sets the number of threads to parallelize across. By default proseg will use all available CPU cores, which may be a bad idea on a shared machine.
  • --output-spatialdata output.zarr: Proseg will output a spatialdata zarr directory which can be read by the spatialdata python package and contains all metadata, count matrix, and cell geometry.
  • --overwrite: An existing zarr directory will not be overwritten unless this argument is passed.
  • --voxel-layers N: Number of layers on the z-axis to model 3D cells.
  • --samples N: Run the sampler for N iterations.
  • --burnin-samples N: Run the sampler for a preliminary N samples at a lower resolution.
  • --voxel-size S: Voxel size in microns on the x/y axis.
  • --burnin-voxel-size S: Larger voxel size to use for the burn-in phase. (This must be an integer multiple of the final voxel size).

Output arguments

In addition to the spatialdata zarr output, results can be written to a number of separate tables, which can be either gzipped csv files or parquet files, and to GeoJSON files giving cell boundaries.

  • --output-counts counts.mtx.gz: Output a cell-by-gene count matrix in gzipped matrix market format, which can be read with e.g. mmread in scipy (see the sketch after this list).
  • --output-expected-counts expected-counts.mtx.gz: Output an expected count matrix, where the counts are non-integer estimates from taking the mean over multiple samples.
  • --output-cell-metadata cell-metadata.csv.gz: Cell centroids, volume, and other information.
  • --output-transcript-metadata transcript-metadata.csv.gz: Transcript ids, genes, revised positions, assignment probability, etc.
  • --output-gene-metadata gene-metadata.csv.gz: Per-gene summary statistics
  • --output-rates rates.csv.gz: Cell-by-gene Poisson rate parameters. These are essentially expected relative expression values, but may be overly smoothed for use in downstream analysis.
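
As a sketch, the count matrix and metadata tables can be loaded in Python like this (filenames follow the examples above; the exact columns of the metadata tables are not enumerated here):

import pandas as pd
from scipy.io import mmread

# Cell-by-gene counts in matrix market format (mmread also handles the gzipped file).
counts = mmread("counts.mtx.gz").tocsr()

# The separate metadata tables carry the row (cell) and column (gene) annotations.
cell_metadata = pd.read_csv("cell-metadata.csv.gz")
gene_metadata = pd.read_csv("gene-metadata.csv.gz")

print(counts.shape, len(cell_metadata), len(gene_metadata))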

Cell boundaries can be output in a number of ways:

  • --output-cell-polygons cell-polygons.geojson.gz: 2D consensus polygons for each cell in GeoJSON format. These are flattened from 3D, with each xy position assigned to the dominant cell. (See the snippet after this list for reading these.)
  • --output-cell-polygon-layers cell-polygons-layers.geojson.gz: Output a separate, non-overlapping cell polygon for each z-layer, preserving 3D segmentation.
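
The gzipped GeoJSON polygons can be read with standard geospatial libraries, or with just the Python standard library as in this sketch (filename from the example above; the exact feature properties are not documented here):

import gzip
import json

with gzip.open("cell-polygons.geojson.gz", "rt") as f:
    feature_collection = json.load(f)

# One feature per cell; the geometry is a polygon in the same micron
# coordinate system as the rest of proseg's output.
for feature in feature_collection["features"][:5]:
    print(feature["geometry"]["type"], feature["properties"])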

Model arguments

A number of options alter assumptions made by the model. These generally should not need to be changed.

  • --ncomponents 10: Cell gene expression is modeled as a mixture of negative binomial distributions. This parameter controls the number of mixture components. More components will tend to nudge the cells into more distinct types, but setting it too high risks manifesting cell types that are not real.
  • --no-diffusion: By default Proseg models cells as leaky, under the assumption that some amount of RNA leaks from cells and diffuses elsewhere. This seems to be the case in much of the Xenium data we've seen, but could be a harmfully incorrect assumption in some data. This argument disables that part of the model.
  • --diffusion-probability 0.2: Prior probability that a transcript is diffused and should be repositioned.
  • --diffusion-sigma-far 4: Prior standard deviation on repositioning distance of diffused transcripts.
  • --diffusion-sigma-near 1: Prior standard deviation on repositioning distance of non-diffused transcripts.
  • --nuclear-reassignment-prob 0.2: Prior probability that the initial nuclear assignment (if any) is incorrect.
  • --cell-compactness 0.03: Larger numbers allow less spherical cells.

General advice

  • If the dataset consists of multiple unrelated tissue sections, it's generally safer to split these and segment them separately. This avoids potential sub-optimal modeling due to different z-coordinate distributions.
  • If proseg crashes without any kind of error message, or if it suddenly starts running extremely slowly, that's usually a sign that it ran out of memory. The best option would be to run on a system with more memory, but increasing the voxel size, lowering the number of voxel layers, and disabling diffusion will all reduce the memory usage, typically at the cost of lower transcript assignment accuracy.
  • Proseg is aided by locally repositioning transcripts (I call this "transcript repositioning" or "transcript diffusion"). If you're skeptical that transcripts leak or diffuse, it's still useful to allow some amount of repositioning to overcome limitations of the voxel resolution. Setting --diffusion-probability 0.0 will still let transcripts make small-scale adjustments, unlike --no-diffusion, which completely disables that part of the model.

Xenium

Proseg will work on Xenium transcript tables in either csv.gz or parquet format. The latter will be slightly more efficient to read.

proseg --xenium transcripts.csv.gz
# or
proseg --xenium transcripts.parquet

Importing into Xenium Explorer

After segmenting with Proseg, the data can be converted back to a format readable by Xenium Explorer.

  1. First convert the proseg output to a format matching Baysor's using the included proseg-to-baysor command:

     proseg-to-baysor proseg-output.zarr \
         --output-transcript-metadata proseg-to-baysor-transcript-metadata.csv \
         --output-cell-polygons proseg-to-baysor-cell-polygons.geojson

  2. This can then be converted to a Xenium bundle with Xenium Ranger:

     xeniumranger import-segmentation \
         --id sample_id \
         --xenium-bundle /path/to/xenium/bundle \
         --transcript-assignment proseg-to-baysor-transcript-metadata.csv \
         --viz-polygons proseg-to-baysor-cell-polygons.geojson \
         --units=microns

     where /path/to/xenium/bundle is the original Xenium bundle.

  3. Xenium Explorer should be able to read the resulting Xenium bundle, which will be written to a directory named whatever is passed to --id.

Known issues:

  • Problems importing can arise if transcripts with qv scores below 20 are not filtered out. This is done by default in Proseg, but lowering this cutoff could cause issues.
  • Earlier versions of Xenium Ranger/Explorer tend to mangle polygons generated by Proseg. Since Proseg is voxel based, its cell boundaries are always axis aligned (i.e. composed of vertical and horizontal line segments); any boundary that isn't is due to the Xenium software mis-rendering it.

CosMx

Proseg works on a CosMx transcript table with

proseg --cosmx sample_tx_file.csv

Legacy CosMx data

Earlier versions of CosMx did not automatically provide a single table of global transcript positions. To work around this, we provide a Julia program in extra/stitch-cosmx.jl to construct a table from the flat files downloaded from AtoMx.

To run this, some dependencies are required, which can be installed with

julia -e 'import Pkg; Pkg.add(["Glob", "CSV", "DataFrames", "CodecZlib", "ArgParse"])'

The program can then be run like

julia stitch-cosmx.jl /path/to/cosmx-flatfiles transcripts.csv.gz

to output a complete transcripts table to transcripts.csv.gz.

From here proseg can be run as follows (note --cosmx-micron rather than --cosmx, since this format differs slightly)

proseg --cosmx-micron transcripts.csv.gz

MERSCOPE

Proseg should work on the provided transcripts table with --merscope:

proseg --merscope detected_transcripts.csv.gz

VisiumHD

Running Proseg on Visium HD is slightly more complicated than for the in situ methods. First, it needs to read a sparse count matrix over bins/squares: instead of passing a transcripts table to proseg, point it at the binned_outputs directory.

Proseg still requires some form of registered image segmentation. The easiest way is to use the image segmentation tools provided by Space Ranger for Visium HD, which produce a barcode_mappings.parquet file that Proseg can read.

proseg --visiumhd \
    --spaceranger-barcode-mappings /path/to/barcode_mappings.parquet \
    /path/to/binned_outputs

Alternatively, it's possible, though trickier, to use a Cellpose mask, assuming it's correctly registered.

proseg --visiumhd \
    --cellpose-masks /path/to/cellpose_masks.npy \
    /path/to/binned_outputs

There isn't a standardized cellpose output format. See the extras/cellpose-xenium.py file for code to run cellpose and output to a format proseg can read.

Initializing using Cellpose masks

Typically Proseg is run with a prior segmentation in the form of a transcript table with prior cell assignments. It's also possible to initialize directly from Cellpose (or similar) masks, and there is potentially some benefit to doing this, because it can provide a more detailed pixel-level prior to Proseg.

This does require more effort and care, since it's critical that coordinates are properly registered and scaled from mask pixels to transcript micron positions. These are the relevant arguments:

  • --cellpose-masks masks.npy.gz: An (optionally gzipped) npy file with a 2D masks array of uint32 type, where 0 represents the unassigned state and values in [1, ncells] give each pixel's cell assignment. (A short example of writing such a file follows this list.)
  • --cellpose-cellprobs cellprobs.npy.gz: An (optionally gzipped) npy file giving segmentation uncertainty. This should be a float32 array matching the dimensionality of the masks array.
  • --cellpose-scale 0.324: Microns per pixel of the cellpose mask.
  • --cellpose-x-transform a b c: Affine transformation from pixels to microns for the x-coordinate (where x_micron = a*x_pixel + b*y_pixel + c)
  • --cellpose-y-transform a b c: Affine transformation from pixels to microns for the y-coordinate (where y_micron = a*x_pixel + b*y_pixel + c)
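
As an illustration of the expected mask format only (the input filename and label array here are hypothetical; see extras/cellpose-xenium.py for the script actually used), converting a Cellpose label array into a gzipped npy file might look like:

import gzip
import numpy as np

# Hypothetical Cellpose output: a 2D label array with 0 for background
# and 1..ncells for cell labels.
masks = np.load("cellpose_raw_masks.npy")

# Proseg expects uint32, with 0 meaning "unassigned".
masks = masks.astype(np.uint32)

with gzip.open("masks.npy.gz", "wb") as f:
    np.save(f, masks)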

Getting help

Don't hesitate to open an issue or to email me (email is in my bio). This data is complex and varied. Weird cases do arise which may not be handled correctly by Proseg, so I appreciate being made aware of them.
