Proseg (probabilistic segmentation) is a cell segmentation method for spatial transcriptomics. Xenium, CosMx, MERSCOPE, and Visium HD platforms are currently supported, but it can be easily adapted to others.

Read the paper:
And the Research Brief:
Proseg can be built and installed with cargo by running.
cargo install proseg
The easiest way to install cargo for most is rustup.
It can also be build manually from source, which is useful mainly if you want to try a specific revision or make changes
git clone https://github.com/dcjones/proseg.git
cd proseg
cargo build --release
Proseg can then be run with:
target/release/proseg
Proseg 3 has a few changes that users of earlier versions should be aware of.
- By default proseg will output a
spatialdata zarr directory
that can be read by the spatiadata package in python. There are many other output
options still, but these are disabled by default.
- This is admittedly a little less convenience for R users. My recommendation would
be to convert the AnnData part to h5ad in python like:
then read this h5ad file into R with zellkonverter. * The
import spatialdata sdata = spatialdata.read_zarr("proseg-output.zarr") sdata.tables["table"].write_h5ad("proseg-anndata.h5ad")
proseg-to-baysor
command now operates on these zarr directories.
- This is admittedly a little less convenience for R users. My recommendation would
be to convert the AnnData part to h5ad in python like:
- By default, count matrices generated by Proseg 3 are integer point estimates, not continuous expected counts, which should make some downstream analysis simpler.
- Some simplifications were made to the model and sampling procedure. Now the
sampling schedule is controlled with these four arguments:
--burnin-samples
,--samples
giving the number of iterations, and--burnin-voxel-size
and--voxel-size
giving the x/y size of the voxels in microns. The burn in voxel size must be an integer multiple of the final voxel size. - The
--nbglayers
arguments has been removed. There is now just one--voxel-layers
argument controlling how many voxels are stacked on the z-axis. - The voxel morphology prior has been changed. Instead of
--perimeter-bound
and--perimeter-eta
, there is one--cell-compactness
argument, where smaller numbers lead to more compact (equivalently, more circular) cells.
Proseg is run on a table of transcript positions which in some form must include preliminary assignments of transcripts to nuclei or cells. Xenium, CosMx, and MERSCOPE all provide this out of the box in some form.
Proseg is invoked from the command line like:
proseg [arguments...] /path/to/transcripts.csv.gz
The method is general purpose. There are command line arguments to tell it which
columns in the csv file to use, and how they should be interpreted, but
typically one of the presets --xenium
, --cosmx
, --merfish
, or --visiumhd
are used for common platforms.
Proseg is a sampling method, and in its current form in non-deterministic. From run to run, results will vary slightly, but not in a way that would seriously affect the interpretation of the data.
To see a list of command line arguments, run
proseg --help
Most of these can be disregarded. The most relevent ones will be described below:
The spatialdata zarr output that proseg generates can be read with
import spatialdata
sdata = spatialdata.read_zarr("proseg-output.zarr")
This object contains:
- Transcript positions and metadatai in
sdata.points["transcripts"]
. (This can use significant space, so can be excluded if not needed with--exclude-spatialdata-transcripts
). - Cell polygons in
sdata.shapes["cell_boundaries"]
- Cell level information in AnnData format in
sdata.tables["table"]
, which contains:- Cell metadata in
obs
- Gene metadata in
var
- Sparse cell-by-gene count matrix in
X
- Cell centroids in
obsm["spatial"]
- Some information about the proseg run in
uns["proseg_run]
.
- Cell metadata in
--nthreads N
sets the number of threads to parallelize across. By default proseg will use all available CPU cores, which may be a bad idea on a shared machine.--output-spatialdata output.zarr
: Proseg will output a spatialdata zarr directory which can be read by the spatialdata python package and contains all metadata, count matrix, and cell geometry.--overwrite
: An existing zarr directory will not be overwritten unless this argument is passed.--voxel-layers N
: Number of layers on the z-axis to model 3D cells.--samples N
: Run the sampler for this N iterations.--burnin-samples N
: Run the sampler for a preliminary N samples at a lower resolution.--voxel-size S
: Voxel size in microns on the x/y axis.--burnin-voxel-size S
: Larger voxel size to use for the burn-in phase. (This must be an integer multiple of the final voxel size).
In addition to the spatialdata zarr output, results can be written to separate number of tables, which can be either gzipped csv files or parquet files, and GeoJSON files giving cell boundaries.
--output-counts counts.mtx.gz
: Output a cell-by-gene count matrix in gziped matrix market format. (Which can be read with e.g.mmread
in scipy).--output-expected-counts expected-counts.mtx.gz
: Output an expected count matrix, where the counts are non-integer estimates from taking the mean over multiple samples.--output-cell-metadata cell-metadata.csv.gz
: Cell centroids, volume, and other information.--output-transcript-metadata transcript-metadata.csv.gz
: Transcript ids, genes, revised positions, assignment probability, etc.--output-gene-metadata gene-metadata.csv.gz
: Per-gene summary statistics--output-rates rates.csv.gz
: Cell-by-gene Poisson rate parameters. These are essentially expected relative expression values, but may be too overly-smoothed for use in downstream analysis.
Cell boundaries can be output a number of ways:
--output-cell-polygons cell-polygons.geojson.gz
: 2D consensus polygons for each cell in GeoJSON format. These are flattened from 3D, which each xy position assigned to the dominant cell.--output-cell-polygon-layers cell-polygons-layers.geojson.gz
: Output a separate, non-overlapping cell polygon for each z-layer, preserving 3D segmentation.
A number of options can alter assumptions made by the model, which generally should not need
--ncomponents 10
: Cell gene expression is a modeled as a mixture of negative binomial distributions. This parameter controls the number of mixture components. More components will tend to nudge the cells into more distinct types, but setting it too high risks manifesting cell types that are not real.--no-diffusion
: By default Proseg models cells as leaky, under the assumption that some amount of RNA leaks from cells and diffuses elsewhere. This seems to be the case in much of the Xenium data we've seen, but could be a harmfully incorrect assumption in some data. This argument disables that part of the model.--diffusion-probability 0.2
: Prior probability of a transcript is diffused and should be repositioned.--diffusion-sigma-far 4
: Prior standard deviation on repositioning distance of diffused transcripts.--diffusion-sigma-near 1
: Prior standard deviation on repositioning distance of non-diffused transcripts..--nuclear-reassignment_prob 0.2
: Prior probability that the initial nuclear assignment (if any) is incorrect.--cell-compactness 0.03
: Larger numbers allow less spherical cells.
- If the dataset consists of multiple unrelated tissue sections, it's generally safer to split these and segment them separately. This avoids potential sub-optimal modeling due to different z-coordinate distributions.
- If proseg crashes without any kind of error message, or if it suddenly starts running extremely slowly, that's usually a sign that it ran out of memory. The best option would be to run on a system with more memory, but increasing the voxel size, lowering the number of voxel layers, and disabling diffusion will all reduce the memory usage, typically at the cost of lower transcript assignment accuracy.
- Proseg is aided by doing local repositioning transcripts (I call this
"transcript repo" or "transcript diffusion"). If your skeptical that
transcripts leak or diffuse, it's still useful to allow some amount of
repositioning to overcome limitations with voxel resolution. Setting
--diffusion-probability 0.0
will still let transcripts make small scale adjustments, unlike--no-diffusion
which will completely disable that part of the model.
Proseg will work on Xenium transcript tables in either csv.gz
or parquet
format. The latter will be slightly more efficient to read.
proseg --xenium transcripts.csv.gz
# or
proseg --xenium transcripts.parquet
After segmenting with Proseg, the data can be converted back to format readable by Xenium Explorer.
- First convert the proseg output to a format matching Baysor's using the included
proseg-to-baysor
command:sh proseg-to-baysor proseg-output.zarr \ --output-transcript-metadata proseg-to-baysor-transcript-metadata.csv --output-cell-polygons proseg-to-baysor-cell-polygons.geojson
- This can then be converted to a xenium bundle with Xenium Ranger.
sh xeniumranger import-segmentation \ --id sample_id \ --xenium-bundle /path/to/xenium/bundle \ --transcript-assignment proseg-to-baysor-transcript-metadata.csv \ --viz-polygons proseg-to-baysor-cell-polygons.geojson \ --units=microns
where/path/to/xenium/bundle
is the original xenium bundle. - Xenium Explorer should be able to read the resulting xenium bundle which
will be written to a directory named whatever is passed to
--id
.
Known issues:
- Problems importing can arise if transcripts with qv scores below 20 are not filtered out. This is done by default in Proseg, but lowering this cutoff could cause issues.
- Earlier versions of Xenium Ranger/Explorer tend to mangle polygons generated by Proseg. Since Proseg is voxel based, any cell boundary that isn't axis aligned (i.e. composed of vertical and horizontal line segments), is due to Xenium software mis-rendering them.
Proseg works on CosMx transcript table with
proseg --cosmx sample_tx_file.csv
Earlier versions of CosMx did not automatically provide a single table of global
transcript positions. To work around this, we provide a Julia program in
extra/stitch-cosmx.jl
to construct a table from the flat files downloaded from
AtoMx.
To run this, some dependencies are required, which can be installed with
julia -e 'import Pkg; Pkg.add(["Glob", "CSV", "DataFrames", "CodecZlib", "ArgParse"])'
Then the program can be run with like
julia stitch-cosmx.jl /path/to/cosmx-flatfiles transcripts.csv.gz
to output a complete transcripts table to transcripts.csv.gz
.
From here proseg can be run with (note the --cosmx-micron
not --cosmx
, since this format differs slightly)
proseg --cosmx-micron transcripts.csv.gz
Proseg should work on the provide transcripts table with --merscope
:
proseg --merscope detected_transcripts.cs.gz
Running VisiumHD on Proseg is slightly more complicated that in situ methods. First, it needs
to read a sparse count matrix across bins/squares. Instead of passing a transcripts table to proseg,
it should be pointed to the binned_outputs
directory
Proseg still requires some form of registered image segmentation. The easiest way is
to use the tools provided by Space Ranger to handle segmentation in image
Visium HD,
which provides a barcode_mappings.parquet
file that Proseg can read.
proseg --visiumhd
--spaceranger-barcode-mappings /path/to/barcode_mappings.parquet \
/path/to/binned_outputs \
Alternatively, it's possible, though trickier to use a Cellpose mask, assuming it's correctly registered.
proseg --visiumhd
--cellpose-masks /path/to/cellpose_masks.npy \
/path/to/binned_outputs \
There isn't a standardized cellpose output format. See the extras/cellpose-xenium.py
file for code
to run cellpose and output to format proseg can read.
Typically Proseg is run with prior segmentation in the form of a transcript table with prior cell assignments. It's also possible to initialize directly using Cellpose (or similar) masks, and there is potentially some benefit to doing this, because it can provide a more detailed pixel-level prior to proseg.
This does require more effort and care, since it's critical that coordinates are properly registered and scaled from mask pixels to transcripts micron positions. There are the relevant arguments.
--cellpose-masks masks.npy.gz
: An (optionally gzipped) npy file with a 2d masks array with uint32 type, where 0 represents unassigned state and[1, ncells]
gives the pixels cell assignment.--cellpose-cellprobs cellprobs.npy.gz
: An (optionally gzipped) npy file giving segmentation uncertainty. This should be a float32 array matching the dimensionality of the masks array.--cellpose-scale 0.324
: Microns per pixel of the cellpose mask.--cellpose-x-transform a b c
: Affine transformation from pixels to microns for the x-coordinate (wherex_micron = a*x_pixel + b*y_pixel + c
)--cellpose-y-transform a b c
: Affine transformation from pixels to microns from the y-coordinate (wherey_micron = a*x_pixel + b*y_pixel + c
)
Don't hesitate to open an issue or to email me (email is in my bio). This data is complex and varied. Weird cases do arise which may not be correctly considered by Proseg, so I appreciate being made aware of them.