added rxrx3-core #13

Open
wants to merge 2 commits into
base: trunk
1 change: 1 addition & 0 deletions README.md
Expand Up @@ -8,3 +8,4 @@ for public research by [Recursion Pharmaceuticals](recursionpharma.com).
- [RxRx19a -- Morphological Imaging Dataset of SARS-CoV-2 viral infection](/rxrx19a)
- [RxRx19b -- Morphological Imaging of the COVID-19-associated cytokine storm](/rxrx19b)
- [RxRx3 -- Phenomics Map of Biology](/rxrx3)
- [RxRx3-core -- Benchmarking Dataset for Phenomic Embedding Models](/rxrx3-core)
93 changes: 93 additions & 0 deletions rxrx3-core/README.md
@@ -0,0 +1,93 @@
# RxRx3-core

At Recursion, we build maps of biology and chemistry to explore uncharted areas of disease biology, unravel its complexity, and industrialize drug discovery. Just as a map helps to navigate the physical world, our maps are designed to help us understand as much as we can about the connectedness of human biology so we can navigate the path to new medicines more efficiently.

Previously, we released [RxRx3](./../rxrx3/), a publicly available map of biology that contains images and deep learning-based embeddings for 17,063 genetic knockouts, as well as 1,674 known chemical entities at 8 doses each. RxRx3 is over 100 TB and 16,328 of its genes are anonymized, making it difficult to use as a benchmark for the research community. To address this, we're releasing **RxRx3-core**, a compressed subset of RxRx3 containing only unblinded perturbations (735 genetic knockouts and all 1,674 known chemical entities), along with a set of associated benchmarking tasks available on [GitHub](https://github.com/recursionpharma/EFAAR_benchmarking).

**RxRx3-core is only 18 GB and easily accessible via [Hugging Face](https://huggingface.co/datasets/recursionpharma/rxrx3-core)**.


## Metadata

The metadata can be found in `metadata_rxrx3_core.csv` and downloaded [from here](https://huggingface.co/datasets/recursionpharma/rxrx3-core/blob/main/metadata_rxrx3_core.csv). The schema of the metadata is as follows:

| Attribute         | Description                                                                            |
|-------------------|----------------------------------------------------------------------------------------|
| well_id           | Experiment Name - Plate - Well (compound-004_1_AA04 or gene-088_9_Z43)                 |
| experiment_name   | Experiment Name: Experiment number (compound-004 or gene-088)                          |
| plate             | Plate number in the experiment (1-48)                                                  |
| address           | Well location on the plate - "A01" to "AF48"                                           |
| gene              | Unblinded or anonymized gene name, or a control                                        |
| treatment         | Compound synonym or gene-name - guide-number (Narlaprevir or `<gene_name>_guide_1`)    |
| SMILES            | Canonical SMILES or blank for non-compounds                                            |
| concentration     | Compound concentration tested (in µM)                                                  |
| perturbation_type | CRISPR or COMPOUND                                                                     |
| cell_type         | HUVEC                                                                                  |
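
As a minimal sketch of how the `well_id` field decomposes, the snippet below splits it back into the three parts named in the schema. The `WellId` type and `parse_well_id` helper are hypothetical names introduced for illustration, and the code assumes that none of the three parts contains an underscore, which holds for the examples in the table.

```python
from typing import NamedTuple


class WellId(NamedTuple):
    """Hypothetical container for the three parts of a well_id."""
    experiment_name: str  # e.g. "gene-088"
    plate: int            # plate number within the experiment
    address: str          # well location on the plate, e.g. "Z43"


def parse_well_id(well_id: str) -> WellId:
    """Split a well_id of the form <experiment_name>_<plate>_<address>."""
    # Assumption: the separator is "_" and no part contains an underscore.
    experiment_name, plate, address = well_id.split("_")
    return WellId(experiment_name, int(plate), address)


parsed = parse_well_id("compound-004_1_AA04")
print(parsed.experiment_name, parsed.plate, parsed.address)
```

A malformed identifier raises `ValueError` from the unpacking or the `int` conversion, which is usually what a validation step wants.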


### Metadata Example

To help with understanding the metadata, we have included some samples that show some of the more complex parts of the format, to allow parser testing and validation:

well_id,experiment_name,plate,address,gene,treatment,SMILES,concentration,perturbation_type,cell_type
gene-079_8_H29,gene-079,8,H29,RPLP2,RPLP2_guide_4,,,CRISPR,HUVEC
gene-045_4_AD27,gene-045,4,AD27,RXRX3-43938,RXRX3-43938_guide_6,,,CRISPR,HUVEC
gene-060_9_P28,gene-060,9,P28,EMPTY_control,EMPTY_control,,,CRISPR,HUVEC
compound-001_19_D20,compound-001,19,D20,,Dequalinium,"CC1=[N+](CCCCCCCCCC[N+]2=C(C)C=C(N)C3=CC=CC=C23)C2=CC=CC=C2C(N)=C1 |c:1,13,21,29,31,35,t:16,19,23,27|",0.25,COMPOUND,HUVEC
compound-001_11_U08,compound-001,11,U08,,EMPTY_control,,,COMPOUND,HUVEC
compound-004_43_B08,compound-004,43,B08,,CRISPR_control,,,COMPOUND,HUVEC
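
The sample rows above can serve as a small parser test. The sketch below, assuming `pandas` is the parser in use, reads three of them from an in-memory string and checks two things that are easy to get wrong: the quoted SMILES field contains commas, and `well_id` should equal the experiment name, plate, and address joined by underscores. The variable names are illustrative only.

```python
import io

import pandas as pd

# Header plus three rows copied verbatim from the sample above.
SAMPLE = '''well_id,experiment_name,plate,address,gene,treatment,SMILES,concentration,perturbation_type,cell_type
gene-079_8_H29,gene-079,8,H29,RPLP2,RPLP2_guide_4,,,CRISPR,HUVEC
compound-001_19_D20,compound-001,19,D20,,Dequalinium,"CC1=[N+](CCCCCCCCCC[N+]2=C(C)C=C(N)C3=CC=CC=C23)C2=CC=CC=C2C(N)=C1 |c:1,13,21,29,31,35,t:16,19,23,27|",0.25,COMPOUND,HUVEC
compound-001_11_U08,compound-001,11,U08,,EMPTY_control,,,COMPOUND,HUVEC
'''

df = pd.read_csv(io.StringIO(SAMPLE))

# Empty fields (gene for compounds, SMILES for CRISPR rows) are read as NaN.
# The quoted SMILES keeps its embedded commas as a single field.
rebuilt_well_id = (
    df["experiment_name"] + "_" + df["plate"].astype(str) + "_" + df["address"]
)
print((rebuilt_well_id == df["well_id"]).all())
```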

## Images

The images are hosted on [Hugging Face](https://huggingface.co/datasets/recursionpharma/rxrx3-core/tree/main/data) in the [WebDataset](https://huggingface.co/docs/hub/en/datasets-webdataset#streaming) format.
The image data were compressed from their original format to 512x512 (center crops) 8-bit `jp2` files. The image file names, such as `AA02_s1_3.jp2`, can be read as:

- Well location on the plate (row AA, column 02)
- Site (1)
- Channel (3)

All six channels (`1` - `6`) make up a single image of a given `site`. Note that there is only one site per well address.
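
As an illustration of the naming scheme, the following sketch parses file names of the form `<address>_s<site>_<channel>.jp2` and groups the six channel files belonging to one site. The regular expression and the helper names `parse_image_name` and `group_channels` are assumptions made for this example; they are not part of the dataset tooling.

```python
import re
from collections import defaultdict

# Assumed pattern for names such as "AA02_s1_3.jp2".
FILENAME_RE = re.compile(
    r"^(?P<address>[A-Z]{1,2}\d{2})_s(?P<site>\d+)_(?P<channel>[1-6])\.jp2$"
)


def parse_image_name(name: str) -> tuple[str, int, int]:
    """Return (address, site, channel) for one image file name."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected image file name: {name!r}")
    return match["address"], int(match["site"]), int(match["channel"])


def group_channels(names):
    """Map (address, site) to its file names, ordered by channel number."""
    groups = defaultdict(dict)
    for name in names:
        address, site, channel = parse_image_name(name)
        groups[(address, site)][channel] = name
    return {
        key: [channels[c] for c in sorted(channels)]
        for key, channels in groups.items()
    }


example = [f"AA02_s1_{c}.jp2" for c in (3, 1, 6, 2, 5, 4)]
grouped = group_channels(example)
print(grouped[("AA02", 1)])
```

Ordering by channel number before stacking keeps the channel axis consistent across wells, whatever order the files arrive in.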

Physical resolution: 0.65 micron/pixel.

## Deep Learning Embeddings

The deep learning embeddings computed using [OpenPhenom-S/16](https://huggingface.co/recursionpharma/OpenPhenom) are provided as `OpenPhenom_rxrx3_core_embeddings.parquet` and can be downloaded [from here](https://huggingface.co/datasets/recursionpharma/rxrx3-core/blob/main/OpenPhenom_rxrx3_core_embeddings.parquet) (~532 MB). A notebook computing these embeddings with OpenPhenom-S/16 is provided [here](https://huggingface.co/recursionpharma/OpenPhenom/blob/main/RxRx3-core_inference.ipynb).

Each row in the parquet file has a `well_id` as described in the metadata schema. The remaining 384 columns are the embedding for that respective well.
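
Because both files share `well_id`, the embeddings can be joined to the metadata with a plain merge. The sketch below uses small toy DataFrames in place of the real files so that it runs offline. The 4-dimensional embeddings, the `feature_*` column names, and the `attach_metadata` helper are hypothetical; the real parquet file has 384 embedding columns, whose names are not assumed here.

```python
import numpy as np
import pandas as pd


def attach_metadata(embeddings: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    """Join per-well embeddings to metadata rows on the shared well_id."""
    # validate="one_to_one" raises if a well_id is duplicated on either side.
    return metadata.merge(embeddings, on="well_id", how="inner", validate="one_to_one")


# Toy stand-ins for the real files (hypothetical 4-dim embeddings).
well_ids = ["gene-079_8_H29", "gene-060_9_P28", "compound-001_19_D20"]
rng = np.random.default_rng(0)
toy_embeddings = pd.DataFrame(
    rng.normal(size=(3, 4)), columns=[f"feature_{i}" for i in range(4)]
)
toy_embeddings.insert(0, "well_id", well_ids)
toy_metadata = pd.DataFrame(
    {"well_id": well_ids, "perturbation_type": ["CRISPR", "CRISPR", "COMPOUND"]}
)

merged = attach_metadata(toy_embeddings, toy_metadata)

# Every column other than well_id is treated as an embedding dimension.
feature_cols = [c for c in toy_embeddings.columns if c != "well_id"]
mean_by_type = merged.groupby("perturbation_type")[feature_cols].mean()
print(mean_by_type.shape)
```

Selecting the embedding columns as "everything except `well_id`" mirrors the layout described above and avoids hard-coding column names.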

## Accessing RxRx3-core through Hugging Face API

An example of running inference on this dataset with OpenPhenom-S/16 is provided [here](https://huggingface.co/recursionpharma/OpenPhenom/blob/main/RxRx3-core_inference.ipynb).

Loading the RxRx3-core image dataset:
```
from datasets import load_dataset
rxrx3_core = load_dataset("recursionpharma/rxrx3-core")
```
Loading OpenPhenom-S/16 embeddings and metadata for RxRx3-core:
```
from huggingface_hub import hf_hub_download
import pandas as pd

# Download (and locally cache) the metadata CSV and the embeddings parquet from the dataset repo
file_path_metadata = hf_hub_download("recursionpharma/rxrx3-core", filename="metadata_rxrx3_core.csv", repo_type="dataset")
file_path_embs = hf_hub_download("recursionpharma/rxrx3-core", filename="OpenPhenom_rxrx3_core_embeddings.parquet", repo_type="dataset")

# Load both files into DataFrames; rows can be matched on `well_id`
open_phenom_embeddings = pd.read_parquet(file_path_embs)
rxrx3_core_metadata = pd.read_csv(file_path_metadata)
```
Benchmarking code for this dataset is provided in the [EFAAR benchmarking repo](https://github.com/recursionpharma/EFAAR_benchmarking/tree/trunk).



## Changelog
- Nov 2024: initial release

## License

This work is licensed under <a rel="license" href="https://rxrx3.rxrx.ai/static/">Recursion Non-Commercial End User License Agreement</a>

[github]: https://github.com/recursionpharma/rxrx-datasets/
[rxrx]: https://rxrx.ai
[rxrx3]: https://rxrx.ai/rxrx3
[recursion]: https://recursion.com
[download]: https://rxrx3.rxrx.ai/downloads