
ChenDelong1999/subobjects


Delong Chen (陈德龙), Samuel Cahyawijaya, Jianfeng Liu (刘剑锋),

Baoyuan Wang (王宝元), Pascale Fung

Meta FAIR Paris, Hong Kong University of Science and Technology, Xiaobing.AI

teaser

Updates

  • 2025/07/04: Our paper was accepted to ICML 2025. We released a notebook for EPOC token segmentation.

  • 2025/03/12 (arXiv v3): We introduce a lightweight 🤗DirectSAM-b0 (only 3.7M parameters) and combine it with the Watershed algorithm to derive the Efficient and PanOptiC (EPOC) tokenizer (EPOC = DirectSAM + Watershed). We provide both 🤗intrinsic evaluations and extensive VLM experiments to demonstrate the advantages of adaptive image tokenization.

  • 2024/04/24 (arXiv v2): We updated our paper with the Direct Segment Anything Model (DirectSAM), which efficiently generates comprehensive subobject segmentations with a single forward pass! Check out our 🎬 demo video on YouTube or bilibili. The pretrained DirectSAM model is released on HuggingFace: 🤗DirectSAM-1800px-0424, and the training code is also available in this repo.

  • 2024/02/23 (arXiv v1): Our paper is featured in AK's 🤗Huggingface Daily Papers.

Visualizations

compare segmentations

DirectSAM visualizations

DirectSAM Inference

  • Clone the repository

    git clone https://github.com/ChenDelong1999/subobjects.git
    cd subobjects
  • Install dependencies

    conda create -n subobjects python=3.11 -y
    conda activate subobjects
    pip install -r requirements.txt
  • Run DirectSAM on an example image

    import requests
    from PIL import Image
    from transformers import AutoModelForSemanticSegmentation, AutoImageProcessor
    from utils import inference_single_image, visualize_direct_sam_result
    
    checkpoint = "chendelong/DirectSAM-1800px-0424"
    
    image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True)
    model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint).to('cuda').eval()
    
    url = "http://images.cocodataset.org/val2017/000000002149.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    
    probs = inference_single_image(image, image_processor, model, resolution=None, pyramid_layers=0)
    visualize_direct_sam_result(probs, image, threshold=0.25)

Here, probs holds the predicted boundary probabilities for the image: an ndarray of shape (height, width) with values between 0 and 1. The visualize_direct_sam_result function displays visualizations using matplotlib, and threshold controls how the boundary probabilities are binarized.
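To make the role of threshold concrete, here is a minimal sketch of how a boundary probability map could be binarized and the remaining non-boundary pixels grouped into segments. It is not the repository's visualize_direct_sam_result implementation: the synthetic probs array and the binarize and label_regions helpers are hypothetical, introduced only for illustration.

```python
from collections import deque

import numpy as np


def binarize(probs: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean boundary map: True where the probability exceeds threshold."""
    return probs > threshold


def label_regions(boundary: np.ndarray) -> tuple[np.ndarray, int]:
    """Label 4-connected non-boundary regions with a BFS flood fill.

    Hypothetical helper for illustration. Boundary pixels keep label 0;
    regions are numbered from 1.
    """
    height, width = boundary.shape
    labels = np.zeros((height, width), dtype=np.int32)
    count = 0
    for y in range(height):
        for x in range(width):
            if boundary[y, x] or labels[y, x] != 0:
                continue  # boundary pixel, or already assigned to a region
            count += 1
            labels[y, x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and not boundary[ny, nx] and labels[ny, nx] == 0:
                        labels[ny, nx] = count
                        queue.append((ny, nx))
    return labels, count


# Synthetic stand-in for DirectSAM output (assumption, not real model output):
# a strong vertical boundary and a weak horizontal boundary on the left half.
probs = np.zeros((8, 8), dtype=np.float32)
probs[:, 4] = 0.9
probs[3, :4] = 0.1

_, n_high = label_regions(binarize(probs, threshold=0.25))
_, n_low = label_regions(binarize(probs, threshold=0.03))
print(n_high, n_low)  # 2 3
```

In this toy example, lowering the threshold keeps the weak boundary as well, so the left half splits into two segments. In general, a lower threshold yields more and smaller segments.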

Segmentation quality can be improved by increasing the input resolution and the number of pyramid layers. The two groups of figures above were generated using resolution=3600, pyramid_layers=1/pyramid_layers=2, and threshold=0.03.

Running the model in half precision with model.half() can speed up inference and reduce GPU memory usage.
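As a rough back-of-the-envelope illustration of the memory effect, the sketch below estimates weight storage only, in 32-bit versus 16-bit floating point. It uses the 3.7M parameter count quoted above for DirectSAM-b0. The weight_memory_mb helper is hypothetical, and activation memory, which grows with input resolution, is not included.

```python
def weight_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Estimate memory used by model weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Parameter count quoted for DirectSAM-b0 in the Updates section.
num_params = 3_700_000

fp32_mb = weight_memory_mb(num_params, bytes_per_param=4)  # float32: 4 bytes per weight
fp16_mb = weight_memory_mb(num_params, bytes_per_param=2)  # float16: 2 bytes per weight

print(f"float32 weights: {fp32_mb:.1f} MB")
print(f"float16 weights: {fp16_mb:.1f} MB")
```

Half precision halves the weight footprint. Actual savings at inference time depend mostly on activation sizes at the chosen resolution.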

DirectSAM Training

We provide an example script to fine-tune DirectSAM on the ADE20K dataset. The implementation is based on the 🤗 HuggingFace Trainer; please see this blog for a detailed tutorial.

The following command starts distributed training on four GPUs with 512x512 input resolution and half precision, which takes around 9GB of memory per GPU.

cd DirectSAM
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 trainer.py

The following figures compare the segmentation results of DirectSAM before and after the above fine-tuning on ADE20K.

DirectSAM finetuning

Acknowledgements

Check out these amazing follow-up works that use our model:

If you find our work useful, please consider citing:

@article{chen2024subobject,
  author       = {Delong Chen and
                  Samuel Cahyawijaya and
                  Jianfeng Liu and
                  Baoyuan Wang and
                  Pascale Fung},
  title        = {Subobject-level Image Tokenization},
  journal      = {CoRR},
  volume       = {abs/2402.14327},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2402.14327},
  doi          = {10.48550/ARXIV.2402.14327},
  eprinttype    = {arXiv},
  eprint       = {2402.14327}
}

DirectSAM qingming

This repository is not released by Meta. The code and models are for research purposes only.

About

Official repository of paper "Subobject-level Image Tokenization" (ICML-25)
