# Text-Guided Object Removal using SAM2 + CLIP + Stable Diffusion XL

Final Project for COMPSCI372: Intro to Applied Machine Learning @ Duke University (Fall 2025)
TextEraser is a multimodal AI pipeline that allows users to surgically remove objects from images using natural language (e.g., "remove the bottle"). It integrates YOLO-World for open-vocabulary detection, SAM 2 for precise segmentation, CLIP for semantic verification, and SDXL Inpainting for background synthesis, effectively automating the "detect-mask-inpaint" workflow without manual intervention.
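At a high level, the pipeline chains four stages: detect candidate objects from the text prompt, verify the matches semantically, segment the best match, and inpaint the masked region. The sketch below shows only the orchestration logic; the `detect`, `verify`, `segment`, and `inpaint` helpers are hypothetical stand-ins (stubbed with dummy data here) for the actual YOLO-World, CLIP, SAM 2, and SDXL calls.

```python
# Hypothetical sketch of the detect -> verify -> mask -> inpaint flow.
# The four stage functions are stubs standing in for YOLO-World, CLIP,
# SAM 2, and SDXL; only the control flow is illustrated.

def detect(image, prompt):
    # YOLO-World would return candidate boxes for the open-vocabulary prompt.
    return [{"box": (10, 10, 50, 50), "score": 0.9}]

def verify(image, box, prompt):
    # CLIP would score how well the cropped box matches the text prompt.
    return 0.8

def segment(image, box):
    # SAM 2 would return a pixel-precise mask for the box prompt.
    return {"mask": f"binary mask for {box}"}

def inpaint(image, mask):
    # SDXL inpainting would synthesize background texture inside the mask.
    return {"edited": image, "mask": mask}

def remove_object(image, prompt, clip_threshold=0.25):
    detections = detect(image, prompt)
    # Keep only boxes that CLIP agrees actually match the prompt.
    verified = [d for d in detections
                if verify(image, d["box"], prompt) >= clip_threshold]
    if not verified:
        return image  # nothing matched; leave the image untouched
    best = max(verified, key=lambda d: d["score"])
    mask = segment(image, best["box"])["mask"]
    return inpaint(image, mask)["edited"]
```

The CLIP verification step is what prevents the pipeline from erasing a false-positive detection: a box that scores below the threshold is simply skipped.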
## The Problem: Destructive Regeneration

Current generative models (such as DALL-E or the built-in editors in ChatGPT) often suffer from "destructive regeneration": when asked to modify a small part of an image, they regenerate the entire scene from scratch, altering the lighting, composition, and details of the original photo.
## Our Solution: Non-Destructive Editing

TextEraser addresses this by strictly isolating the edit. Unlike standard generation:
- Precision: It uses segmentation (SAM 2) to mask the object along its exact edges, not just a bounding box.
- Preservation: It freezes the rest of the image, ensuring 100% of the non-target pixels remain identical to the original.
- Context: It uses inpainting to fill only the masked region with background texture consistent with the surrounding scene.
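The preservation guarantee comes down to a single compositing step: the inpainted result only replaces pixels inside the mask, and everything else is copied bit-for-bit from the input. A minimal NumPy sketch (function name is illustrative, not from the project code):

```python
import numpy as np

def composite(original, inpainted, mask):
    """Take inpainted pixels inside the mask and original pixels
    everywhere else, so every non-target pixel is bit-for-bit
    identical to the input image."""
    mask3 = mask.astype(bool)[..., None]  # (H, W, 1), broadcasts over RGB
    return np.where(mask3, inpainted, original)

# Tiny demo on a 4x4 RGB image with a 2x2 masked region.
original = np.zeros((4, 4, 3), dtype=np.uint8)
inpainted = np.full((4, 4, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
out = composite(original, inpainted, mask)
```

Because the composite is a hard `where` rather than a blend, pixels outside the mask cannot drift, no matter what the diffusion model produces.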
- Demo Video
- Technical Walkthrough
- Hugging Face Space: Try TextEraser Online
Note: Due to hardware limitations on the hosting platform (free tier GPU), the online demo currently supports Segmentation Only (Detection + Masking). To experience the full pipeline with SDXL Inpainting, please run the project locally.
Please refer to SETUP.md for detailed installation instructions.
## Results

We evaluated the pipeline by comparing it against standard image-guided generation methods.
Observation: TextEraser (SAM2 + SDXL) successfully generates plausible background textures where traditional methods leave blur artifacts.
| Input Image | Instruct-Pix2Pix | CosXL | TextEraser (Ours) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
(Note: See the `examples/` folder for full-resolution images.)
## Design Decisions

- SAM 2 vs. YOLO (Segmentation): While standard YOLO segmentation models are efficient, they are typically limited to a fixed set of 80 classes (the COCO dataset). We selected SAM 2 because it is class-agnostic: it can generate precise masks for any object detected by our open-vocabulary pipeline, enabling true "zero-shot" removal of arbitrary object types.
- SDXL vs. Other Generative Models: We prioritized Stable Diffusion XL (SDXL) over older diffusion models (like SD 1.5) or GAN-based inpainters (like LaMa). SDXL natively supports higher resolutions (1024x1024) and demonstrates superior semantic understanding, allowing it to hallucinate realistic textures (e.g., brick patterns, foliage) where other models often produce blurry or repetitive artifacts.
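Because SDXL's inpainting checkpoint works best at roughly 1024x1024, a common pattern (an assumption here, not necessarily the project's exact code) is to resize the image and mask up for the diffusion step and then scale the result back to the original resolution:

```python
# Hypothetical resolution-handling helpers around an SDXL inpainting call.
from PIL import Image

SDXL_SIZE = 1024  # SDXL's native working resolution

def prepare_for_sdxl(image: Image.Image, mask: Image.Image):
    """Resize image and mask to SDXL's native resolution.
    NEAREST keeps the mask binary instead of introducing gray edges."""
    img_hi = image.resize((SDXL_SIZE, SDXL_SIZE), Image.LANCZOS)
    mask_hi = mask.resize((SDXL_SIZE, SDXL_SIZE), Image.NEAREST)
    return img_hi, mask_hi

def paste_back(result_hi: Image.Image, original_size):
    """Downscale the 1024x1024 inpainting result to the original size."""
    return result_hi.resize(original_size, Image.LANCZOS)

# Demo with blank placeholders in place of a real photo and mask.
img = Image.new("RGB", (640, 480))
msk = Image.new("L", (640, 480))
img_hi, mask_hi = prepare_for_sdxl(img, msk)
result = paste_back(img_hi, img.size)
```

Using a nearest-neighbor filter for the mask matters: a smooth filter would produce soft gray borders that the inpainter treats as partially editable, causing halos around the removed object.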
## Solo Project

This project was designed, implemented, and documented entirely by Xuting Zhang. All code, including the integration of the detection-segmentation-inpainting pipeline, is original work.