Avatar-Product Interaction Image Generation

Introduction

This repository contains a pipeline designed to automatically generate realistic images of an avatar interacting with a product. Given an input product image and a portrait of a person, the solution synthesizes high-quality images where the avatar naturally uses or showcases the product.

A gallery of examples is available in the assets/demo folder.

Pipeline Overview

The implemented pipeline leverages open-source models and operates in two sequential stages.

Stage 1: Identity-Conditioned Image Generation

This stage generates a base image of the avatar while preserving identity features. The model takes an ID image of the person and a text prompt describing the desired appearance, pose, or context. It synthesizes a high-quality image where the avatar closely resembles the provided ID while adhering to the textual description.

This stage is built on the InfiniteYou framework, which uses Diffusion Transformers (DiTs), here FLUX, for flexible, high-fidelity, identity-preserving image generation. It requires a reference identity image and a descriptive text prompt; tailor the prompt so that the avatar showcases or interacts with the intended product.

Guidelines for Optimal Results:

Prompting: Write prompts that clearly specify how the avatar should showcase, hold, or wear the product. The first attempt usually produces a good image; if the product is not distinctly visible, make the prompt more detailed and explicitly mention the product's position or how the avatar interacts with it.
Adjustments: Typically, significant adjustments are unnecessary. However, if the generated image does not align closely with the provided prompt, try increasing --infusenet_guidance_start slightly (e.g., set to 0.1). If results are still unsatisfactory, slightly decrease --infusenet_conditioning_scale (e.g., set to 0.9).

Example command:

python -m scripts.generate_id_image --id_image_path data/avatars/1.png --prompt 'A young woman wearing a t-shirt on a monotone background, 4K, high quality, photorealistic' --output_image_path 'results/stage_one/1_t-shirt.png' --optimize_vram
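The escalation described under Adjustments can be scripted on top of the example command above. The following is a minimal sketch, not part of the repository: it only builds argument lists, and it assumes that both flags accept a value in the usual `--flag value` form. The numeric values are the examples given in the guidelines.

```python
import shlex

PROMPT = (
    "A young woman wearing a t-shirt on a monotone background, "
    "4K, high quality, photorealistic"
)

# Base Stage 1 command, copied from the example command.
BASE_COMMAND = [
    "python", "-m", "scripts.generate_id_image",
    "--id_image_path", "data/avatars/1.png",
    "--prompt", PROMPT,
    "--output_image_path", "results/stage_one/1_t-shirt.png",
    "--optimize_vram",
]

# Extra flags to try, in order: defaults first, then a later guidance start,
# then additionally a lower conditioning scale.
# Assumption: the flags are passed as "--flag value".
ADJUSTMENTS = [
    [],
    ["--infusenet_guidance_start", "0.1"],
    ["--infusenet_guidance_start", "0.1",
     "--infusenet_conditioning_scale", "0.9"],
]


def candidate_commands(base=BASE_COMMAND, adjustments=ADJUSTMENTS):
    """Return one full argument list per adjustment step."""
    return [list(base) + list(extra) for extra in adjustments]


if __name__ == "__main__":
    # Print shell-quoted commands; run the next one only if the previous
    # result does not follow the prompt closely enough.
    for argv in candidate_commands():
        print(shlex.join(argv))
```

Inspecting the output after each step, rather than running all three, keeps the adjustments as small as the guidelines recommend.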

Stage 2: Product-Integrated Image Editing

In this stage, the avatar image generated in Stage 1 is edited to incorporate the target product.

This stage follows Diptych Prompting, the approach introduced in the paper "Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator". The method reframes subject-driven image generation as an inpainting task, using a large-scale text-to-image model for precise, zero-shot integration and alignment of the subject (here, the product).

This stage receives the base avatar image (output from Stage 1), a reference product image, the product name, and a descriptive text prompt. Leveraging the product name, the pipeline utilizes Grounding DINO and Segment Anything (SAM) to detect and segment the product in both the reference product image and the base avatar image. Next, a diptych image is created. The left panel contains the reference product image, while the right panel holds the base avatar image with a masked region indicating the area for inpainting. FLUX, equipped with a ControlNet module, then performs text-conditioned inpainting on the masked area in the right panel, referencing the product from the left panel.
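To make the diptych layout concrete, here is a minimal sketch of the inpainting mask for a two-panel image. It is an illustration only, not the repository's code: the function name is hypothetical, both panels are assumed to have the same size, and the mask is a plain nested list so the example needs no image library.

```python
def build_diptych_mask(panel_width, panel_height, box):
    """Return a 2D list (rows of 0/1) covering a two-panel diptych.

    The left panel (reference product) is never inpainted, so it is all 0.
    In the right panel (base avatar image), pixels inside `box` are 1.
    `box` is (left, top, right, bottom) in right-panel coordinates,
    with right and bottom exclusive.
    """
    left, top, right, bottom = box
    inside = (0 <= left < right <= panel_width
              and 0 <= top < bottom <= panel_height)
    if not inside:
        raise ValueError("box must lie inside the right panel")

    mask = []
    for y in range(panel_height):
        row = [0] * (2 * panel_width)  # left panel + right panel
        if top <= y < bottom:
            for x in range(left, right):
                row[panel_width + x] = 1  # shift into the right panel
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in build_diptych_mask(4, 3, (1, 1, 3, 2)):
        print("".join(str(v) for v in row))
```

Only the marked region of the right panel is regenerated; the left panel stays untouched, which is what lets the model reference the product while inpainting.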

Guidelines for Optimal Results:

If the integrated product's appearance differs significantly from the reference image, consider increasing the attn_enforce parameter slightly (e.g., to 1.1) to reinforce product alignment. Note that any value other than 1.0 increases memory usage, because the flash attention implementation cannot be used and a custom attention implementation takes its place.
If insufficient space is available in the base avatar image to incorporate the product naturally, increase the context_px parameter to provide more area for inpainting the product seamlessly.
By default, avatar identity preservation is enforced by preventing inpainting over facial regions. However, for certain products like glasses, you can disable this restriction using the --mask_face flag.
Generating several images with different random seeds and selecting the best result often helps.
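How context_px is applied is not spelled out here. One plausible reading, shown below purely as an assumption, is that the detected product box is padded by that many pixels on every side and clamped to the image bounds before it becomes the inpainting region. The function name is hypothetical.

```python
def expand_box(box, context_px, image_width, image_height):
    """Pad a (left, top, right, bottom) box by context_px on every side.

    Assumption: this is only one possible interpretation of context_px;
    the result is clamped so it never leaves the image.
    """
    left, top, right, bottom = box
    return (
        max(0, left - context_px),
        max(0, top - context_px),
        min(image_width, right + context_px),
        min(image_height, bottom + context_px),
    )


if __name__ == "__main__":
    print(expand_box((10, 10, 20, 20), 5, 100, 100))
```

Under this reading, a larger value gives the model more room around the product, at the cost of regenerating more of the base avatar image.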

Example command:

PYTHONPATH='thirdparty/DiptychPrompting/' python -m scripts.integrate_product --product_image_path data/products/t-shirt-2.jpeg --id_image_path results/stage_one/1_t-shirt.png --product_name 't-shirt' --target_prompt 'a woman wearing a t-shirt' --output_image_path 'results/stage_two/1_t-shirt.png' --optimize_vram
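The two example commands chain together: the Stage 1 output path is passed to Stage 2 as --id_image_path. The sketch below makes that hand-off explicit. It is a hypothetical wrapper, not part of the repository, and uses only the flags that appear in the example commands; setting PYTHONPATH for both stages is a simplification.

```python
import os
import shlex
import subprocess

DIPTYCH_PATH = "thirdparty/DiptychPrompting/"  # PYTHONPATH from the Stage 2 example


def stage_one_command(id_image, prompt, base_image):
    """Stage 1: generate the base avatar image at `base_image`."""
    return [
        "python", "-m", "scripts.generate_id_image",
        "--id_image_path", id_image,
        "--prompt", prompt,
        "--output_image_path", base_image,
        "--optimize_vram",
    ]


def stage_two_command(product_image, base_image, product_name,
                      target_prompt, output_image):
    """Stage 2: --id_image_path receives the Stage 1 output image."""
    return [
        "python", "-m", "scripts.integrate_product",
        "--product_image_path", product_image,
        "--id_image_path", base_image,
        "--product_name", product_name,
        "--target_prompt", target_prompt,
        "--output_image_path", output_image,
        "--optimize_vram",
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute them in order unless dry_run is set."""
    # Simplification: this replaces any existing PYTHONPATH for both stages.
    env = dict(os.environ, PYTHONPATH=DIPTYCH_PATH)
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True, env=env)


if __name__ == "__main__":
    base = "results/stage_one/1_t-shirt.png"
    run_pipeline([
        stage_one_command(
            "data/avatars/1.png",
            "A young woman wearing a t-shirt on a monotone background, "
            "4K, high quality, photorealistic",
            base,
        ),
        stage_two_command(
            "data/products/t-shirt-2.jpeg", base, "t-shirt",
            "a woman wearing a t-shirt", "results/stage_two/1_t-shirt.png",
        ),
    ])
```

With dry_run left at its default the script only prints the commands, which is a cheap way to check the paths before committing GPU time.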

Limitations

Product Appearance Variations: The generated product may not be an exact replica of the reference image. For instance, subtle details — such as logos, textures, or small design elements — might differ, as the pipeline relies on zero-shot subject-driven image generation rather than direct copy-pasting of the product. Training the model with additional subject-driven adaptation methods or integrating more explicit structure-preserving techniques could improve product fidelity.
Challenges with Face-Worn Products: Items like glasses, which require precise integration with facial features, pose a challenge. Since the second stage does not have direct access to the original ID image, inpainting over the face can alter the avatar's identity. This is an inherent limitation of the pipeline's current structure.

Examples of these limitations:

The Baby Yoda design in the generated image may differ slightly from the original product.
The hat in the generated image differs in design from the original product.
Intricate patterns on a t-shirt in the generated image might look slightly different from the original product.
The sunglasses in the generated image show slight variations from the original product.

Further Application: Video Generation

The images generated by this pipeline can serve as inputs for subsequent video generation, enabling the creation of dynamic avatar-product interaction sequences. While video generation is not the focus of this repository, this pipeline provides a strong foundation by ensuring high-quality identity-preserved static images.

As an example, here is a short video generated using MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance, where the input image was generated by this pipeline.

Note: In this example, the identity of the person in the video is poorly preserved, highlighting a common challenge in motion-based generation. However, this demonstrates how the current pipeline can serve as a first step for future video-based applications, though additional techniques may be required to maintain identity consistency.

Alternative Approaches Considered

During the development of this pipeline, multiple open-source models and frameworks were tested. Below are three of the most notable alternatives considered:

PhotoMaker

PhotoMaker is a tuning-free text-to-image model for personalized generation. It takes a few images of a person and a text prompt, and uses a pretrained identity encoder and LoRA on top of the RealVisXL_V4.0 checkpoint (Stable Diffusion XL). It natively supports IP-Adapter, T2I-Adapter, ControlNet, and LoRAs.

During initial testing, image quality and identity preservation were reasonable, but because it is based on Stable Diffusion XL, it performed worse than FLUX, which was ultimately selected for this pipeline. Various conditioning techniques, such as ControlNet keypoints and IP-Adapter, were also explored to improve control over generation, but the results remained unsatisfactory.

InteractDiffusion

InteractDiffusion is a model that operates on a triplet label (person, interaction, subject). The user additionally provides bounding boxes for the person and subject, and the framework infers the interaction region. It offers good control over scene composition, making it useful for structured interactions. However, it lacks out-of-the-box mechanisms for identity or subject conditioning and, like PhotoMaker, it is based on Stable Diffusion XL.

OminiControl

OminiControl is a flexible control framework built specifically for Diffusion Transformer (DiT) models, offering support for both subject-driven control and spatial control. Since it is based on FLUX, it inherits the advantages of high-quality generation.

During initial testing, subject preservation was decent for simple objects such as plain t-shirts (without prints) and basic mugs (without text).

However, for more complex objects, such as t-shirts with patterns or phones, the details were poorly preserved, making it unsuitable for the required product integration task.

Installation

First, clone the submodules:

make sync-submodules

Next, there are two options for setting up the environment:

Option 1: Docker (Recommended)

make docker-build
make docker-run

Option 2: Conda Environment

conda create -n avatar_product python=3.10
conda activate avatar_product

apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
pip install -r requirements.txt

Login to Hugging Face

Finally, log in to Hugging Face so that the models can be downloaded:

huggingface-cli login --token $HF_TOKEN

Memory Requirements

The pipeline requires significant GPU memory, particularly for the FLUX model: 40 GB of VRAM with the --optimize_vram flag enabled. If you do not have enough VRAM, you can reduce the image height and width.
