Sweet data-centric foundation model fine-tuning
Stable Diffusion is a text-to-image model released by Stability AI. It is based on the latent diffusion architecture from Robin Rombach et al., an efficient diffusion model that operates in a latent space. It consists of three parts: a text encoder that encodes the prompt into a latent vector, an autoencoder that projects the input image into a lower-resolution latent space (and reconstructs the original image from it), and a U-Net that drives the diffusion process in that latent space.
The latent diffusion model was trained on a large dataset of text-image pairs for text-conditioned image generation. The model is available on the Hugging Face model hub (SD v1.5, SD v2.0, SD v2.1).
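If you just want to generate a few images with one of these checkpoints, a minimal sketch with the diffusers library could look like this (the model ID and prompt are example values, not part of this pipeline):

```python
# Sketch: text-to-image generation with a Stable Diffusion checkpoint via diffusers.
# The model ID and prompt are example values.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # SD v1.5 checkpoint on the Hugging Face Hub
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a cozy mid-century living room, soft natural light").images[0]
image.save("generated_interior.png")
```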
Since Stable Diffusion is a text-to-image model, it is trained on text-image pairs with matching content. These pairs can be found on the internet by collecting image captions, since captions should describe the content of the image. Another way of getting these pairs is human annotation, which can produce very high-quality data, but is also expensive and time-consuming. Luckily, researchers and organisations have spent plenty of time on the domain of image captioning and have created some great datasets and models, such as BLIP by Salesforce, GIT by Microsoft, and many more. These models can be used to generate captions for images, which can then be used to create text-image pairs for your own multi-modal dataset!
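As an illustration, here is a hedged sketch of captioning a single image with BLIP via the transformers library (the model ID and image path are example values):

```python
# Sketch: generate a caption for one image with BLIP.
# The model ID and image path are example values.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("interior.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```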
When building your dataset, the images are the main component, since they are the starting point for getting text-image pairs. One option is to start from a ready-to-go dataset, either your own private dataset or a public one (an example of loading one follows the list below).
Where to find a ready-to-go image dataset:
- https://huggingface.co/docs/datasets/index
- https://pytorch.org/vision/stable/datasets.html
- https://www.kaggle.com/datasets
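For example, loading a public image dataset from the Hugging Face Hub only takes a couple of lines; the dataset ID below is just an example:

```python
# Sketch: load a ready-to-go image dataset from the Hugging Face Hub.
# "lambdalabs/pokemon-blip-captions" is only an example dataset ID.
from datasets import load_dataset

dataset = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
print(dataset.column_names)   # e.g. ["image", "text"]
print(dataset[0]["image"])    # a PIL image
```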
However, if you want more specific data, this is not always possible. Luckily, LAION has invested a lot of brain power and resources into open-sourcing some great tools and data, such as LAION-5B and clip-retrieval. They built the LAION-5B dataset by scraping and filtering Common Crawl in a smart way (using CLIP and filters) and compiled it into a FAISS semantic search index. This index can be used to retrieve images based on visual or textual input, which results in an incredibly powerful and efficient way of getting images for your dataset.
To explore the LAION-5B dataset, you can use the clip-retrieval frontend website.
For retrieving images, you need to have a small set of textual descriptions or example images. The LAION-5B dataset will then retrieve the URLs of the most similar images based on the CLIP embeddings of the input. These URLs can then be used to download the actual images.
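A hedged sketch of querying the hosted LAION-5B index with the clip-retrieval client follows; the backend URL, index name, and query text are example values and may differ from what this pipeline uses:

```python
# Sketch: retrieve image URLs from the LAION-5B index with clip-retrieval.
# The backend URL, index name, and query text are example values.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    num_images=20,
)

results = client.query(text="comfortable scandinavian living room")
for result in results[:5]:
    print(result["url"], "-", result["caption"])
```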
If you want to test out Stable Diffusion, you can use the following demos:
The image below shows the entire pipeline and its workflow. Note that this workflow is currently adapted to the interior design domain, but can be easily adapted to other domains by changing the prompt generation component.
There are five components in total:
- Load from hub: This component loads a dataset from the Hugging Face dataset hub. It takes in a Hugging Face dataset ID, so it can use any Hugging Face dataset (or a custom one).
- Image embedding: This component takes in the images of the dataset and returns the CLIP embeddings of these images. These embeddings can be used for filtering the dataset or for retrieving similar images from the LAION-5B dataset (a minimal sketch of this step follows the list).
- Image URL retrieval: This component retrieves images from the LAION-5B dataset based on the seed prompts. The retrieval itself is done based on CLIP embedding similarity between the prompt embedding and the visual embeddings in the LAION dataset. This component doesn't return the actual images yet, only the URLs; the next component in the pipeline downloads them.
- Download images: This component downloads the actual images based on the URLs retrieved by the previous component. It takes in the URLs as input and returns the actual images, along with some metadata (such as their height and width).
- Add captions: This component captions all images using BLIP. The model takes in an image and generates a caption that describes its content. The component takes in a Hugging Face model ID, so it can use any captioning model from the Hugging Face Hub.
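The image embedding step boils down to running each image through a CLIP vision encoder. A minimal sketch with the transformers library (the model ID is an example; the actual component may use a different checkpoint):

```python
# Sketch: compute a CLIP embedding for one image, as the image embedding
# component does for the whole dataset. The model ID is an example value.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("interior.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
embedding = model.get_image_features(**inputs)  # shape: (1, 768) for ViT-L/14
print(embedding.shape)
```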
To run the pipeline, first install Fondant with the pipelines extra:
pip install fondant[pipelines]
The pipeline_configs.py file contains a data class that stores two general configuration parameters for the pipeline:
- BASE_PATH: The base path used to store the pipeline artifacts
- HOST: The Kubeflow Pipelines host URL
Both of these need to be set to suitable values.
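A sketch of what such a data class could look like (the class name and the example values are assumptions; only the two parameter names come from the project, so check pipeline_configs.py for the actual definition):

```python
# Sketch of the configuration data class in pipeline_configs.py.
# The class name and example values are assumptions; only the two fields
# (BASE_PATH and HOST) come from the actual file.
from dataclasses import dataclass


@dataclass
class PipelineConfigs:
    BASE_PATH: str = "gs://my-bucket/fondant-artifacts"   # where pipeline artifacts are stored
    HOST: str = "https://<my-kubeflow-host>/pipeline"     # Kubeflow Pipelines host URL
```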
Running the pipeline then consists of two steps:
- Building the images for each of the pipeline components
bash build_images.sh --namespace <NAMESPACE> --repo <REPO> -C all
For help with the build_images.sh script, run:
bash build_images.sh --help
- Running the pipeline:
python pipeline.py
You can reuse this pipeline by adapting the following components:
- Load from hub: This component loads a dataset from the Hugging Face dataset hub. You can change this dataset to any other compatible dataset on the Hugging Face dataset hub. If the image column of the dataset is called something other than image, you can adapt this naming in the code of the Load from hub component (see the sketch below).
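If you simply want to check what the image column of your own dataset is called, or rename it before pushing the dataset to the hub, the datasets library can help; the dataset ID and column names below are example values, and this is an alternative to editing the component code rather than part of it:

```python
# Sketch: inspect and rename the image column of a Hugging Face dataset.
# The dataset ID and column names are example values.
from datasets import load_dataset

dataset = load_dataset("my-org/my-interior-dataset", split="train")
print(dataset.column_names)  # e.g. ["img", "text"]

dataset = dataset.rename_column("img", "image")
```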
Feel free to swap out components of the pipeline with our other components, try building your own custom filtering component, or use a different captioning model than the one in our example! Let us know if you have any questions or feedback; we are happy to help!