Add support for image upload for image generation RLHF purposes and image2text captioning #3453
-
A few references. Visual ChatGPT: https://github.com/microsoft/visual-chatgpt

Dumping a few relevant papers:
- Prompt-to-Prompt Image Editing with Cross Attention Control
- InstructPix2Pix: Learning to Follow Image Editing Instructions
- Generalized Decoding for Pixel, Image, and Language / Instruct-X-Decoder
- Null-text Inversion for Editing Real Images using Guided Diffusion Models
- Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles / Taming Stable Diffusion with Human Ranking Feedback
- Aligning Text-to-Image Models using Human Feedback
- GLIGEN: Open-Set Grounded Text-to-Image Generation

Plans from CarperAI for "Instruct Diffusion":
-
Workflow from Microsoft's Visual ChatGPT:
Models they used:
New approach: X-Decoder
-
Datasets for image captioning, image interaction (could be applied to instruction datasets), and OCR datasets (handwriting and image-to-LaTeX): https://github.com/google-research-datasets/wit

Update from a contributor on Discord, on good sources that could be used as instructional datasets:
- Wikimedia Commons
- For multimodal requests in general:
- In future for WebGPT:
- Other, interesting for current Open Assistant:
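As a quick illustration, the WIT dataset linked above can also be streamed directly from the Hugging Face Hub; a minimal sketch, assuming the `wikimedia/wit_base` mirror hosted there (the exact dataset id and field names may differ):

```python
# Minimal sketch: stream a few WIT samples for captioning experiments.
# Assumes the "wikimedia/wit_base" mirror on the Hugging Face Hub; adjust
# the dataset id if the project settles on a different source.
from itertools import islice
from datasets import load_dataset

wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

for sample in islice(wit, 3):
    # Each record pairs an image with Wikipedia-derived caption text.
    print(sample["caption_attribution_description"])
```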
-
ViperGPT, a new framework for visual inference: https://viper.cs.columbia.edu/
-
Hey, https://github.com/P1ayer-1 and I are currently working on a neural interface for quickly accessing tweets and Reddit posts (https://github.com/Simon1V/reddit-scraper/tree/main, to be renamed to NeuralWebInterface or something of that sort; it is initially meant to filter training data to prevent CSAM and the like), but it won't be an issue to harvest tweet data, including images. Maybe this could be exploited here.
-
Greetings,
I know text-based content is the priority right now for the ongoing dataset collection, but I strongly believe in the importance of maintaining an instruction-based image generation dataset for RLHF. A separate section could be made on the OA page specifically for that purpose (many users may not be willing to contribute to it at this moment, but given the complexity, I believe this is something we should start right now).
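To make this concrete, a record in such a dataset could start out as simple as the sketch below; every field name here is hypothetical, not an agreed-upon schema:

```python
# Hypothetical record layout for an instruction-based image generation RLHF
# sample; all field names are illustrative only, not an agreed-upon schema.
example_record = {
    "prompt": "Make a futuristic landscape by Monet",
    "parent_id": None,            # set when the prompt refines an earlier image
    "image_url": "https://example.org/generations/1234.png",
    "rankings": [2, 0, 1],        # human preference order over candidate images
    "labels": {"accuracy": 0.8, "aesthetics": 0.9},
}
```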
I believe many Open Assistant contributors like me are Stable Diffusion enthusiasts and would be willing to participate in such an endeavor.
There are also image-captioning (image-to-text) technologies such as BLIP that could be fine-tuned and further improved with reinforcement learning from human feedback, and I believe the dataset collection environment Open Assistant provides could be useful there to reduce inaccuracies.
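For reference, captioning with BLIP is already only a few lines using the public `transformers` checkpoints; a minimal sketch (the image path is a placeholder):

```python
# Minimal BLIP captioning sketch using the public Salesforce checkpoint;
# the image path is a placeholder.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("family_photo.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```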
Here are some example use cases:
1. Regular image generation
A straightforward process: the user requests an image, the AI provides it, and the user then gives further feedback.
Examples: "Make a futuristic landscape by Monet". "Generate an image that represents this poem". "Provide a very realistic photo of a happy family". "Make an 80s dark fantasy film still of the Avengers".
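For step 1, plain text-to-image generation is already covered by off-the-shelf pipelines; a minimal `diffusers` sketch, assuming one common public checkpoint:

```python
# Minimal text-to-image sketch with diffusers; the checkpoint id is one
# common public choice, not a project decision.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a futuristic landscape in the style of Monet").images[0]
image.save("landscape.png")
```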
2. Image manipulation/modification
Users will criticize the assistant-generated image or request additional changes, and the AI should then provide a new generation with the modifications the user requested. InstructPix2Pix already exists for Stable Diffusion, although its accuracy could be improved (a minimal pipeline sketch follows the examples below).
Examples: users would provide feedback on previously generated images: "This image is not realistic enough!". "Could you make the subject's distance a little longer? It's too close". "I don't like this picture, make me three more and let me decide". "The hands of the subject in this picture have extra fingers, please fix that". "The face in that picture looks ugly, make a better one".
Additionally, if users could upload images themselves, it would be interesting for these scenarios:
"Change the subject's shirt to red". "Transform the photo into a painting by artist X". "Transform the Golden Retriever dog into a Labrador Retriever". "Remove the black cat from the image". "Crop the picture capturing only the cat". "Resize the image by half". "Upscale the image three times; should make use of an upscaling algorithm and not stretch it". "Add a dog to the picture, side by side with the cat". "Colorize this really old picture, and remove scratches; remaster it".
3. Image context analysis
Users will provide an image, then ask something about it. By making use of image-to-text tools (not needed during the dataset creation process), the Assistant will be able to understand the context of the image and discuss it with the user.
Examples: "Describe what is happening in this picture". "Role-play as the person in the image, and talk about yourself". "Tell me the price of this car in the picture". "How many people are there in the picture?".
4. OCR
The Assistant would be able to perform optical character recognition tasks, like "reading" screenshots: extracting their text and then commenting on it as the user requests.
Examples: "Translate this japanese text in the image into english". "Present the data on this screenshot as a CSV-formatted table". "Transform this table into a SQL query"."Tell me what this person meant in this Twitter screenshot". "Provide the OCR transcribed text for this ad".
Technologies are already available to make each of these steps possible, so contributors would not have as hard a time replying as the assistant.
I am aware that the project intends to use APIs or something like Toolformer, but I believe training this kind of feature with human feedback is extremely important. Despite the existence of these tools, the accuracy of their results is hit-or-miss and could be improved considerably with a human-feedback dataset for RLHF.
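On the RLHF side, the human rankings collected this way would typically train a reward model with a pairwise preference loss; a minimal PyTorch sketch, where `reward_model` (images to scalar scores) and the batch layout are hypothetical stand-ins:

```python
# Pairwise preference loss sketch for a reward model over generated images;
# `reward_model` (images -> scalar scores) and the batch layout are
# hypothetical stand-ins, not part of the Open Assistant codebase.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred_images, rejected_images):
    """Bradley-Terry style loss: push the preferred image's score above
    the rejected one's for the same prompt."""
    score_pref = reward_model(preferred_images)  # shape: (batch,)
    score_rej = reward_model(rejected_images)    # shape: (batch,)
    return -F.logsigmoid(score_pref - score_rej).mean()
```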