Add support for image upload for image generation RLHF purposes and image2text captioning #3453
-
A few references. Visual ChatGPT: https://github.com/microsoft/visual-chatgpt

Dumping a few relevant papers:
- Prompt-to-Prompt Image Editing with Cross Attention Control
- InstructPix2Pix: Learning to Follow Image Editing Instructions
- Generalized Decoding for Pixel, Image, and Language / Instruct-X-Decoder
- Null-text Inversion for Editing Real Images using Guided Diffusion Models
- Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles / Taming Stable Diffusion with Human Ranking Feedback
- Aligning Text-to-Image Models using Human Feedback
- GLIGEN: Open-Set Grounded Text-to-Image Generation

Plans from CarperAI for "Instruct Diffusion":
-
Workflow from Microsoft's Visual ChatGPT:
Models they used:
New approach: X-Decoder
-
Datasets for image captioning, image interaction (could be applied to instruction datasets), and OCR datasets (handwriting and image-to-LaTeX): https://github.com/google-research-datasets/wit

Update from a contributor on Discord, on good sources that could be used as instructional datasets:
- Wikimedia Commons
- For multimodal requests in general:
- In future for WebGPT:
- Other, interesting for current Open Assistant:
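As a quick illustration, the WIT dataset linked above can also be streamed directly from the Hugging Face Hub; a minimal sketch, assuming the `wikimedia/wit_base` mirror hosted there (the exact dataset id and field names may differ):

```python
# Minimal sketch: stream a few WIT samples for captioning experiments.
# Assumes the "wikimedia/wit_base" mirror on the Hugging Face Hub; adjust
# the dataset id if the project settles on a different source.
from itertools import islice
from datasets import load_dataset

wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

for sample in islice(wit, 3):
    # Each record pairs an image with Wikipedia-derived caption text.
    print(sample["caption_attribution_description"])
```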
-
ViperGPT, a new framework for visual inference: https://viper.cs.columbia.edu/
-
Hey, https://github.com/P1ayer-1 and I are currently working on a neural interface for quickly accessing tweets and Reddit posts (https://github.com/Simon1V/reddit-scraper/tree/main, to be renamed to NeuralWebInterface or something of that sort; it is initially meant to filter training data to prevent CSAM and the like), but it won't be an issue to harvest tweet data, including images. Maybe this could be exploited here.
-
Greetings,
I know text-based content is the priority right now for the ongoing dataset collection, but I strongly believe in the importance of maintaining an instruction-based image generation dataset for RLHF. A separate section could be made on the OA page specifically for that purpose (many users may not be willing to contribute to it at this moment, but given the complexity, I believe this is something we should start right now).
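To make this concrete, a record in such a dataset could start out as simple as the sketch below; every field name here is hypothetical, not an agreed-upon schema:

```python
# Hypothetical record layout for an instruction-based image generation RLHF
# sample; all field names are illustrative only, not an agreed-upon schema.
example_record = {
    "prompt": "Make a futuristic landscape by Monet",
    "parent_id": None,            # set when the prompt refines an earlier image
    "image_url": "https://example.org/generations/1234.png",
    "rankings": [2, 0, 1],        # human preference order over candidate images
    "labels": {"accuracy": 0.8, "aesthetics": 0.9},
}
```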
I believe many Open Assistant contributors like me are Stable Diffusion enthusiasts and would be willing to participate in such an endeavor.
There are also image-captioning (image-to-text) technologies such as BLIP that could be fine-tuned and further improved with reinforcement learning from human feedback, and I believe the dataset collection environment Open Assistant provides could be useful there to reduce inaccuracies.
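For reference, captioning with BLIP is already only a few lines using the public `transformers` checkpoints; a minimal sketch (the image path is a placeholder):

```python
# Minimal BLIP captioning sketch using the public Salesforce checkpoint;
# the image path is a placeholder.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("family_photo.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```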
Here are some example use cases:
1. Regular image generation
A straightforward process: the user requests an image, the AI provides it, and the user then gives further feedback.
Examples: "Make a futuristic landscape by Monet". "Generate an image that represents this poem". "Provide a very realistic photo of a happy family". "Make an 80s dark fantasy film still of the Avengers".
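For step 1, plain text-to-image generation is already covered by off-the-shelf pipelines; a minimal `diffusers` sketch, assuming one common public checkpoint:

```python
# Minimal text-to-image sketch with diffusers; the checkpoint id is one
# common public choice, not a project decision.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a futuristic landscape in the style of Monet").images[0]
image.save("landscape.png")
```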
2. Image manipulation/modification
Users will criticize the assistant-generated image or request additional changes, and the AI should then provide a new generation with the modifications the user requested. InstructPix2Pix already exists for Stable Diffusion, although its accuracy could be improved (a minimal pipeline sketch follows the examples below).
Examples: users would provide feedback on previously generated images: "This image is not realistic enough!". "Could you make the subject's distance a little longer? It's too close". "I don't like this picture, make me three more and let me decide". "The hands of the subject in this picture have extra fingers, please fix that". "The face in that picture looks ugly, make a better one".
Additionally, if users could upload images themselves, it would be interesting for these scenarios:
"Change the subject's shirt to red". "Transform the photo into a painting by artist X". "Transform the Golden Retriever dog into a Labrador Retriever". "Remove the black cat from the image". "Crop the picture capturing only the cat". "Resize the image by half". "Upscale the image three times; should make use of an upscaling algorithm and not stretch it". "Add a dog to the picture, side by side with the cat". "Colorize this really old picture, and remove scratches; remaster it".
3. Image context analysis
Users will provide an image, then ask something about it. By making use of image-to-text tools (not needed during the dataset creation process), the Assistant will be able to understand the context of the image and discuss it with the user.
Examples: "Describe what is happening in this picture". "Role-play as the person in the image, and talk about yourself". "Tell me the price of this car in the picture". "How many people are there in the picture?".
4. OCR
The Assistant would be able to perform optical character recognition tasks, like "reading" screenshots: extracting their text and then commenting on it as the user requests.
Examples: "Translate this japanese text in the image into english". "Present the data on this screenshot as a CSV-formatted table". "Transform this table into a SQL query"."Tell me what this person meant in this Twitter screenshot". "Provide the OCR transcribed text for this ad".
Technologies are already available to make each of these steps possible, so contributors would not have as hard a time replying as the assistant.
I am aware that the project intends to use APIs or something like Toolformer, but I believe training this kind of feature with human feedback is extremely important. Despite the existence of these tools, the accuracy of their results is hit-or-miss and could be improved considerably with a human-feedback dataset for RLHF.
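On the RLHF side, the human rankings collected this way would typically train a reward model with a pairwise preference loss; a minimal PyTorch sketch, where `reward_model` (images to scalar scores) and the batch layout are hypothetical stand-ins:

```python
# Pairwise preference loss sketch for a reward model over generated images;
# `reward_model` (images -> scalar scores) and the batch layout are
# hypothetical stand-ins, not part of the Open Assistant codebase.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred_images, rejected_images):
    """Bradley-Terry style loss: push the preferred image's score above
    the rejected one's for the same prompt."""
    score_pref = reward_model(preferred_images)  # shape: (batch,)
    score_rej = reward_model(rejected_images)    # shape: (batch,)
    return -F.logsigmoid(score_pref - score_rej).mean()
```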